FAIR² and ML Croissant Integration
Overview
FAIR² (FAIR Squared) builds directly on ML Croissant, extending its capabilities to ensure that datasets are both FAIR (Findable, Accessible, Interoperable, and Reusable) and AI-ready.
ML Croissant is a metadata standard for machine learning datasets developed by MLCommons. FAIR² enhances this foundation by adding:
- SHACL validation for structured dataset metadata
- Support for AI/ML methodologies using MethodSectionShape and MethodStepShape
- Compliance features aligned with Responsible AI through explicit tracking of dataset provenance and usage
This document outlines how FAIR² builds upon and extends ML Croissant to provide machine-actionable metadata for AI workflows.
Required Properties in ML Croissant
| Property | Description | Type | Example |
|---|---|---|---|
| @type | Specifies the type of the dataset (typically Dataset) | sc:Dataset | "@type": "sc:Dataset" |
| name | The title of the dataset | xsd:string | "name": "FAIR AI Benchmark Dataset" |
| description | Explanation of the dataset’s content and purpose | xsd:string | "description": "A dataset for AI fairness" |
| license | Dataset license URI | xsd:anyURI | "license": "https://creativecommons.org/licenses/by/4.0/" |
| url | Landing page or repository URL | xsd:anyURI | "url": "https://example.com/dataset" |
| distribution | Description of downloadable files | Array | Refer to FileObject structure below |
| recordSet | Defines logical data structure | Array | Refer to RecordSet section |
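
As a quick pre-check before full validation, the required properties listed above can be tested with plain Python. This is a minimal sketch: the helper name `missing_required` is hypothetical and is not part of mlcroissant or the FAIR² tooling, and it checks only for presence, not for correct types.

```python
# Hypothetical helper, not part of mlcroissant or the FAIR² tooling.
REQUIRED_DATASET_PROPERTIES = (
    "@type", "name", "description", "license", "url", "distribution", "recordSet",
)


def missing_required(metadata: dict) -> list[str]:
    """Return the required top-level properties that are absent or empty."""
    return [
        key
        for key in REQUIRED_DATASET_PROPERTIES
        if metadata.get(key) in (None, "", [], {})
    ]


example = {
    "@type": "sc:Dataset",
    "name": "FAIR AI Benchmark Dataset",
    "description": "A dataset for AI fairness",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}
print(missing_required(example))  # ['url', 'distribution', 'recordSet']
```

A presence check like this catches only the most basic omissions; the SHACL shapes remain the authoritative definition of what is valid.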
FileObject (Inside distribution)
| Property | Description | Type | Example |
|---|---|---|---|
| @type | Specifies file type | cr:FileObject | "@type": "cr:FileObject" |
| @id | Unique file identifier | xsd:string | "@id": "file1" |
| name | Filename | xsd:string | "name": "data.csv" |
| contentUrl | URL of hosted file | xsd:anyURI | "contentUrl": "https://example.com/data.csv" |
| encodingFormat | File format (e.g., text/csv) | xsd:string | "encodingFormat": "text/csv" |
| sha256 | File checksum | xsd:string | "sha256": "abc123..." |
RecordSet (Inside recordSet)
| Property | Description | Type | Example |
|---|---|---|---|
| @type | Declares a record set | cr:RecordSet | "@type": "cr:RecordSet" |
| name | Name of the record set | xsd:string | "name": "User Data" |
| description | Textual summary of the records | xsd:string | "description": "Demographic information" |
| field | List of field definitions | Array | Refer to Field section |
Field (Inside field)
| Property | Description | Type | Example |
|---|---|---|---|
| @type | Field object | cr:Field | "@type": "cr:Field" |
| name | Field name | xsd:string | "name": "age" |
| description | Field explanation | xsd:string | "description": "Age of participant" |
| dataType | Expected data type | sc:DataType | "dataType": "sc:Integer" |
| source | How the field maps to a file | Object | "source": { "fileObject": { "@id": "file1" } } |
Extensions Provided by FAIR²
| ML Croissant Feature | FAIR² Enhancement |
|---|---|
| Dataset metadata | SHACL validation for compliance and consistency |
| AI methodology support | Structured method tracking for data processing and modeling |
| Schema.org compatibility | Alignment with linked data vocabularies |
| Provenance and licensing | Integration of provenance and citation tracking mechanisms |
FAIR² ensures that metadata not only meets FAIR requirements but also enables seamless integration with modern machine learning workflows.
Example: FAIR² Metadata with Croissant Extensions
{
  "@context": [
    "https://fair2.ai/ns/",
    "https://mlcroissant.org/"
  ],
  "@type": "Dataset",
  "name": "AI-ready Dataset",
  "description": "A dataset structured for AI and machine learning workflows.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "cr:features": [
    {
      "@type": "Feature",
      "name": "Image",
      "dataType": "image/png"
    },
    {
      "@type": "Feature",
      "name": "Label",
      "dataType": "string"
    }
  ],
  "cr:citeAs": "Doe, J. AI Dataset (2025)",
  "fair2:method": {
    "@type": "fair2:Section",
    "name": "Data Preprocessing",
    "step": [
      {
        "@type": "fair2:Step",
        "name": "Normalization",
        "description": "Rescaling image pixel values between 0 and 1."
      }
    ]
  }
}
Loading FAIR² Datasets in AI Frameworks
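Datasets described using FAIR² and ML Croissant metadata can be read with the mlcroissant library and passed on to AI frameworks such as PyTorch. In this sketch, "default" is a placeholder for a record set name declared in the metadata:
import mlcroissant as mlc
from torch.utils.data import DataLoader

dataset = mlc.Dataset(jsonld="fair2.json")  # parse the JSON-LD metadata
records = list(dataset.records(record_set="default"))  # materialize one record set as a list of dicts
dataloader = DataLoader(records, batch_size=32, collate_fn=lambda batch: batch)  # keep each batch as a list of records
for batch in dataloader:
    pass  # Training logic here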
Validation of FAIR² Metadata
FAIR² uses SHACL to validate ML Croissant-based metadata:
pyshacl -s fair2_dataset.json -sf json-ld -df json-ld mydata.json
Typical validation errors and solutions:
| Error Message | Cause | Recommended Fix |
|---|---|---|
| Missing required property cr:citeAs | Citation field not provided | Add "cr:citeAs": "Your citation" to metadata |
| schema:distribution must be present | No file references defined | Include at least one schema:distribution object |
| Invalid datatype for schema:datePublished | Incorrect date format | Use YYYY-MM-DD format |
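
The date-format error in the table above can be caught before running the validator. The following is a minimal sketch using only the standard library; the function name `is_iso_date` is hypothetical, and the check mirrors only the YYYY-MM-DD format named in the table, not the full SHACL constraint.

```python
import re
from datetime import date

# Exactly four, two, and two digits separated by hyphens.
_DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value) -> bool:
    """Return True when value is a real calendar date written as YYYY-MM-DD."""
    if not isinstance(value, str) or not _DATE_PATTERN.fullmatch(value):
        return False
    year, month, day = map(int, value.split("-"))
    try:
        date(year, month, day)  # rejects impossible dates such as 2025-02-30
    except ValueError:
        return False
    return True


print(is_iso_date("2025-01-31"))  # True
print(is_iso_date("31/01/2025"))  # False
```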
Compatibility Rules
FAIR² metadata files MUST satisfy the rules below in order to load cleanly
with the mlcroissant Python library. These rules were established by
validating real FAIR² packages against mlcroissant's JSON-LD processor and
identifying reproducible crash modes.
Rule 1 — No shared-node @id references
A node whose @id appears as a top-level @graph member MUST NOT be
referenced from more than one other place via a bare {"@id": "X"} object.
The mlcroissant traversal function visits shared nodes twice and raises a
KeyError. To describe the same entity at multiple points in the graph,
embed it as a fresh blank-node object without @id at each usage site
(see Rule 4).
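
A file can be screened for Rule 1 violations before it reaches mlcroissant. The sketch below works on the parsed JSON alone and assumes the document has a top-level @graph array; the function name `shared_node_ids` is hypothetical and is not part of any FAIR² or mlcroissant API.

```python
from collections import Counter


def shared_node_ids(document: dict) -> list[str]:
    """List @ids of top-level @graph nodes that are targeted by more than
    one bare {"@id": ...} reference anywhere in the document."""
    graph_ids = {
        node["@id"]
        for node in document.get("@graph", [])
        if isinstance(node, dict) and isinstance(node.get("@id"), str)
    }
    counts = Counter()

    def walk(value):
        if isinstance(value, dict):
            if set(value) == {"@id"} and isinstance(value["@id"], str):
                counts[value["@id"]] += 1  # a bare reference, not a full node
                return
            for key, child in value.items():
                if key != "@context":  # term definitions are not references
                    walk(child)
        elif isinstance(value, list):
            for child in value:
                walk(child)

    walk(document)
    return sorted(i for i, n in counts.items() if i in graph_ids and n > 1)


doc = {
    "@graph": [
        {"@id": "org1", "name": "Example Lab"},
        {"@id": "ds1", "creator": {"@id": "org1"}, "publisher": {"@id": "org1"}},
    ]
}
print(shared_node_ids(doc))  # ['org1']
```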
Rule 2 — No "@type": "@id" on cross-graph properties
Properties whose values reference other @graph entries MUST NOT carry
"@type": "@id" in the @context. The underlying rdflib JSON-LD parser
silently converts string values with this coercion back into bare
{"@id": "..."} dicts, reintroducing the shared-node crash described in
Rule 1. Affected properties include citation, dataArticle, dataPortal,
dataArchive, dataset, wasAssociatedWith, wasGeneratedBy,
wasDerivedFrom, wasRevisionOf, generated, next, and url.
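
An inline @context can be scanned for the coercion that Rule 2 forbids. This is a rough sketch with a hypothetical function name, `id_coerced_terms`; it inspects only context objects embedded in the file, since remote context URLs would have to be fetched separately, and it also matches prefixed terms (the `prov:` prefix in the example is illustrative).

```python
# Property names taken from the list in Rule 2.
CROSS_GRAPH_PROPERTIES = {
    "citation", "dataArticle", "dataPortal", "dataArchive", "dataset",
    "wasAssociatedWith", "wasGeneratedBy", "wasDerivedFrom", "wasRevisionOf",
    "generated", "next", "url",
}


def id_coerced_terms(context) -> list[str]:
    """Return affected terms whose definition carries "@type": "@id"."""
    entries = context if isinstance(context, list) else [context]
    flagged = set()
    for entry in entries:
        if not isinstance(entry, dict):
            continue  # a remote context URL cannot be inspected here
        for term, definition in entry.items():
            local_name = term.split(":")[-1]  # also match prefixed terms
            if (
                local_name in CROSS_GRAPH_PROPERTIES
                and isinstance(definition, dict)
                and definition.get("@type") == "@id"
            ):
                flagged.add(term)
    return sorted(flagged)


context = [
    "https://example.org/context.jsonld",
    {
        "citation": {"@id": "schema:citation", "@type": "@id"},
        "prov:wasDerivedFrom": {"@type": "@id"},
        "url": {"@id": "schema:url"},
        "name": "schema:name",
    },
]
print(id_coerced_terms(context))  # ['citation', 'prov:wasDerivedFrom']
```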
Rule 3 — No circular self-references via @id
A Dataset node MUST NOT reference its own @id as the value of any of its
own properties. In particular, changeLog[].wasRevisionOf["@id"] MUST NOT
equal the Dataset's @id. To represent a revision that chains back to the
same logical entity without the cycle, embed the revision target as an
object without an @id, or omit the cross-reference entirely.
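
A self-reference of the kind Rule 3 describes can be detected by searching a node's property values for its own @id. The sketch below is one possible check; `self_references` is a hypothetical name, and the example identifiers are made up.

```python
def self_references(node: dict) -> list[str]:
    """Return the property names under which a node points back at its own @id."""
    own_id = node.get("@id")
    if own_id is None:
        return []  # a blank node cannot reference itself by @id

    def points_to_self(value) -> bool:
        if isinstance(value, dict):
            if value.get("@id") == own_id:
                return True
            return any(points_to_self(child) for child in value.values())
        if isinstance(value, list):
            return any(points_to_self(child) for child in value)
        return False

    return sorted(
        key for key, value in node.items()
        if key != "@id" and points_to_self(value)
    )


dataset = {
    "@id": "https://example.com/dataset",
    "name": "AI-ready Dataset",
    "changeLog": [{"wasRevisionOf": {"@id": "https://example.com/dataset"}}],
}
print(self_references(dataset))  # ['changeLog']
```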
Rule 4 — Repeated entities MUST be blank nodes
When the same logical entity (an author, an organization, a software agent)
appears in multiple locations in the graph, each occurrence MUST be a fresh
object without @id. This is the corollary of Rule 1: without an @id,
the JSON-LD processor treats each occurrence as a distinct blank node and
the shared-node crash cannot be triggered.
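
One way to apply Rule 4 mechanically is to replace every bare reference with a copy of the target node that has its @id removed. The following sketch assumes the caller supplies a lookup table of nodes keyed by @id; `embed_as_blank_nodes` is a hypothetical helper, not an existing FAIR² or mlcroissant function.

```python
import copy


def embed_as_blank_nodes(value, nodes_by_id: dict):
    """Return a copy of value in which each bare {"@id": X} reference to a
    known node is replaced by a fresh copy of that node without its @id."""
    if isinstance(value, dict):
        target = value.get("@id")
        if set(value) == {"@id"} and isinstance(target, str) and target in nodes_by_id:
            node = copy.deepcopy(nodes_by_id[target])  # fresh object per usage site
            node.pop("@id", None)
            return node  # embedded copies are not expanded further, which avoids cycles
        return {key: embed_as_blank_nodes(child, nodes_by_id) for key, child in value.items()}
    if isinstance(value, list):
        return [embed_as_blank_nodes(child, nodes_by_id) for child in value]
    return value


org = {"@id": "org1", "@type": "Organization", "name": "Example Lab"}
dataset = {"@id": "ds1", "creator": {"@id": "org1"}, "publisher": {"@id": "org1"}}
flattened = embed_as_blank_nodes(dataset, {"org1": org})
print(flattened["creator"])  # {'@type': 'Organization', 'name': 'Example Lab'}
```

Because each occurrence is deep-copied, later edits to one embedded copy do not leak into the others.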
Rule 5 — Distribution @id MUST match FileObject source exactly
The @id value of each distribution[] entry MUST exactly match the value
used in recordSet[].field[].source.fileObject["@id"]. Any mismatch causes
mlcroissant to silently drop the source mapping; the RecordSet will be
unloadable without any validation error.
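
Because the failure described in Rule 5 is silent, it is worth checking the identifiers explicitly. A minimal sketch follows, assuming the metadata layout named in the rule (distribution[] entries and recordSet[].field[].source.fileObject["@id"]); the function name `unmatched_file_sources` is hypothetical.

```python
def unmatched_file_sources(metadata: dict) -> list[str]:
    """Return source.fileObject @ids that match no distribution @id exactly."""
    distribution_ids = {
        entry.get("@id")
        for entry in metadata.get("distribution", [])
        if isinstance(entry, dict)
    }
    unmatched = []
    for record_set in metadata.get("recordSet", []):
        for field in record_set.get("field", []):
            source = field.get("source")
            file_ref = source.get("fileObject") if isinstance(source, dict) else None
            ref_id = file_ref.get("@id") if isinstance(file_ref, dict) else None
            # The comparison is exact: case and whitespace differences count as mismatches.
            if ref_id is not None and ref_id not in distribution_ids:
                unmatched.append(ref_id)
    return unmatched


metadata = {
    "distribution": [{"@type": "cr:FileObject", "@id": "data.csv"}],
    "recordSet": [
        {
            "field": [
                {"name": "age", "source": {"fileObject": {"@id": "data.csv"}}},
                {"name": "label", "source": {"fileObject": {"@id": "Data.csv"}}},
            ]
        }
    ],
}
print(unmatched_file_sources(metadata))  # ['Data.csv']
```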
Rule 6 — FileObject @type MUST be cr:FileObject
Distribution entries MUST declare "@type": "cr:FileObject" (which resolves
to http://mlcommons.org/croissant/FileObject). Legacy types such as
schema:DataDownload or sc:FileObject are silently ignored by
mlcroissant and the distribution will not be indexed.
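
Since mistyped distribution entries are ignored without an error, a small check can surface them. This sketch applies Rule 6 exactly as written, accepting only the compact and expanded forms of cr:FileObject; `wrongly_typed_distributions` is a hypothetical name.

```python
# The compact form and the expanded IRI given in Rule 6.
ACCEPTED_TYPES = {"cr:FileObject", "http://mlcommons.org/croissant/FileObject"}


def wrongly_typed_distributions(metadata: dict) -> list[str]:
    """Return the @ids of distribution entries not typed as cr:FileObject."""
    flagged = []
    for entry in metadata.get("distribution", []):
        declared = entry.get("@type")
        types = declared if isinstance(declared, list) else [declared]
        if not any(isinstance(t, str) and t in ACCEPTED_TYPES for t in types):
            flagged.append(str(entry.get("@id")))
    return flagged


metadata = {
    "distribution": [
        {"@id": "a.csv", "@type": "cr:FileObject"},
        {"@id": "b.csv", "@type": "schema:DataDownload"},
        {"@id": "c.csv", "@type": "sc:FileObject"},
    ]
}
print(wrongly_typed_distributions(metadata))  # ['b.csv', 'c.csv']
```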
Summary
FAIR² provides the following enhancements to ML Croissant:
- Structured metadata validation using SHACL
- Provenance tracking and citation support
- Interoperable format for use with AI training pipelines
- Integrated methodology documentation for reproducibility
For further details, refer to the FAIR² schema documentation and validator tools.