FAIR² and ML Croissant Integration
Overview
FAIR² (FAIR Squared) builds directly on ML Croissant, extending its capabilities to ensure that datasets are both FAIR (Findable, Accessible, Interoperable, and Reusable) and AI-ready.
ML Croissant is a metadata standard for machine learning datasets developed by MLCommons. FAIR² enhances this foundation by adding:
- SHACL validation for structured dataset metadata
- Support for AI/ML methodologies using `MethodSectionShape` and `MethodStepShape`
- Compliance features aligned with Responsible AI through explicit tracking of dataset provenance and usage
This document outlines how FAIR² builds upon and extends ML Croissant to provide machine-actionable metadata for AI workflows.
Required Properties in ML Croissant
| Property | Description | Type | Example |
|---|---|---|---|
| `@type` | Specifies the type of the dataset (typically Dataset) | sc:Dataset | `"@type": "sc:Dataset"` |
| `name` | The title of the dataset | xsd:string | `"name": "FAIR AI Benchmark Dataset"` |
| `description` | Explanation of the dataset’s content and purpose | xsd:string | `"description": "A dataset for AI fairness"` |
| `license` | Dataset license URI | xsd:anyURI | `"license": "https://creativecommons.org/licenses/by/4.0/"` |
| `url` | Landing page or repository URL | xsd:anyURI | `"url": "https://example.com/dataset"` |
| `distribution` | Description of downloadable files | Array | Refer to FileObject structure below |
| `recordSet` | Defines logical data structure | Array | Refer to RecordSet section |
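As a rough illustration of how the table above could be applied, the following sketch reports which required properties are absent from a metadata dictionary. The helper name `missing_required` is hypothetical and not part of ML Croissant or FAIR²; the check is only a presence test and does not verify types.

```python
# Required top-level properties, copied from the table above.
REQUIRED_PROPERTIES = (
    "@type", "name", "description", "license", "url",
    "distribution", "recordSet",
)


def missing_required(metadata: dict) -> list[str]:
    """Return required properties that are absent or empty, in table order.

    Hypothetical helper for illustration; not an ML Croissant or FAIR² API.
    """
    return [key for key in REQUIRED_PROPERTIES if not metadata.get(key)]


example = {
    "@type": "sc:Dataset",
    "name": "FAIR AI Benchmark Dataset",
    "description": "A dataset for AI fairness",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.com/dataset",
}
print(missing_required(example))  # ['distribution', 'recordSet']
```

A check like this is cheap to run before full validation, but it is not a substitute for it.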
FileObject (Inside distribution)
| Property | Description | Type | Example |
|---|---|---|---|
| `@type` | Specifies file type | cr:FileObject | `"@type": "cr:FileObject"` |
| `@id` | Unique file identifier | xsd:string | `"@id": "file1"` |
| `name` | Filename | xsd:string | `"name": "data.csv"` |
| `contentUrl` | URL of hosted file | xsd:anyURI | `"contentUrl": "https://example.com/data.csv"` |
| `encodingFormat` | File format (e.g., text/csv) | xsd:string | `"encodingFormat": "text/csv"` |
| `sha256` | File checksum | xsd:string | `"sha256": "abc123..."` |
RecordSet (Inside recordSet)
| Property | Description | Type | Example |
|---|---|---|---|
| `@type` | Declares a record set | cr:RecordSet | `"@type": "cr:RecordSet"` |
| `name` | Name of the record set | xsd:string | `"name": "User Data"` |
| `description` | Textual summary of the records | xsd:string | `"description": "Demographic information"` |
| `field` | List of field definitions | Array | Refer to Field section |
Field (Inside field)
| Property | Description | Type | Example |
|---|---|---|---|
| `@type` | Field object | cr:Field | `"@type": "cr:Field"` |
| `name` | Field name | xsd:string | `"name": "age"` |
| `description` | Field explanation | xsd:string | `"description": "Age of participant"` |
| `dataType` | Expected data type | sc:DataType | `"dataType": "sc:Integer"` |
| `references` | How the field maps to a file | Object | `"references": { "fileObject": "file1" }` |
Extensions Provided by FAIR²
| ML Croissant Feature | FAIR² Enhancement |
|---|---|
| Dataset metadata | SHACL validation for compliance and consistency |
| AI methodology support | Structured method tracking for data processing and modeling |
| Schema.org compatibility | Alignment with linked data vocabularies |
| Provenance and licensing | Integration of provenance and citation tracking mechanisms |
FAIR² ensures that metadata not only meets FAIR requirements but also enables seamless integration with modern machine learning workflows.
Example: FAIR² Metadata with Croissant Extensions
```json
{
  "@context": [
    "https://fair2.ai/ns/",
    "https://mlcroissant.org/"
  ],
  "@type": "Dataset",
  "name": "AI-ready Dataset",
  "description": "A dataset structured for AI and machine learning workflows.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "cr:features": [
    {
      "@type": "Feature",
      "name": "Image",
      "dataType": "image/png"
    },
    {
      "@type": "Feature",
      "name": "Label",
      "dataType": "string"
    }
  ],
  "cr:citeAs": "Doe, J. AI Dataset (2025)",
  "fair2:method": {
    "@type": "fair2:Section",
    "name": "Data Preprocessing",
    "step": [
      {
        "@type": "fair2:Step",
        "name": "Normalization",
        "description": "Rescaling image pixel values between 0 and 1."
      }
    ]
  }
}
```
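Since the `fair2:method` block is plain JSON, the documented methodology can be read back programmatically. The following is a minimal sketch, assuming the key names used in the example above (`fair2:method`, `step`, `name`); the helper `method_steps` is hypothetical and not part of any FAIR² tooling.

```python
def method_steps(metadata: dict) -> list[tuple[str, str]]:
    """Return (section name, step name) pairs from the fair2:method block.

    Hypothetical helper; key names follow the example metadata above.
    """
    sections = metadata.get("fair2:method", [])
    if isinstance(sections, dict):  # a single section may appear without a list
        sections = [sections]
    return [
        (section.get("name", ""), step.get("name", ""))
        for section in sections
        for step in section.get("step", [])
    ]


metadata = {
    "fair2:method": {
        "@type": "fair2:Section",
        "name": "Data Preprocessing",
        "step": [
            {
                "@type": "fair2:Step",
                "name": "Normalization",
                "description": "Rescaling image pixel values between 0 and 1.",
            }
        ],
    }
}
print(method_steps(metadata))  # [('Data Preprocessing', 'Normalization')]
```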
Loading FAIR² Datasets in AI Frameworks
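Datasets described using FAIR² and ML Croissant metadata can be read with the `mlcroissant` library and passed to AI frameworks such as PyTorch. The metadata must declare at least one record set, and non-tensor values may need a custom `collate_fn`:

```python
import mlcroissant as mlc
from torch.utils.data import DataLoader, IterableDataset

class CroissantRecords(IterableDataset):
    """Streams the records of one Croissant record set."""
    def __init__(self, jsonld, record_set):
        self.records = mlc.Dataset(jsonld=jsonld).records(record_set=record_set)
    def __iter__(self):
        return iter(self.records)

# "default" is a placeholder: use a record set name declared in fair2.json.
dataloader = DataLoader(CroissantRecords("fair2.json", "default"))
for batch in dataloader:
    ...  # Training logic here; records are dicts keyed by field
```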
Validation of FAIR² Metadata
FAIR² uses SHACL to validate ML Croissant-based metadata:
```shell
pyshacl -s fair2_dataset.json -sf json-ld -df json-ld mydata.json
```
Typical validation errors and solutions:
| Error Message | Cause | Recommended Fix |
|---|---|---|
| `Missing required property cr:citeAs` | Citation field not provided | Add `"cr:citeAs": "Your citation"` to metadata |
| `schema:distribution must be present` | No file references defined | Include at least one `schema:distribution` object |
| `Invalid datatype for schema:datePublished` | Incorrect date format | Use `YYYY-MM-DD` format |
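The date-format fix in the last row can be checked before running the validator. The sketch below uses only the Python standard library and assumes that a plain `YYYY-MM-DD` calendar date is what the FAIR² shape expects, as the table suggests; the function name `is_valid_date_published` is hypothetical.

```python
import re
from datetime import date


def is_valid_date_published(value: str) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD.

    Hypothetical pre-check; the SHACL shapes remain the authoritative rule.
    """
    # Enforce zero-padded YYYY-MM-DD before parsing, since some Python
    # versions accept other ISO 8601 spellings in date.fromisoformat.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


print(is_valid_date_published("2025-03-01"))  # True
print(is_valid_date_published("01/03/2025"))  # False
print(is_valid_date_published("2025-02-30"))  # False
```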
Summary
FAIR² provides the following enhancements to ML Croissant:
- Structured metadata validation using SHACL
- Provenance tracking and citation support
- Interoperable format for use with AI training pipelines
- Integrated methodology documentation for reproducibility
For further details, refer to the FAIR² schema documentation and validator tools.