FAIR² Data Dictionary

Specification and Implementation Guide

1. Purpose and Scope

The FAIR² Core Data Dictionary defines the minimum interoperable metadata model for describing dataset variables in a way that is:

Human-readable
Machine-actionable
Mappable to existing standards (REDCap, ISO/IEC 11179, DDI, DCAT, ML Croissant)
Compatible with spreadsheet (CSV/Excel) workflows
Extensible without breaking backward compatibility

The Core profile intentionally excludes domain ontologies, UI logic, and mandatory statistics, but is designed so that each can be added later as an extension.

2. Conceptual Model

Each variable is treated as a first-class entity that has:

A technical identifier
A human-readable name
A semantic definition
A data type and optional value domain
Provenance links to the method step(s) that produced it
Optional descriptive statistics with provenance

Variables are attached to a dataset using the standard Schema.org property:

schema:variableMeasured
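
As an illustration, the attachment can be sketched in Python with plain dictionaries serialised as JSON-LD. The dataset name, the variable stubs, and the helper name attach_variables are hypothetical; only schema:variableMeasured comes from this specification.

```python
import json

def attach_variables(dataset: dict, variables: list) -> dict:
    """Return a copy of `dataset` with `variables` listed under schema:variableMeasured."""
    result = dict(dataset)
    # Keep any variables that are already attached and append the new ones.
    existing = list(result.get("schema:variableMeasured", []))
    result["schema:variableMeasured"] = existing + list(variables)
    return result

# Hypothetical dataset and variable stubs, for illustration only.
dataset = {
    "@context": {
        "schema": "https://schema.org/",
        "fair2": "https://fair2.ai/ns/",
    },
    "@type": "schema:Dataset",
    "schema:name": "Example dataset",
}
variables = [
    {"@id": "fair2:var:rmsd", "schema:identifier": "rmsd"},
    {"@id": "fair2:var:temperature", "schema:identifier": "temperature"},
]

described = attach_variables(dataset, variables)
print(json.dumps(described, indent=2))
```

The helper returns a new dictionary rather than mutating its input, so the original dataset record can be kept for comparison.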

3. Terminology (Normative)

3.1 Technical Identifier

Definition: Machine-facing, stable identifier used in files, schemas, and code.
Mapping: schema:identifier
Source (UI): variable_name
Requirements:
Unique within the dataset
Stable across versions
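
The uniqueness requirement can be checked mechanically. The following is a minimal sketch, assuming variables are held as a list of dictionaries keyed by schema:identifier; the function name duplicate_identifiers and the sample records are hypothetical.

```python
from collections import Counter

def duplicate_identifiers(variables):
    """Return the technical identifiers that occur more than once, sorted."""
    counts = Counter(v.get("schema:identifier") for v in variables)
    return sorted(name for name, n in counts.items() if name is not None and n > 1)

# Hypothetical variable records; "rmsd" is deliberately repeated.
variables = [
    {"schema:identifier": "rmsd"},
    {"schema:identifier": "temperature"},
    {"schema:identifier": "rmsd"},
]

duplicates = duplicate_identifiers(variables)
print(duplicates)
```

Stability across versions cannot be verified from a single snapshot; it would require comparing the identifier sets of two dictionary versions.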

3.2 Human Label

Definition: Human-readable name shown in UIs and documentation.
Mapping: schema:name
Source (UI): variable

3.3 Semantic Definition

Definition: Precise, unambiguous statement of what the variable means.
Mapping: skos:definition
Source (UI): description
Normative rule: In FAIR² Core, the UI field description IS the semantic definition.

3.4 Usage Description (Optional)

Definition: Contextual or interpretive notes about usage.
Mapping: schema:description
Status: Optional; may be added without breaking Core compliance.

4. Core Properties

4.1 Required (FAIR²-Core)

Property	Vocabulary	Meaning
`@id`	RDF	Persistent variable identifier
`schema:identifier`	schema.org	Technical name
`schema:name`	schema.org	Human label
`skos:definition`	SKOS	Semantic definition
`cr:dataType`	ML Croissant	Primitive data type
`prov:wasGeneratedBy`	PROV-O	Method step(s) provenance (one or more)
`qudt:unit`	QUDT	Unit of measurement (IRI reference)
`schema:unitCode`	schema.org	Unit symbol/code
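
A check for the required properties might look like the sketch below. It assumes that a property counts as missing when it is absent or empty, which this specification does not state explicitly; the function name missing_required and the sample record are hypothetical.

```python
# Required FAIR²-Core properties, in the order of the table above.
REQUIRED_PROPERTIES = (
    "@id",
    "schema:identifier",
    "schema:name",
    "skos:definition",
    "cr:dataType",
    "prov:wasGeneratedBy",
    "qudt:unit",
    "schema:unitCode",
)

def missing_required(variable):
    """List required properties that are absent or empty in a variable record."""
    # Assumption: None, empty strings, empty lists and empty objects count as missing.
    return [p for p in REQUIRED_PROPERTIES if variable.get(p) in (None, "", [], {})]

# Hypothetical, deliberately incomplete record: blank definition, no unit fields.
variable = {
    "@id": "fair2:var:rmsd",
    "schema:identifier": "rmsd",
    "schema:name": "RMSD",
    "skos:definition": "",
    "cr:dataType": {"@id": "xsd:double"},
    "prov:wasGeneratedBy": [{"@id": "fair2:method:step3"}],
}

missing = missing_required(variable)
print(missing)
```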

4.2 Strongly Recommended (FAIR²-Core+)

Property	Vocabulary	Meaning
`fair2:format`	FAIR²	Expected format/pattern
`fair2:valueDomain`	FAIR²	Permitted values or ranges
`fair2:missingValueCode`	FAIR²	Explicit missing codes
`fair2:exampleValue`	FAIR²	Example value

5. Statistics Model (Optional but Supported)

Statistics are derived metadata about variables. They are:

Optional
Repeatable
Provenance-bearing
Variable-specific

Statistics MUST NOT be encoded as fixed columns in the Core dictionary.

5.1 JSON-LD Statistics Pattern (Normative)

Statistics are represented as a fair2:DescriptiveStatistics object attached to a variable through the fair2:statistics property.

"fair2:statistics": {
  "@type": "fair2:DescriptiveStatistics",
  "schema:variableMeasured": [
    { "@type": "schema:PropertyValue", "schema:name": "count", "schema:value": 4 },
    { "@type": "schema:PropertyValue", "schema:name": "unique", "schema:value": 4 },
    { "@type": "schema:PropertyValue", "schema:name": "missing_values", "schema:value": 36 }
  ],
  "prov:wasGeneratedBy": {
    "@type": "prov:Activity",
    "schema:name": "Pandas statistics computation",
    "prov:wasAssociatedWith": {
      "@type": "prov:SoftwareAgent",
      "schema:name": "Pandas",
      "schema:softwareVersion": "2.1.1",
      "schema:programmingLanguage": "Python",
      "schema:identifier": {
        "@id": "https://pandas.pydata.org/",
        "schema:propertyID": "URL"
      }
    }
  }
}
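
A node of this shape could be produced as in the following sketch, which uses only the Python standard library instead of Pandas. The helper names describe and statistics_node, the treatment of None and empty strings as missing, and the sample values are all assumptions made for illustration.

```python
import json
import platform

def describe(values, missing_codes=(None, "")):
    """Compute count, unique and missing_values for one variable's raw values."""
    # Assumption: a value is missing when it equals one of `missing_codes`.
    present = [v for v in values if v not in missing_codes]
    return {
        "count": len(present),
        "unique": len(set(present)),
        "missing_values": len(values) - len(present),
    }

def statistics_node(stats, software_name, software_version, language):
    """Wrap computed statistics and their provenance in the pattern shown above."""
    return {
        "@type": "fair2:DescriptiveStatistics",
        "schema:variableMeasured": [
            {"@type": "schema:PropertyValue", "schema:name": name, "schema:value": value}
            for name, value in stats.items()
        ],
        "prov:wasGeneratedBy": {
            "@type": "prov:Activity",
            "schema:name": f"{software_name} statistics computation",
            "prov:wasAssociatedWith": {
                "@type": "prov:SoftwareAgent",
                "schema:name": software_name,
                "schema:softwareVersion": software_version,
                "schema:programmingLanguage": language,
            },
        },
    }

# Hypothetical raw values with two missing entries (None and "").
values = [1.2, 0.8, None, 2.5, "", 0.8]
stats = describe(values)
node = statistics_node(
    stats, "Python standard library", platform.python_version(), "Python"
)
print(json.dumps({"fair2:statistics": node}, indent=2))
```

Recording the software version at computation time keeps the provenance accurate when the statistics are regenerated in a different environment.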

5.2 CSV / Excel Statistics Table (Normative)

When statistics are exported to CSV or Excel, they MUST be written to a separate long-form table with one row per variable and statistic.

Column	Description
variable_id	Variable identifier
statistic_name	Name of statistic (`count`, `mean`, `unique`, etc.)
statistic_value	Value (numeric or string)
statistic_unit	Unit (if applicable)
statistic_description	Optional notes
generated_by_activity	Provenance activity ID
software_name	Software used
software_version	Software version
programming_language	Language/environment
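
One way to derive this table from a JSON-LD statistics node is sketched below. The function name statistics_rows is hypothetical, as are the activity identifier in the sample node and the choice to read statistic_unit from schema:unitCode and statistic_description from schema:description; this specification does not define those mappings.

```python
import csv
import io

COLUMNS = [
    "variable_id", "statistic_name", "statistic_value", "statistic_unit",
    "statistic_description", "generated_by_activity", "software_name",
    "software_version", "programming_language",
]

def statistics_rows(variable_id, node):
    """Flatten one statistics node into long-form rows, one per statistic."""
    activity = node.get("prov:wasGeneratedBy", {})
    agent = activity.get("prov:wasAssociatedWith", {})
    rows = []
    for stat in node.get("schema:variableMeasured", []):
        rows.append({
            "variable_id": variable_id,
            "statistic_name": stat["schema:name"],
            "statistic_value": stat["schema:value"],
            # Assumed source properties for unit and description (not in the spec).
            "statistic_unit": stat.get("schema:unitCode", ""),
            "statistic_description": stat.get("schema:description", ""),
            "generated_by_activity": activity.get("@id", ""),
            "software_name": agent.get("schema:name", ""),
            "software_version": agent.get("schema:softwareVersion", ""),
            "programming_language": agent.get("schema:programmingLanguage", ""),
        })
    return rows

# Sample node following section 5.1; the activity "@id" is hypothetical.
node = {
    "@type": "fair2:DescriptiveStatistics",
    "schema:variableMeasured": [
        {"@type": "schema:PropertyValue", "schema:name": "count", "schema:value": 4},
        {"@type": "schema:PropertyValue", "schema:name": "unique", "schema:value": 4},
        {"@type": "schema:PropertyValue", "schema:name": "missing_values", "schema:value": 36},
    ],
    "prov:wasGeneratedBy": {
        "@id": "fair2:activity:stats1",
        "@type": "prov:Activity",
        "prov:wasAssociatedWith": {
            "@type": "prov:SoftwareAgent",
            "schema:name": "Pandas",
            "schema:softwareVersion": "2.1.1",
            "schema:programmingLanguage": "Python",
        },
    },
}

rows = statistics_rows("fair2:var:rmsd", node)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```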

6. Canonical JSON-LD Example

{
  "@context": {
    "schema": "https://schema.org/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "prov": "http://www.w3.org/ns/prov#",
    "fair2": "https://fair2.ai/ns/",
    "cr": "http://mlcommons.org/croissant/",
    "qudt": "http://qudt.org/schema/qudt/",
    "unit": "http://qudt.org/vocab/unit/",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@id": "fair2:var:rmsd",
  "@type": ["fair2:Variable", "skos:Concept", "prov:Entity"],
  "schema:identifier": "rmsd",
  "schema:name": "RMSD",
  "skos:definition": "Root-mean-square deviation between two aligned molecular structures.",
  "cr:dataType": {"@id": "xsd:double"},
  "qudt:unit": {"@id": "unit:ANGSTROM"},
  "schema:unitCode": "Å",
  "prov:wasGeneratedBy": [
    {"@id": "fair2:method:step2", "schema:name": "Structure alignment"},
    {"@id": "fair2:method:step3", "schema:name": "RMSD calculation"}
  ]
}
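
To show how the prefixes in @context resolve, the sketch below expands compact IRIs by simple prefix substitution. It is a deliberately simplified stand-in for a full JSON-LD processor and handles only the prefix:local form; the function name expand is hypothetical.

```python
def expand(term, context):
    """Expand a compact IRI such as 'xsd:double' using the @context prefixes."""
    prefix, sep, local = term.partition(":")
    # Leave absolute IRIs (e.g. "https://...") and unknown prefixes unchanged.
    if sep and prefix in context and not local.startswith("//"):
        return context[prefix] + local
    return term

# Subset of the @context from the canonical example.
context = {
    "schema": "https://schema.org/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "fair2": "https://fair2.ai/ns/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}

for term in ("xsd:double", "skos:definition", "fair2:var:rmsd"):
    print(term, "->", expand(term, context))
```

Only the first colon separates the prefix from the local part, so fair2:var:rmsd resolves under the fair2 namespace with var:rmsd as its local name.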

7. CSV / Excel Core Dictionary Columns

Column	Meaning
variable_id	Persistent variable identifier
technical_name	`schema:identifier`
human_label	`schema:name`
definition	`skos:definition`
value_type	`cr:dataType`
format	`fair2:format`
unit	`qudt:unit`
unit_code	`schema:unitCode`
method_step_ids	`prov:wasGeneratedBy` (semicolon-delimited if multiple)
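
The column mapping above can be sketched as a conversion from a JSON-LD variable to a spreadsheet row. The function name to_dictionary_row and the sample temperature variable (including its identifiers, unit, and method steps) are hypothetical; IRI-valued properties are assumed to be written as their compact @id.

```python
def to_dictionary_row(variable):
    """Map a JSON-LD variable record to the Core dictionary columns."""
    def ref(value):
        # Assumption: IRI references are exported as their "@id" string.
        return value.get("@id", "") if isinstance(value, dict) else (value or "")

    steps = variable.get("prov:wasGeneratedBy", [])
    if isinstance(steps, dict):  # a single step may be given as one object
        steps = [steps]
    return {
        "variable_id": variable.get("@id", ""),
        "technical_name": variable.get("schema:identifier", ""),
        "human_label": variable.get("schema:name", ""),
        "definition": variable.get("skos:definition", ""),
        "value_type": ref(variable.get("cr:dataType")),
        "format": variable.get("fair2:format", ""),
        "unit": ref(variable.get("qudt:unit")),
        "unit_code": variable.get("schema:unitCode", ""),
        # Multiple method steps are joined with semicolons.
        "method_step_ids": ";".join(ref(step) for step in steps),
    }

# Hypothetical variable, for illustration only.
variable = {
    "@id": "fair2:var:temperature",
    "schema:identifier": "temperature",
    "schema:name": "Temperature",
    "skos:definition": "Simulation temperature of the system.",
    "cr:dataType": {"@id": "xsd:double"},
    "qudt:unit": {"@id": "unit:K"},
    "schema:unitCode": "K",
    "prov:wasGeneratedBy": [
        {"@id": "fair2:method:step1"},
        {"@id": "fair2:method:step4"},
    ],
}

row = to_dictionary_row(variable)
print(row)
```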

8. Compliance Levels

Level	Requirements
FAIR²-Core	All required properties
FAIR²-Core+	Core + recommended properties
FAIR²-Extended	Core/Core+ + extension profiles
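
The levels can be read as a simple decision procedure, sketched below under two assumptions that this specification does not settle: FAIR²-Core+ is taken to require all four recommended properties, and the use of an extension profile is passed in as a flag because no detection rule is defined. The function name compliance_level is hypothetical.

```python
REQUIRED = {
    "@id", "schema:identifier", "schema:name", "skos:definition",
    "cr:dataType", "prov:wasGeneratedBy", "qudt:unit", "schema:unitCode",
}
RECOMMENDED = {
    "fair2:format", "fair2:valueDomain",
    "fair2:missingValueCode", "fair2:exampleValue",
}

def compliance_level(variable, uses_extension_profile=False):
    """Return the compliance level of a variable record, or None if not compliant."""
    # Properties with empty values are treated as absent (assumption).
    present = {k for k, v in variable.items() if v not in (None, "", [], {})}
    if not REQUIRED <= present:
        return None
    if uses_extension_profile:
        return "FAIR²-Extended"
    if RECOMMENDED <= present:
        return "FAIR²-Core+"
    return "FAIR²-Core"

# Hypothetical records with placeholder values.
core = {name: "x" for name in REQUIRED}
core_plus = {**core, **{name: "x" for name in RECOMMENDED}}

print(compliance_level(core))
print(compliance_level(core_plus))
print(compliance_level(core, uses_extension_profile=True))
print(compliance_level({"schema:name": "RMSD"}))
```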

9. Normative Rules Summary

Variables MUST be enumerated using schema:variableMeasured
Technical identifiers MUST use schema:identifier
Human labels MUST use schema:name
Semantic definitions MUST use skos:definition
Data types MUST use cr:dataType (ML Croissant)
Units MUST use qudt:unit with a QUDT unit IRI
Provenance MUST use prov:wasGeneratedBy, referencing at least one method step; multiple method steps are allowed
FAIR² property names MUST be singular (for example, fair2:missingValueCode rather than fair2:missingValueCodes)
Statistics are OPTIONAL; when present, they MUST carry provenance and MUST NOT be encoded as fixed columns in the Core dictionary

10. Design Rationale (Informative)

This specification:

Aligns with REDCap variable dictionaries
Matches ISO/IEC 11179 semantic expectations
Is compatible with ML Croissant field metadata
Supports FAIR and Responsible AI principles
Avoids premature ontology or analytics lock-in

End of FAIR² Data Dictionary Specification