Provenance and Citation in FAIR²

The FAIR² specification enables transparent, machine-readable tracking of provenance and citations at both the methodological and data transformation levels. This supports reproducibility, trust, and responsible reuse.

Scope of Provenance Tracking

| Target Element | Purpose |
| --- | --- |
| Method Steps (HowToStep) | To document the origin, tools, agents, and citations for each procedure. |
| RecordSets / Variables | To track mappings, transformations, or derivations applied to specific variables or files. |

1. Provenance in Method Descriptions

FAIR² extends schema:HowToStep with prov: properties and citation support. You may annotate each Step, Substep, or StepCase with:

  • prov:wasAssociatedWith: the agent (tool, person, organization) that performed or is responsible for the step.
  • prov:used: input files, tools, or parameters used.
  • prov:generated: output entities (optional).
  • prov:wasDerivedFrom: prior methods or protocols from which the step is derived.
  • schema:citation: a scholarly or persistent reference (DOI, paper, or dataset).

Example: Method Step with Provenance and Citation

{
  "@type": "HowToStep",
  "text": "Run AlphaFold prediction using ColabFold pipeline.",
  "prov:wasAssociatedWith": {
    "@type": "prov:SoftwareAgent",
    "name": "ColabFold v1.5",
    "identifier": "https://github.com/sokrypton/ColabFold"
  },
  "prov:used": [
    { "@type": "schema:Dataset", "name": "FASTA input", "identifier": "input.fasta" }
  ],
  "schema:citation": {
    "@type": "ScholarlyArticle",
    "name": "Highly accurate protein structure prediction with AlphaFold",
    "identifier": "https://doi.org/10.1038/s41586-021-03819-2"
  }
}
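
The same step can be assembled programmatically. The following is a minimal Python sketch, assuming the record is handled as a plain dictionary and serialized with the standard json module. The helper name make_step and its parameters are hypothetical and not part of the FAIR² specification.

```python
import json


def make_step(text, agent=None, used=None, citation=None):
    """Build a HowToStep dict, adding only the provenance fields that are supplied.

    Hypothetical helper: the keys mirror the properties listed above,
    but the function itself is not defined by FAIR².
    """
    step = {"@type": "HowToStep", "text": text}
    if agent is not None:
        step["prov:wasAssociatedWith"] = agent
    if used:
        step["prov:used"] = list(used)
    if citation is not None:
        step["schema:citation"] = citation
    return step


step = make_step(
    "Run AlphaFold prediction using ColabFold pipeline.",
    agent={
        "@type": "prov:SoftwareAgent",
        "name": "ColabFold v1.5",
        "identifier": "https://github.com/sokrypton/ColabFold",
    },
    used=[{"@type": "schema:Dataset", "name": "FASTA input", "identifier": "input.fasta"}],
    citation={
        "@type": "ScholarlyArticle",
        "identifier": "https://doi.org/10.1038/s41586-021-03819-2",
    },
)

# Serialize for embedding in a metadata file.
print(json.dumps(step, indent=2))
```

Optional properties such as prov:generated are simply left out when no value is given, which keeps the emitted record free of empty fields.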

2. Provenance in RecordSets and Variable Transformations

To capture how variables were derived, mapped, or transformed, FAIR² supports the use of prov:wasDerivedFrom, prov:wasGeneratedBy, and schema:citation at the variable and RecordSet levels.

Use Cases

| Case | Recommended Pattern |
| --- | --- |
| Variable mapped to ontology | prov:wasDerivedFrom + citation of the ontology |
| Normalization or scaling | prov:wasGeneratedBy linked to prov:Activity describing the transformation |
| Derived variable | prov:wasDerivedFrom other variable(s) or dataset(s) |
| Manual curation | prov:wasAssociatedWith = person or tool |
| Scripted transformation | Link to script or GitHub repository using prov:used or schema:codeRepository |

Example: Transformed Variable in RecordSet

{
  "@type": "PropertyValue",
  "name": "scaled_binding_affinity",
  "value": "0.75",
  "unitText": "relative",
  "prov:wasDerivedFrom": {
    "@type": "PropertyValue",
    "name": "raw_binding_affinity",
    "value": "-8.2",
    "unitText": "kcal/mol"
  },
  "prov:wasGeneratedBy": {
    "@type": "prov:Activity",
    "name": "Z-score normalization",
    "description": "Normalized binding affinities across BA.1 variants",
    "prov:used": {
      "@type": "SoftwareSourceCode",
      "name": "normalize.py",
      "codeRepository": "https://github.com/example/lab-scripts"
    }
  },
  "schema:citation": {
    "@type": "CreativeWork",
    "name": "Normalized binding data",
    "identifier": "https://doi.org/10.5281/zenodo.1234567"
  }
}
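
A record like the one above could be emitted by the transformation script itself. The sketch below is one possible approach, assuming a population z-score and plain dictionaries. The function name zscore_records, its parameters, and the input values other than -8.2 are illustrative assumptions, not part of FAIR² or of any real dataset.

```python
import statistics


def zscore_records(raw_values, raw_name, scaled_name, raw_unit, script, repository):
    """Return one PropertyValue dict per raw value, each carrying its provenance.

    Hypothetical helper: z = (x - mean) / population standard deviation.
    """
    mean = statistics.mean(raw_values)
    spread = statistics.pstdev(raw_values)
    if spread == 0:
        raise ValueError("z-scores are undefined when all values are equal")

    # One shared activity node describes the transformation for every value.
    activity = {
        "@type": "prov:Activity",
        "name": "Z-score normalization",
        "description": f"z = (x - mean) / std over {len(raw_values)} values",
        "prov:used": {
            "@type": "SoftwareSourceCode",
            "name": script,
            "codeRepository": repository,
        },
    }
    return [
        {
            "@type": "PropertyValue",
            "name": scaled_name,
            "value": round((raw - mean) / spread, 4),
            "unitText": "relative",
            "prov:wasDerivedFrom": {
                "@type": "PropertyValue",
                "name": raw_name,
                "value": raw,
                "unitText": raw_unit,
            },
            "prov:wasGeneratedBy": activity,
        }
        for raw in raw_values
    ]


# Illustrative inputs only; -7.4 and -9.0 are made-up values.
records = zscore_records(
    [-8.2, -7.4, -9.0],
    raw_name="raw_binding_affinity",
    scaled_name="scaled_binding_affinity",
    raw_unit="kcal/mol",
    script="normalize.py",
    repository="https://github.com/example/lab-scripts",
)
```

Recording the raw value and the activity next to each scaled value lets a reader trace any number in the RecordSet back to its input and to the script that produced it.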

3. Representing Agents and Activities

Agents and activities should be typed using the PROV Ontology:

| PROV Type | Description |
| --- | --- |
| prov:SoftwareAgent | Software tool or script |
| prov:Person | Individual contributor |
| prov:Organization | Institutional contributor |
| prov:Activity | Process applied to produce, transform, or derive data |

These elements can be reused across steps, variables, and datasets to support consistent provenance graphs.
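
One way to achieve this reuse is the standard JSON-LD pattern of defining a node once with an @id and referring to it elsewhere by that @id. The Python sketch below assumes this pattern. Whether FAIR² prescribes a particular identifier scheme is not stated here, so the "#colabfold" identifier, the second step, and the helper names are hypothetical.

```python
import json

# Define the agent once, with a local identifier (hypothetical value).
agent = {
    "@id": "#colabfold",
    "@type": "prov:SoftwareAgent",
    "name": "ColabFold v1.5",
    "identifier": "https://github.com/sokrypton/ColabFold",
}


def ref(node):
    """Return a JSON-LD node reference that points at an already defined node."""
    return {"@id": node["@id"]}


steps = [
    {
        "@type": "HowToStep",
        "text": "Run AlphaFold prediction using ColabFold pipeline.",
        "prov:wasAssociatedWith": ref(agent),
    },
    {
        "@type": "HowToStep",
        "text": "Hypothetical follow-up step run with the same tool.",
        "prov:wasAssociatedWith": ref(agent),
    },
]

graph = {"@graph": [agent, *steps]}


def resolve(reference, graph):
    """Look up the full node that a reference points to."""
    for node in graph["@graph"]:
        if node.get("@id") == reference["@id"]:
            return node
    raise KeyError(reference["@id"])


print(json.dumps(graph, indent=2))
```

Both steps point at the same agent node, so a change to the agent's name or version is made in one place.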


4. SHACL Validation Rules (Provenance)

To ensure consistency and validity, FAIR² SHACL includes these optional rules:

  • schema:HowToStep MAY include prov:wasAssociatedWith, prov:used, and schema:citation.
  • schema:PropertyValue SHOULD include prov:wasDerivedFrom or prov:wasGeneratedBy if the variable's name suggests derivation.
  • All prov:Activity entities MUST include name and SHOULD include description.
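
The rules above can be approximated in plain Python for a quick pre-check before running a SHACL validator. This sketch is not the FAIR² SHACL implementation: the function names are hypothetical, and the name prefixes used to guess that a variable is derived are an assumption made only for illustration.

```python
# Assumed heuristic for "name suggests derivation"; not defined by FAIR².
DERIVED_HINTS = ("scaled_", "normalized_", "derived_")


def iter_nodes(value):
    """Yield every dict nested anywhere inside a JSON-like structure."""
    if isinstance(value, dict):
        yield value
        for child in value.values():
            yield from iter_nodes(child)
    elif isinstance(value, list):
        for item in value:
            yield from iter_nodes(item)


def has_type(node, *names):
    """True if the node's @type (string or list) matches any of the given names."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return any(t in names for t in types)


def check_provenance(record):
    """Return (errors, warnings): MUST violations are errors, SHOULD violations warnings."""
    errors, warnings = [], []
    for node in iter_nodes(record):
        if has_type(node, "prov:Activity"):
            if not node.get("name"):
                errors.append("prov:Activity is missing name")
            if not node.get("description"):
                warnings.append("prov:Activity has no description")
        if has_type(node, "PropertyValue", "schema:PropertyValue"):
            name = str(node.get("name", ""))
            has_origin = "prov:wasDerivedFrom" in node or "prov:wasGeneratedBy" in node
            if name.startswith(DERIVED_HINTS) and not has_origin:
                warnings.append(f"{name} looks derived but has no provenance link")
    return errors, warnings


record = {
    "@type": "PropertyValue",
    "name": "scaled_binding_affinity",
    "prov:wasGeneratedBy": {"@type": "prov:Activity", "description": "Z-score"},
}
errors, warnings = check_provenance(record)
```

Here the activity lacks a name, so the check reports one error and no warnings. HowToStep needs no check because its provenance properties are all MAY-level.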

5. Citation Formats

  • All citations should use persistent identifiers where possible (doi, url, or identifier fields).
  • schema:citation accepts:
      • ScholarlyArticle
      • CreativeWork
      • Dataset
      • SoftwareSourceCode
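
A citation object can be screened against these two points with a few lines of Python. The sketch below is illustrative only: the function name citation_issues is hypothetical, and stripping an optional "schema:" prefix from the type is an assumption about how types may be written.

```python
ACCEPTED_TYPES = {"ScholarlyArticle", "CreativeWork", "Dataset", "SoftwareSourceCode"}
ID_FIELDS = ("doi", "url", "identifier")


def citation_issues(citation):
    """Return a list of human-readable problems found in a schema:citation dict."""
    issues = []

    # Tolerate both "Dataset" and "schema:Dataset" spellings (assumption).
    ctype = str(citation.get("@type", ""))
    if ctype.startswith("schema:"):
        ctype = ctype[len("schema:"):]
    if ctype not in ACCEPTED_TYPES:
        issues.append(f"unsupported citation type: {ctype or '(none)'}")

    # At least one of the identifier-bearing fields should be non-empty.
    if not any(citation.get(field) for field in ID_FIELDS):
        issues.append("no persistent identifier (doi, url, or identifier)")
    return issues


good = {
    "@type": "CreativeWork",
    "name": "Normalized binding data",
    "identifier": "https://doi.org/10.5281/zenodo.1234567",
}
bad = {"@type": "WebPage", "name": "Lab notes"}

print(citation_issues(good))
print(citation_issues(bad))
```

The check reports a missing identifier as an issue rather than rejecting the record, in line with the "where possible" wording above.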

By explicitly modeling provenance and citations, FAIR² enhances transparency, encourages responsible reuse, and enables reproducible AI workflows.