Lineage / provenance representation
Define a mechanism to describe lineage of data / provenance information.
This mechanism should support multiple levels of granularity:
- Dataset level
- RecordSet level
- Field level
- Row / data value level
We would ideally reuse existing vocabularies such as PROV-O.
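As a rough illustration of reuse, a dataset-level lineage statement with PROV-O might look like this (a hypothetical sketch; the dataset names and IDs are made up, only the `prov:` terms come from PROV-O):

```json
{
  "@context": {
    "@vocab": "https://schema.org/",
    "prov": "http://www.w3.org/ns/prov#"
  },
  "@type": "Dataset",
  "name": "cleaned-corpus",
  "prov:wasDerivedFrom": {
    "@type": "prov:Entity",
    "@id": "https://example.org/datasets/raw-corpus"
  }
}
```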
We don't do this today, but it's related to a feature I'd love to see on Kaggle: datasets linking to each other (or to models), showing which datasets are cleaned/etc. versions of others, or which models were trained on which datasets. If we have a separate work stream to address this issue, please add me to any meetings/docs/etc.! I don't have the capacity to lead the effort right now, but I'm very interested in participating.
Last Wednesday in the Croissant WG meeting I pitched exactly this idea -- I want to start with a croissant being able to refer to other croissants. For example, I would make croissants for all 105 Common Crawl crawls (see #762), and people publishing cleaned ML training sets would write croissants that pointed at CC crawl croissants.
Apparently DDI-CDI has this kind of thing built in. That was the topic of last week's presentation at the Croissant meeting, given by Arofan Gregory; slides are at https://docs.google.com/presentation/d/1-9sg1X8siHZCa4Zh_7vIOzreEz0-3d-d/edit?usp=sharing&ouid=104300626440986451054&rtpof=true&sd=true
This is an early proposal of the Croissant RAI subgroup for capturing data lineage with Croissant to gather insights from the community. The aim is to provide a way to capture and document the lineage of datasets.
Motivation: Datasets are frequently crafted by remixing data from various sources rather than being created from scratch. However, it is often unclear how the final dataset relates to its original components, leading to issues such as Frankenstein datasets where training and validation data might inadvertently overlap. Capturing the lineage (i.e., the origin and history of data transformations) is essential to ensure reproducibility, transparency, and trust in datasets.
Problem statement: Despite the importance of understanding a dataset’s lineage, it is currently not captured. This gap can obscure:
- The relationship between the final dataset and its original sources.
- The specific modifications (e.g., filtering, deduplication, augmentation) applied during dataset creation.
- Potential risks, such as data contamination across splits, that may affect model validation and overall quality.
Our Proposed Mechanism:
Build on the existing Croissant framework, extending it to include lineage-specific attributes inspired by the PROV Ontology (PROV-O).
The aim is to represent lineage across multiple granularity levels:
- Dataset Level: Record overall provenance details such as original sources, creator(s), major transformations (e.g., filtering, deduplication), and licensing history.
- Resource Level: Capture lineage details for individual files or directories, including their extraction sources and any subsequent processing steps.
- RecordSet Level: For tabular data, document column-level lineage to track how specific fields or records have evolved.
- Annotation Level: Record changes in annotations (e.g., updates in labels, reannotations due to disagreements) to provide granular insight into how the dataset’s interpretations have changed over time.
To support synthetic data practices: Include metadata fields to capture details for synthetic data generation (e.g., generator model, prompting methodology, and source context) to differentiate between human-generated and machine-generated data.
Examples: to be added.
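As a first sketch of what the synthetic-data provenance block could look like (all non-`prov:` property names below are illustrative placeholders, not part of any spec):

```json
"provenance": {
  "prov:wasGeneratedBy": {
    "@type": "prov:Activity",
    "name": "synthetic data generation",
    "prov:wasAssociatedWith": {
      "@type": "prov:SoftwareAgent",
      "name": "generator-model-x",
      "description": "model used to generate the records"
    },
    "promptingMethodology": "few-shot, seed examples drawn from the human-written split",
    "sourceContext": "seed passages sampled from the source corpus"
  }
}
```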
We are ready to try this feature out here at Common Crawl.
The situation is that the FineWeb 🥂 datasets on Hugging Face 🤗 use 96 Common Crawl crawls, so we'd need syntax to somehow list 96 croissants 🥐 inside the FineWeb 🥂 croissant.
In terms of this proposal, I guess it would be at the dataset level?
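If it does land at the dataset level, one way the FineWeb croissant could point at the crawl croissants might be an array of `prov:Entity` references (hypothetical IDs; only two of the 96 crawls shown, the rest would follow the same pattern):

```json
"provenance": {
  "prov:wasDerivedFrom": [
    {
      "@type": "prov:Entity",
      "@id": "https://example.org/croissants/cc-main-2013-20.json"
    },
    {
      "@type": "prov:Entity",
      "@id": "https://example.org/croissants/cc-main-2024-10.json"
    }
  ]
}
```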
cc @handecelikkanat
Hi, a quick update on the latest progress.
The proposal can apply not only at the dataset level but also at any level of the Structure Layer (e.g., recordSet and Field/rows).
For example, in the last comments on issue #885, I shared an example describing the provenance of the WildChat-1M dataset at the dataset level. Here, I’d like to show a different example that uses PROV-O inside the Structure Layer, to test how this works for FineWeb.
Take the BigBench dataset as an illustration: it is composed of three datasets (MMLU, OpenBookQA, and ARC). In the Croissant description, we define these three component datasets as entities. Then, in the Structure Layer, we declare a field that must be populated with one of those entities.
This way, provenance can be tracked at the row level: for instance, row 1000 could be derived from MMLU, while row 1001 could be derived from ARC. Note that this proposal uses the data-annotation level that we proposed here: https://github.com/mlcommons/croissant/issues/737
```json
{
  "@context": {
    "prov": "http://www.w3.org/ns/prov#"
  },
  "@type": "sc:Dataset",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "name": "BigBench Dataset",
  "provenance": {
    "prov:wasDerivedFrom": [
      {
        "@type": "prov:Entity",
        "@id": "MMLU",
        "@url": "urn:dataset:MMLU"
      },
      {
        "@type": "prov:Entity",
        "@id": "ARC",
        "@url": "urn:dataset:ARC"
      },
      {
        "@type": "prov:Entity",
        "@id": "OpenBookQA",
        "@url": "urn:dataset:OpenBookQA"
      }
    ],
    "prov:wasGeneratedBy": [
      {
        "@type": "prov:Activity",
        "@id": "activity:dataExtraction"
      }
    ],
    "prov:wasAssociatedWith": [
      {
        "@type": "prov:Agent",
        "name": "BIG-bench working group"
      }
    ]
  },
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id": "bigbench",
      "field": [...],
      "annotations": {
        "@type": "cr:Field",
        "@id": "bigbench/wasDerivedFrom",
        "equivalentProperty": ["prov:wasDerivedFrom"],
        "dataType": "prov:Entity"
      }
    }
  ]
}
```
Also, in the WildChat-1M example on issue #885 you can see how we propose to represent Agents and Software Agents (this could be interesting for representing the software agents involved in data crawling).
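For the crawling case, a `prov:SoftwareAgent` entry could be sketched like this (my own illustrative sketch, not copied from #885; `CCBot` is Common Crawl's crawler, the rest of the names are assumptions):

```json
"prov:wasAssociatedWith": [
  {
    "@type": "prov:Agent",
    "name": "Common Crawl Foundation"
  },
  {
    "@type": "prov:SoftwareAgent",
    "name": "CCBot",
    "description": "web crawler that collected the raw pages"
  }
]
```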