Draft zip proposal for extended CCF crawl croissants + Provenance mockup
@benjelloun cc. @wumpus
Sharing a draft zip file as a follow-up to https://github.com/mlcommons/croissant/issues/961
CCF_crawl_croissants_and_provenance_mockup.zip
Zip file includes:
- 117 croissant drafts, one for each of our crawls.
- 1 mockup example for provenance citation to our crawls
  - This kind of hierarchy doesn't exist in our crawls, so we won't actually have this file in CCF; it is a mockup for datasets that refer to CCF.
We would like feedback especially on:
- How we are using provenance
  - e.g., I haven't used ids because these are not referred to within the same croissant
  - but is it valid/important to use ids to refer to other croissants?
- How we use "distribution" with FileObjects and FileSets
  - They include a bunch of FileObjects that act as manifest files, listing the paths of the files in the related FileSet
    - e.g., the warc.paths.gz FileObject
  - And one additional FileObject example that keeps the data itself, so just a FileObject
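To make the manifest idea concrete, here is a minimal sketch of how a consumer might expand a gzip-compressed manifest FileObject (such as warc.paths.gz) into the file URLs of the related FileSet. It assumes the manifest holds one relative path per line. The sample paths are hypothetical placeholders, and the helper names `read_manifest` and `to_urls` are made up for illustration; they are not part of Croissant or any Common Crawl tooling.

```python
import gzip
import io

# Host taken from the crawl URL quoted in this thread.
BASE_URL = "https://data.commoncrawl.org/"


def read_manifest(raw: bytes) -> list[str]:
    """Return the non-empty lines of a gzip-compressed manifest."""
    with gzip.open(io.BytesIO(raw), mode="rt", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def to_urls(paths: list[str], base_url: str = BASE_URL) -> list[str]:
    """Turn relative manifest paths into absolute URLs."""
    return [base_url + path.lstrip("/") for path in paths]


# Hypothetical manifest content, built in memory so the sketch runs offline.
sample_paths = [
    "crawl-data/EXAMPLE-CRAWL/segments/0001/warc/example-00000.warc.gz",
    "crawl-data/EXAMPLE-CRAWL/segments/0001/warc/example-00001.warc.gz",
]
sample_manifest = gzip.compress(("\n".join(sample_paths) + "\n").encode("utf-8"))

manifest_paths = read_manifest(sample_manifest)
manifest_urls = to_urls(manifest_paths)
```

In this reading, the FileObject is the manifest itself and the FileSet is the collection of files that `manifest_urls` would enumerate.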
Please let us know if anything looks awry!
Changes since https://github.com/mlcommons/croissant/issues/961:
- New FileObject added: {crawl_id}.domains-top-1000 (crawls > 2012)
- Switched to using MAJOR.MINOR.PATCH also for build version: 1.0.0+1.0.0
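As a rough illustration of the version format, the sketch below splits a string like 1.0.0+1.0.0 into the MAJOR.MINOR.PATCH version and the MAJOR.MINOR.PATCH build version that follows the "+". The regular expression and the `split_version` helper are assumptions made for this example, not something the croissant drafts or Croissant tooling define.

```python
import re

# Assumed shape: MAJOR.MINOR.PATCH, a "+", then a MAJOR.MINOR.PATCH build version.
VERSION_WITH_BUILD = re.compile(r"^(\d+)\.(\d+)\.(\d+)\+(\d+)\.(\d+)\.(\d+)$")


def split_version(version: str) -> tuple[tuple[int, ...], tuple[int, ...]]:
    """Split 'X.Y.Z+A.B.C' into ((X, Y, Z), (A, B, C))."""
    match = VERSION_WITH_BUILD.match(version)
    if match is None:
        raise ValueError(f"not in MAJOR.MINOR.PATCH+MAJOR.MINOR.PATCH form: {version!r}")
    numbers = tuple(int(part) for part in match.groups())
    return numbers[:3], numbers[3:]


release, build = split_version("1.0.0+1.0.0")
```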
@benjelloun can this be reviewed before 1.1 ships?
The manifest representation looks good to me.
Small suggestions for the provenance mock-up:
- the Dataset should have types: sc:Dataset and prov:Entity, since we're using prov properties on it.
- the values of prov:wasDerivedFrom should also be typed as sc:Dataset.
- I think it's fine to use URLs of datasets to refer to external datasets. If the dataset is a top-level object there, then it's not necessary to specify an id.
- Ideally the pages pointed by those URLs should contain embedded Croissant metadata.
Thank you @benjelloun !
I updated:
- Croissants' types as: "@type": ["sc:Dataset", "prov:Entity"]
  - Question: As far as I understand, both the referencer (top-level croissant) and the referenced (second-level croissants) need to be typed as prov:Entity, so both sides of the relation should be prov:Entity. Is this correct?
- and the reported types in prov:wasDerivedFrom as:
"prov:wasDerivedFrom": [
{
"@type": ["sc:Dataset", "prov:Entity"],
"@url": "https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-47/index.html"
},
...
]
If the dataset is a top-level object there, then it's not necessary to specify an id.
- I think this will be the case, but I will check, thank you.
Ideally the pages pointed by those URLs should contain embedded Croissant metadata.
- I think this will also be the case, yes. Adding to check.
@wumpus FYI.
That's correct, but we generally don't fetch the referenced croissant when validating the referencer, so it's probably okay to omit the prov:Entity type on the referenced dataset.
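The typing rules discussed in this thread can be sketched as a small check. This is only an illustration under stated assumptions, not how the Croissant validator works: `check_provenance` and `as_type_set` are hypothetical helpers, the dataset name is a placeholder, and the "@url" key is simply copied from the snippet above without being verified against the Croissant spec. Following the last reply, the referenced datasets are only required to be typed as sc:Dataset, with prov:Entity treated as optional on that side.

```python
REQUIRED_DATASET_TYPES = {"sc:Dataset", "prov:Entity"}


def as_type_set(node: dict) -> set[str]:
    """Normalise "@type", which may be a single string or a list of strings."""
    value = node.get("@type", [])
    return {value} if isinstance(value, str) else set(value)


def check_provenance(dataset: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    missing = REQUIRED_DATASET_TYPES - as_type_set(dataset)
    if missing:
        problems.append(f"dataset is missing types: {sorted(missing)}")
    for index, source in enumerate(dataset.get("prov:wasDerivedFrom", [])):
        # prov:Entity on the referenced dataset is treated as optional here.
        if "sc:Dataset" not in as_type_set(source):
            problems.append(f"prov:wasDerivedFrom[{index}] is not typed as sc:Dataset")
    return problems


mockup = {
    "@type": ["sc:Dataset", "prov:Entity"],
    "name": "example-derived-dataset",  # hypothetical placeholder name
    "prov:wasDerivedFrom": [
        {
            "@type": "sc:Dataset",
            # Key and URL copied from the snippet in this thread.
            "@url": "https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-47/index.html",
        }
    ],
}

problems = check_provenance(mockup)
```

A mock-up that drops prov:Entity from the top-level dataset, or leaves an entry of prov:wasDerivedFrom untyped, would produce one message per issue.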