croissant Draft zip proposal for extended CCF crawl croissants + Provenance mockup

@benjelloun cc. @wumpus

Sharing a draft zip file as a follow-up to https://github.com/mlcommons/croissant/issues/961

CCF_crawl_croissants_and_provenance_mockup.zip

Zip file includes:

117 croissant drafts, one for each of our crawls.
1 mockup example for provenance citation to our crawls
- This kind of hierarchy doesn't exist in our crawls, so we won't actually have this file in CCF; it is a mockup for datasets that refer to CCF.

We would like feedback especially on:

How we are using provenance
- e.g., I haven't used ids because these are not referred to within the same croissant
- but is it valid/important to use ids to refer to other croissants?
How we use "distribution" with FileObjects and FileSets
- They include several FileObjects that act as manifest files, listing the paths of the files included in the related FileSet
  - e.g., warc.paths.gz FileObject pointing to
- And one additional FileObject example that holds the data itself, so it is just a FileObject
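
To make the manifest idea concrete, here is a minimal Python sketch of how a consumer might expand a manifest FileObject into the file URLs of the related FileSet. It assumes the manifest is a gzip-compressed text file with one relative path per line, resolved against https://data.commoncrawl.org/. The function name `expand_manifest` and the sample paths are hypothetical and are not taken from the drafts in the zip.

```python
import gzip

# Assumption: manifests list one relative path per line, resolved against
# this public base URL.
BASE_URL = "https://data.commoncrawl.org/"


def expand_manifest(gz_bytes: bytes, base_url: str = BASE_URL) -> list[str]:
    """Return the full URLs listed in a gzip-compressed path manifest."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    # Skip blank lines so a trailing newline does not yield an empty entry.
    return [base_url + line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical two-line manifest standing in for a real warc.paths.gz.
sample = gzip.compress(
    b"crawl-data/CC-MAIN-2025-47/segments/0001/warc/a.warc.gz\n"
    b"crawl-data/CC-MAIN-2025-47/segments/0001/warc/b.warc.gz\n"
)
urls = expand_manifest(sample)
```

Each returned URL would correspond to one member of the FileSet that the manifest describes.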

Please let us know if anything looks awry!

Changes since https://github.com/mlcommons/croissant/issues/961:

New FileObject added: {crawl_id}.domains-top-1000 (for crawls after 2012)
Switched to MAJOR.MINOR.PATCH for the build version as well, e.g. 1.0.0+1.0.0
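
For readers consuming the version field, a small Python sketch of splitting such a string into its two parts follows. It assumes the format is always two MAJOR.MINOR.PATCH triples joined by "+", as in the example above. `parse_version` is a hypothetical helper, not part of any Croissant tooling.

```python
import re

# Assumption: the version string is CORE+BUILD, where both halves are
# MAJOR.MINOR.PATCH triples, as in "1.0.0+1.0.0".
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)\+(\d+)\.(\d+)\.(\d+)$")


def parse_version(version: str):
    """Split 'X.Y.Z+A.B.C' into ((X, Y, Z), (A, B, C)) as integer tuples."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"not a CORE+BUILD version: {version!r}")
    numbers = tuple(int(group) for group in match.groups())
    return numbers[:3], numbers[3:]


core, build = parse_version("1.0.0+1.0.0")
```

Note that under Semantic Versioning 2.0.0 the part after "+" is build metadata and is ignored when comparing precedence, so tools that sort by version may treat two builds of the same core version as equal.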

Nov 08 '25 11:11 handecelikkanat

@benjelloun can this be reviewed before 1.1 ships?

Nov 20 '25 17:11 wumpus

The manifest representation looks good to me.

Small suggestions for the provenance mock-up:

The Dataset should have types sc:Dataset and prov:Entity, since we're using prov properties on it.
The values of prov:wasDerivedFrom should also be typed as sc:Dataset.
I think it's fine to use URLs of datasets to refer to external datasets. If the dataset is a top-level object there, then it's not necessary to specify an id.
Ideally, the pages pointed to by those URLs should contain embedded Croissant metadata.
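
As an illustration of the first two suggestions, here is a rough Python sketch of a local check that could be run over the mock-up before publishing. It is not the Croissant validator; `check_provenance_types` and the sample dictionary are hypothetical, and the "@url" key simply mirrors the key used in the draft discussed in this thread.

```python
def as_list(value):
    """JSON-LD allows a single value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def check_provenance_types(dataset: dict) -> list[str]:
    """Return human-readable problems; an empty list means the types look fine."""
    problems = []
    types = as_list(dataset.get("@type"))
    for required in ("sc:Dataset", "prov:Entity"):
        if required not in types:
            problems.append(f"top-level dataset is missing type {required}")
    for i, source in enumerate(as_list(dataset.get("prov:wasDerivedFrom"))):
        source_types = as_list(source.get("@type")) if isinstance(source, dict) else []
        if "sc:Dataset" not in source_types:
            problems.append(f"prov:wasDerivedFrom[{i}] is not typed as sc:Dataset")
    return problems


# Hypothetical mock-up fragment; only the fields relevant to the check are shown.
mockup = {
    "@type": ["sc:Dataset", "prov:Entity"],
    "prov:wasDerivedFrom": [
        {
            "@type": "sc:Dataset",
            "@url": "https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-47/index.html",
        }
    ],
}
problems = check_provenance_types(mockup)
```

The check accepts either a single type string or a list of types, since both forms are valid JSON-LD.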

Nov 21 '25 14:11 benjelloun

Thank you @benjelloun !

I updated:

Croissants' types as: "@type": ["sc:Dataset", "prov:Entity"]
- Question: As far as I understand both the referencer (top-level croissant) and the referenced (second-level croissants) need to be updated as prov:Entity, so both sides of the relation should be prov:Entity, is this correct?
and the reported types in prov:wasDerivedFrom as:

  "prov:wasDerivedFrom": [
    {
      "@type": ["sc:Dataset", "prov:Entity"],
      "@url": "https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-47/index.html"
    },
    ...
  ]

> If the dataset is a top-level object there, then it's not necessary to specify an id.

I think this will be the case, but I will check, thank you.

> Ideally the pages pointed by those URLs should contain embedded Croissant metadata.

I think this will also be the case, yes. Adding to check.
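
One way to run that check is sketched below in Python, using only the standard library. It assumes the Croissant metadata is embedded in the page as a `<script type="application/ld+json">` element, which is the usual way of embedding JSON-LD in HTML; whether the crawl index pages will use exactly this form is an assumption. The class and function names, and the sample HTML, are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_jsonld(html: str) -> list:
    """Return every JSON-LD document embedded in the given HTML text."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Hypothetical page standing in for a crawl's index.html.
sample_html = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "sc:Dataset", "name": "CC-MAIN-2025-47"}'
    "</script></head><body></body></html>"
)
docs = extract_jsonld(sample_html)
```

An empty result for a given page would mean that no embedded JSON-LD was found there.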

@wumpus FYI.

Nov 23 '25 10:11 handecelikkanat

> Question: As far as I understand both the referencer (top-level croissant) and the referenced (second-level croissants) need to be updated as prov:Entity, so both sides of the relation should be prov:Entity, is this correct?
>
> and the reported types in prov:wasDerivedFrom as:

That's correct, but we generally don't fetch the referenced croissant when validating the referencer, so it's probably okay to omit the prov:Entity type on the referenced dataset.

Nov 24 '25 13:11 benjelloun