Draft zip proposal for extended CCF crawl croissants + Provenance mockup

Open handecelikkanat opened this issue 1 month ago • 4 comments

@benjelloun cc. @wumpus

Sharing a draft zip file as a follow-up to https://github.com/mlcommons/croissant/issues/961

CCF_crawl_croissants_and_provenance_mockup.zip

Zip file includes:

  • 117 croissant drafts, one for each of our crawls.
  • 1 mockup example for provenance citation to our crawls
    • This kind of hierarchy doesn't exist in our crawls, so we won't actually have this file in CCF; it is a mockup for datasets that refer to CCF.

We would like feedback especially on:

  • How we are using provenance

    • e.g., I haven't used ids because these are not referred to within the same croissant
    • but is it valid/important to use ids to refer to other croissants?
  • How we use "distribution" with FileObjects and FileSets

    • They include a bunch of FileObjects that act as manifest files, listing the paths to the files in the related FileSet
      • e.g., warc.paths.gz FileObject pointing to
    • And one additional FileObject example that holds the data itself, so it is just a FileObject

Please let us know if anything looks awry!

Changes since https://github.com/mlcommons/croissant/issues/961:

  • New FileObject added: {crawl_id}.domains-top-1000 (crawls > 2012)
  • Switched to using MAJOR.MINOR.PATCH also for build version: 1.0.0+1.0.0

handecelikkanat avatar Nov 08 '25 11:11 handecelikkanat

@benjelloun can this be reviewed before 1.1 ships?

wumpus avatar Nov 20 '25 17:11 wumpus

The manifest representation looks good to me.

Small suggestions for the provenance mock-up:

  • the Dataset should have types: sc:Dataset and prov:Entity, since we're using prov properties on it.
  • the values of prov:wasDerivedFrom should also be typed as sc:Dataset.
  • I think it's fine to use URLs of datasets to refer to external datasets. If the dataset is a top-level object there, then it's not necessary to specify an id.
  • Ideally the pages pointed to by those URLs should contain embedded Croissant metadata.

benjelloun avatar Nov 21 '25 14:11 benjelloun

Thank you @benjelloun !

I updated:

  • Croissants' types as: "@type": ["sc:Dataset", "prov:Entity"]

    • Question: As far as I understand, both the referencer (top-level croissant) and the referenced (second-level croissants) need to be typed as prov:Entity, so both sides of the relation are prov:Entity. Is this correct?
  • and the reported types in prov:wasDerivedFrom as:

  "prov:wasDerivedFrom": [
    {
      "@type": ["sc:Dataset", "prov:Entity"],
      "@url": "https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-47/index.html"
    },
    ...
  ]
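The typing suggestions discussed in this thread can be checked locally on the referencing croissant alone, without fetching any referenced dataset. A minimal sketch, assuming the croissant has already been loaded into a Python dict; `check_provenance_types` is a hypothetical helper, not an mlcroissant API, and the sample entry simply copies the keys and URL from the snippet in this comment:

```python
REQUIRED_TYPES = {"sc:Dataset", "prov:Entity"}


def as_type_set(node):
    """Return the node's @type as a set, whether it is a string or a list."""
    types = node.get("@type", [])
    return {types} if isinstance(types, str) else set(types)


def check_provenance_types(dataset):
    """Return a list of problems found in the dataset's provenance typing."""
    problems = []
    # The referencing dataset uses prov properties, so it needs both types.
    missing = REQUIRED_TYPES - as_type_set(dataset)
    if missing:
        problems.append(f"dataset is missing types: {sorted(missing)}")
    derived = dataset.get("prov:wasDerivedFrom", [])
    if isinstance(derived, dict):
        derived = [derived]
    # Each referenced dataset should at least be typed as sc:Dataset.
    for index, source in enumerate(derived):
        if "sc:Dataset" not in as_type_set(source):
            problems.append(f"prov:wasDerivedFrom[{index}] is not typed as sc:Dataset")
    return problems


mockup = {
    "@type": ["sc:Dataset", "prov:Entity"],
    "prov:wasDerivedFrom": [
        {
            "@type": ["sc:Dataset", "prov:Entity"],
            # Key and URL copied from the snippet in this thread.
            "@url": "https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-47/index.html",
        },
    ],
}

print(check_provenance_types(mockup))
```

The check only inspects types declared inline in the referencing file; it says nothing about whether the referenced URLs actually serve Croissant metadata.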

If the dataset is a top-level object there, then it's not necessary to specify an id.

  • I think this will be the case, but I will check, thank you.

Ideally the pages pointed to by those URLs should contain embedded Croissant metadata.

  • I think this will also be the case, yes. I will check this too.

@wumpus FYI.

handecelikkanat avatar Nov 23 '25 10:11 handecelikkanat

  • Question: As far as I understand, both the referencer (top-level croissant) and the referenced (second-level croissants) need to be typed as prov:Entity, so both sides of the relation are prov:Entity. Is this correct?

That's correct, but we generally don't fetch the referenced croissant when validating the referencer, so it's probably okay to omit the prov:Entity type on the referenced dataset.

benjelloun avatar Nov 24 '25 13:11 benjelloun