handecelikkanat
@benjelloun cc. @wumpus Sharing a draft zip file as a follow-up to https://github.com/mlcommons/croissant/issues/961:

[CCF_crawl_croissants_and_provenance_mockup.zip](https://github.com/user-attachments/files/23431479/CCF_crawl_croissants_and_provenance_mockup.zip)

**Zip file includes:**

- **117 croissant drafts**, one for each of our crawls.
- **1 mockup example...
Here is an extended croissant draft for CCF crawls. Please give feedback. @benjelloun @wumpus FYI.

### What we don't have syntax for atm, and related issues:

- Lineage: https://github.com/mlcommons/croissant/issues/738
-...
cc-crawl-statistics sometimes reports a host count that is one higher than the actual number. This behavior is sporadic and doesn't always happen.

Example: In `domains-top-500.csv` for `CC-MAIN-2025-30`:

| domain | actual host...