Improve support for manifest files (Common crawl example)

Open benjelloun opened this issue 5 months ago • 8 comments

From email thread:

I think we can represent CC data by treating the paths.gz as a manifest that "contains" the fileset. This is described vaguely in the 1.0 spec:

"A FileSet is a set of files located in a container, which can be an archive FileObject or a "manifest" file."

We should improve the description in the 1.1 spec and give an example.

You can extend the description you have created by adding a "containedIn" relationship between the FileObject and the corresponding FileSet:

"distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "warc.paths.gz",
      ...
    },
    {
      "@type": "cr:FileSet",
      "@id": "warc_paths",
      "containedIn": "warc.paths.gz",
      ...
    },
    {
      "@type": "cr:FileObject",
      "@id": "wat.paths.gz",
      ...
    },
    {
      "@type": "cr:FileSet",
      "@id": "wat_paths",
      "containedIn": "wat.paths.gz",
      ...
    },
    ...
  ]

We are stretching things a bit by having the container be both a gz archive and a manifest, but that seems reasonable to me. We could add some syntax to make things more explicit if needed.
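
To make the "archive and manifest at once" idea concrete, here is a minimal Python sketch of what a consumer would have to do with such a file: decompress it, then treat each line as one entry of the FileSet. This is only an illustration, not how mlcroissant handles it; the function name and the sample paths are made up.

```python
import gzip
import io

def read_manifest_lines(gz_bytes: bytes) -> list[str]:
    """Decompress a gzip-compressed manifest and return one path per non-empty line."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]

# In-memory stand-in for warc.paths.gz (the paths are invented for illustration).
sample = gzip.compress(
    b"crawl-data/EXAMPLE/warc/a.warc.gz\n"
    b"crawl-data/EXAMPLE/warc/b.warc.gz\n"
)
paths = read_manifest_lines(sample)
print(paths)
```

The gz layer is handled first and the manifest layer second, which is the order a more explicit syntax would need to capture.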

benjelloun avatar Jul 29 '25 09:07 benjelloun

One thing that's missing is that our filenames in warc.paths.gz are relative to the bucket. There are 2 possible prefixes: https://data.commoncrawl.org/ and s3://commoncrawl/

This is a pretty common pattern for data that will be processed by Spark.

How can these prefixes be represented?

cc @handecelikkanat

wumpus avatar Aug 07 '25 00:08 wumpus

Also see https://github.com/mlcommons/croissant/issues/961

benjelloun avatar Oct 16 '25 15:10 benjelloun

I would like to add a more explicit mechanism to represent what is happening when a manifest file is provided in Croissant. I outline a proposal in the draft spec (See https://github.com/mlcommons/croissant/pull/968/).

Here are the main ideas:

  • In FileSet / FileObject, allow the containedIn property to point to a DataSource, which provides more structure than a direct link to a FileSet / FileObject.
  • Specify the container in the fileSet / fileObject property of the DataSource.
  • Use the transform property of the DataSource to describe the operations performed on the container:
    • Use unArchive when the container is an archive (this can be true by default for archive MIME types).
    • Use readLines when the container is a manifest file.

Here is a full example:

{
  "@type": "cr:FileObject",
  "@id": "manifest.zip",
  "contentUrl": "http://example.com/manifest.zip",
  "encodingFormat": "application/zip"
},
{
  "@type": "cr:FileSet",
  "@id": "my-files",
  "containedIn": {
    "fileObject": { "@id": "manifest.zip" },
    "transform": { "unArchive": true, "readLines": true }
  }
}
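
As a rough sketch of how a loader might interpret the two transforms in this example (an assumption about the intended semantics, not the actual implementation), unArchive would unpack the zip and readLines would then turn every line of each unpacked file into one entry of the FileSet. The function name and sample file names below are hypothetical.

```python
import io
import zipfile

def list_files_from_zipped_manifest(zip_bytes: bytes) -> list[str]:
    """Apply 'unArchive' then 'readLines': unpack the zip, then read each member line by line."""
    files: list[str] = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            text = archive.read(name).decode("utf-8")
            files.extend(line.strip() for line in text.splitlines() if line.strip())
    return files

# In-memory stand-in for manifest.zip (contents invented for illustration).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("manifest.txt", "data/part-0001.csv\ndata/part-0002.csv\n")

files = list_files_from_zipped_manifest(buf.getvalue())
print(files)
```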

benjelloun avatar Nov 21 '25 17:11 benjelloun

These look very good, very clear, thank you @benjelloun! I'll update the CCF Croissant draft accordingly.

Following up on the size of the FileSet: how do you think it is best represented?

The use case we have is a set of files whose total size is the important information, e.g., the manifest FileObject points to a FileSet, which is 80GB in total (distributed over many files).

  • Is this important to represent? (I think so, or no?)
  • If yes: Where?
    • In the manifest FileObject?
    • Or in FileSet? (I think the current FileSet doesn't support this, correct?)

@wumpus FYI.

handecelikkanat avatar Nov 23 '25 10:11 handecelikkanat

I think this information should go on the FileSet. At the moment FileSet extends from sc:Intangible so it's missing the right properties. It seems reasonable to make it a subclass of sc:CreativeWork, like FileObject, so that you can use size on it. What do you think?

Best, Omar

benjelloun avatar Nov 24 '25 14:11 benjelloun

@benjelloun Thank you for the swift reply!

On size property for FileSets

It makes sense to me that both FileSet and FileObject inherit from the same class, and it solves our problem. Thank you very much!

Let's do this! Could we add it to 1.1 already?

Trying to understand how DataSource plays in the manifest example

  • In the example I don't see the DataSource type being used.
    • (In the 1.1 draft, DataSource is a same-level object with FileSet and FileObject, right?)
  • Should it be used somewhere? Specifically, the following statements sound to me like it should be used, e.g. to encapsulate the FileSet and FileObject 🤔
    • "In FileSet / FileObject, allow the containedIn property to point to a DataSource, which provides more structure..."
    • "specify the container in the fileSet / fileObject property of the DataSource"

Or maybe it is that now FileSet and FileObject are used like DataSource, and I misunderstand :)

handecelikkanat avatar Nov 24 '25 16:11 handecelikkanat

I'm actually making FileSet and FileObject both extend from sc:DataDownload, which also fixes another problem: making the "distribution" field valid for schema.org (#725).

In my example above, the type of the containedIn property on the FileSet is DataSource. This is what allows us to provide values for fileObject and transform. Maybe I should have written it more clearly as:

{
  "@type": "cr:FileObject",
  "@id": "manifest.zip",
  "contentUrl": "http://example.com/manifest.zip",
  "encodingFormat": "application/zip"
},
{
  "@type": "cr:FileSet",
  "@id": "my-files",
  "containedIn": {
    "@type": "cr:DataSource",
    "fileObject": { "@id": "manifest.zip" },
    "transform": { "unArchive": true, "readLines": true }
  }
}

In general DataSource is not at the same level as FileObject or FileSet. It is used inside the Field of a RecordSet to point to its data source via the source property. We are here extending its use to the containedIn property of FileSet (and FileObject if that makes sense.)

This also allows you to do things like extract a list of files from a column of a CSV file for instance.
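
For instance, the CSV case amounts to something like the following Python sketch, where the column name, the sample rows and the function name are all made up for illustration:

```python
import csv
import io

def files_from_csv_column(csv_text: str, column: str) -> list[str]:
    """Return the non-empty values of one CSV column, treated as a list of file paths."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader if row[column]]

# Hypothetical CSV whose "path" column lists the files of the FileSet.
sample_csv = "path,label\nimages/cat.jpg,cat\nimages/dog.jpg,dog\n"
file_list = files_from_csv_column(sample_csv, "path")
print(file_list)
```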

Does this make sense to you?

Best, Omar

benjelloun avatar Nov 24 '25 16:11 benjelloun

@benjelloun Thank you very much, it is very clear now :)

I really like the DataSource inheritance idea, great idea :) 👏 Very elegant!

I wonder if it can solve the issue @wumpus mentioned above, using transform:

"One thing that's missing is that our filenames in warc.paths.gz are relative to the bucket. There are 2 possible prefixes: https://data.commoncrawl.org/ and s3://commoncrawl/

This is a pretty common pattern for data that will be processed by Spark.

How can these prefixes be represented?"

To expand on this:

  • Issue: Our paths as described in the manifest files have only partial prefixes, which are relative to the bucket. These prefixes need to be expanded on read.
  • Solution?: Can we use a fixed mapping from partial prefix to full prefix as a transform property?
  • Further issue: A further problem is that we have 2 potential prefix expansions, one for the HTTP protocol and one for S3. I'm not sure if we can represent an EITHER/OR relation? 🤔
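
To illustrate what the reader side would need to do, here is a minimal Python sketch of the expansion, with the either/or choice left to the consumer. The two prefixes are the ones quoted above; the scheme keys, the function name and the sample path are hypothetical, and nothing here is existing Croissant syntax.

```python
# Prefixes quoted in this thread; the scheme keys ("https", "s3") are hypothetical labels.
PREFIXES = {
    "https": "https://data.commoncrawl.org/",
    "s3": "s3://commoncrawl/",
}

def expand_paths(relative_paths: list[str], scheme: str = "https") -> list[str]:
    """Prepend the prefix for the chosen scheme to each bucket-relative path."""
    prefix = PREFIXES[scheme]  # raises KeyError for an unknown scheme
    return [prefix + path.lstrip("/") for path in relative_paths]

# Made-up relative path, as it might appear on one line of a manifest.
relative = ["crawl-data/EXAMPLE/warc/a.warc.gz"]
http_urls = expand_paths(relative, "https")
s3_urls = expand_paths(relative, "s3")
print(http_urls, s3_urls)
```

In this sketch the manifest stays protocol-agnostic and the alternatives live in one small mapping, which is roughly what a transform-based representation would have to encode.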

handecelikkanat avatar Nov 25 '25 10:11 handecelikkanat