Croissant 1.1

Open • pdurbin opened this issue 1 month ago • 1 comment

We've been asked if we can support Croissant 1.1 at launch, even if only in a minimal way, such as bumping the version number.

For the actual code changes I've created this parent issue in the Croissant exporter repo:

  • https://github.com/gdcc/exporter-croissant/issues/25

I've created a number of sub-issues and pull requests based on my analysis of the differences between the 1.0 and 1.1 specs. There are some backward incompatibilities:

  • https://github.com/gdcc/exporter-croissant/issues/32

We can use this issue (#12014) to add a release note snippet and any doc changes.

I'm not sure how to size this. I'll give it a 20 for now.

It's not super clear when the Croissant 1.1 spec will go live, but it might be announced at NeurIPS, which starts today.

Here's the Croissant 1.1 project board: https://github.com/orgs/mlcommons/projects/44/views/1

The current spec is 1.0 as of this writing: https://github.com/mlcommons/croissant/blob/main/docs/croissant-spec.md

The draft spec is 1.1 as of this writing: https://github.com/mlcommons/croissant/blob/main/docs/croissant-spec-draft.md

The 1.1 feature that probably interests us the most is this one (summary stats):

  • https://github.com/mlcommons/croissant/issues/640

However, nothing has been merged into the spec yet.

Generally, my plan is:

  • wait for an announcement of the 1.1 spec
  • if summary stats are included and look easy, implement them
  • otherwise, bump the version number from 1.0 to 1.1: https://github.com/gdcc/exporter-croissant/pull/31
  • probably include the backward-incompatible change: https://github.com/gdcc/exporter-croissant/pull/33
  • possibly do this cleanup: https://github.com/gdcc/exporter-croissant/pull/35
  • write it all up in the changelog
  • for the exporter itself, probably make a major version bump when releasing to reflect backward-incompatible changes
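
As a rough illustration of the "bump the version number" step: assuming the version is declared through the top-level conformsTo property (as in Croissant 1.0 metadata) and that 1.1 keeps the same URI pattern, the change would amount to something like the Python sketch below. The function name and the sample metadata are hypothetical, and this is not the exporter's actual code.

```python
import json

# Assumption: the spec version is declared via the top-level "conformsTo"
# property, and the 1.1 URI follows the same pattern as the 1.0 one.
CROISSANT_1_0 = "http://mlcommons.org/croissant/1.0"
CROISSANT_1_1 = "http://mlcommons.org/croissant/1.1"


def bump_conforms_to(metadata: dict) -> dict:
    """Return a copy of the metadata with conformsTo moved from 1.0 to 1.1."""
    updated = dict(metadata)  # shallow copy; only a top-level key changes
    if updated.get("conformsTo") == CROISSANT_1_0:
        updated["conformsTo"] = CROISSANT_1_1
    return updated


# Hypothetical, heavily trimmed metadata for illustration only.
example = {
    "@type": "sc:Dataset",
    "name": "example-dataset",
    "conformsTo": CROISSANT_1_0,
}

print(json.dumps(bump_conforms_to(example), indent=2))
```

Metadata that declares any other conformsTo value is left unchanged, so the sketch only touches documents that were already advertising 1.0.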

pdurbin avatar Dec 02 '25 19:12 pdurbin

I went to the Croissant Task Force meeting today. This isn't really recorded as such in the minutes, but here's my take:

  • The Croissant 1.1 spec is still open for tweaking until early next week. They plan to publish it around then.
  • The Croissant 1.1 spec won't be formally announced until January. I think this means we probably have until then to finish our Croissant 1.1 implementation.
  • https://validator.schema.org still shows errors with 1.1 examples such as https://github.com/mlcommons/croissant/blob/7caffb6795928c304db48f3d8e0c1c482e142056/datasets/1.1/huggingface-squad_v2/metadata.json, so I don't think https://github.com/mlcommons/croissant/issues/725 (which has the 1.1 label) has been fixed yet. I was told, "We need to figure out how to let the validator know that FileObject and FileSet extend DataDownload."
  • Summary stats ( https://github.com/mlcommons/croissant/issues/640 ) probably won't be ready for 1.1 so they might change or remove the 1.1 label for that issue.

These are the two PRs we went over that had spec updates in them:

  • https://github.com/mlcommons/croissant/pull/968
  • https://github.com/mlcommons/croissant/pull/969

The changes:

  • containedIn (which we don't use since we don't list the content of zip files)
  • external vocabularies (useful but we don't use them, yet)
  • provenance (same)
  • data use restrictions (same)

Now that we know that summary stats probably won't be included, I don't think there will be anything particularly useful for users in the 1.1 release we make. Again, there will be some cleanup that might compel me to make a major version bump. With the change from 1.0 to 1.1, at least we'll be advertising the latest version. And we can add more features later, especially as our users request them.

pdurbin avatar Dec 03 '25 20:12 pdurbin