Provide guidance on using properties from external vocabularies
E.g., to support https://github.com/mlcommons/croissant/issues/808
This is a strawman proposal for using external vocabularies with Croissant. The goal is to support standards such as W3C PROV for provenance, DUO for data use restrictions, and other controlled vocabularies.
Properties from external vocabularies can be used at multiple levels in a Croissant description:
- Dataset-level metadata.
- Metadata at a finer level of granularity, e.g., on a FileObject, RecordSet, or Field.
- Data-level annotations.
Using external vocabularies with Croissant metadata
The general approach we use is based on schema.org's PropertyValue mechanism, which lets users specify the name and value of a property not defined in the current schema. PropertyValue also has a propertyID that can be used to identify the property, typically via its URL.
Here is a simple example, using the wasDerivedFrom property from the PROV vocabulary:
```
{
  "@type": "PropertyValue",
  "name": "wasDerivedFrom",
  "propertyID": "prov:wasDerivedFrom",
  "value": "https://huggingface.co/datasets/timm/imagenet-w21-wds"
}
```
Assuming the prefix prov was mapped to the namespace http://www.w3.org/ns/prov# in the context, the propertyID will expand to the full URL http://www.w3.org/ns/prov#wasDerivedFrom.
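For example, the dataset's @context could include the prefix mapping alongside the usual Croissant context entries (other entries omitted):

```
"@context": {
  "@vocab": "https://schema.org/",
  "prov": "http://www.w3.org/ns/prov#"
}
```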
Values of properties can be more complex than a simple text value or URL. This is supported in schema.org, since StructuredValue is also an allowed type for value. The value of the property should match the value expected by the source vocabulary. For example:
```
{
  "@type": "PropertyValue",
  "name": "wasGeneratedBy",
  "propertyID": "prov:wasGeneratedBy",
  "value": {
    "@type": "prov:Activity",
    "prov:label": "Reformulation and Multimodal Extension",
    "prov:used": { "@id": "urn:dataset:MMLU" },
    "prov:endedAtTime": "2023-05-01T00:00:00Z"
  }
}
{
  "@type": "PropertyValue",
  "name": "wasAttributedTo",
  "propertyID": "prov:wasAttributedTo",
  "value": {
    "@type": "prov:Agent",
    "prov:label": "MMLU/MMMU Curation Team"
  }
}
```
In the above example, the values of prov:wasGeneratedBy and prov:wasAttributedTo match the types prov:Activity and prov:Agent as defined in the PROV specification.
We now know how to represent properties from external vocabularies, but where do we put them? We propose two complementary mechanisms to add external properties to Croissant:
- For well-known, important use cases such as provenance or usage restrictions, we will add container properties to Croissant:
| Property | ExpectedType | Cardinality | Comments |
|---|---|---|---|
| provenance | PropertyValue | MANY | Provenance properties, generally specified using the W3C PROV vocabulary |
| useRestrictions | PropertyValue | MANY | Use restrictions properties, generally specified using the DUO vocabulary |
- For other use cases, we will adopt the "catch-all" container already provided by schema.org: additionalProperty.
These properties are available on the following Croissant types:
Dataset, FileObject, RecordSet, Field, and Annotation.
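As an illustrative sketch of what this could look like under the proposal (the dataset name and the additionalProperty entry are made up; the DUO code shown, DUO_0000042 for "general research use", is just one example of a controlled-vocabulary value):

```
{
  "@type": "sc:Dataset",
  "name": "example-dataset",
  "provenance": [
    {
      "@type": "PropertyValue",
      "name": "wasDerivedFrom",
      "propertyID": "prov:wasDerivedFrom",
      "value": "https://huggingface.co/datasets/timm/imagenet-w21-wds"
    }
  ],
  "useRestrictions": [
    {
      "@type": "PropertyValue",
      "name": "generalResearchUse",
      "propertyID": "http://purl.obolibrary.org/obo/DUO_0000042",
      "value": "General research use"
    }
  ],
  "additionalProperty": [
    {
      "@type": "PropertyValue",
      "name": "someOtherProperty",
      "propertyID": "https://example.org/vocab#someOtherProperty",
      "value": "example value"
    }
  ]
}
```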
Note that this mechanism does not let us "annotate" Croissant properties using external vocabularies. For example, it's not possible to specify the provenance of the description property of a dataset.
Using external vocabularies with data
Using external vocabularies in the data of datasets is more straightforward.
First, there is already support for specifying the dataType of a Field to be a class from an external vocabulary. The values of that Field can then be instances of that class. For instance, cities can use the Wikidata vocabulary for City:
```
{
  "@id": "cities/url",
  "@type": "cr:Field",
  "dataType": ["https://schema.org/URL", "https://www.wikidata.org/wiki/Q515"]
}
```
External vocabularies can also be used at the level of a RecordSet, to type entire "rows". The dataType can be a class from an external vocabulary, and each Field can be mapped to a corresponding property, either implicitly via its name or explicitly via equivalentProperty. For instance, using the Wikidata City type again:
```
{
  "@id": "cities",
  "@type": "cr:RecordSet",
  "dataType": "wd:Q515",
  "field": [
    {
      "@id": "cities/flag_image",
      "@type": "cr:Field",
      "equivalentProperty": "wd:Property:P41"
    },
    {
      "@id": "cities/country",
      "@type": "cr:Field",
      "equivalentProperty": "wd:Property:P17"
    }
  ]
}
```
Finally, external vocabulary properties can also be associated with data-level annotations. For instance, summary statistics from the DDI summary statistic vocabulary can be associated with a particular column as follows:
```
{
  "@id": "movies",
  "@type": "cr:RecordSet",
  "field": [
    {
      "@id": "movies/stars",
      "@type": "cr:Field",
      "annotation": {
        "@id": "movies/stars/arithmeticMean",
        "@type": "cr:Field",
        "equivalentProperty": "http://rdf-vocabulary.ddialliance.org/cv/SummaryStatisticType/2.1.2/7975ed0",
        "value": 4.2
      }
    }
  ]
}
```
Note that in this case, we did not specify a type at the container level (movies/stars). It's okay to omit the type when it's clear from the property itself. Also, this annotation has a constant value, which is the arithmetic mean of star ratings across the values of the field.
- So to clarify, the `provenance` and `useRestrictions` are properties on the types, describing Croissant metadata (not existing properties).
- `Field` is currently defined as a subclass of https://schema.org/Intangible. What is the relationship to the `PropertyType` schema properties being added? Is it similar to the `FileObject`-`CreativeWork` class, where it uses properties but is not a hard-linked subclass?
- Is there a way to determine if a `Field.name` is a context value vs. just a name? Colon presence?
- Does the `Field.equivalentProperty ONE` property point to the same type of thing (vocab context) as the `Field.dataType MANY` property? (It is good and should be kept, just checking datatypes.)
- Is schema.org's `PropertyValue` described much in the spec? Should info go under the id-and-reference-mechanism section?
I would suggest not using PropertyValue as a triple-split indirection for RDF vocabularies that are perfectly suitable for use directly in JSON-LD, as it means you have to do special processing to get the intended statements back out.
https://www.w3.org/submissions/prov-jsonld/ even suggests a context that can be used at the same time. In that case you just need to modify @context. PropertyValue is intended not to build arbitrary triples, but to qualify existing measurements.
To link to provenance that exists in a separate resource (it could even be a different named graph in the same JSON-LD), I would suggest using the http://www.w3.org/ns/prov#has_provenance term.
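For illustration, linking a dataset to provenance kept in a separate document via prov:has_provenance could look roughly like this (a minimal sketch; the provenance document URL is made up):

```
{
  "@type": "sc:Dataset",
  "name": "simple-pass",
  "conformsTo": "http://mlcommons.org/croissant/1.1",
  "prov:has_provenance": { "@id": "https://example.org/provenance/simple-pass.jsonld" }
}
```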
Full disclosure: I was one of the PROV-O authors.
@stain I like the idea of using the vocabularies directly, and using prov:has_provenance for external specification.
Just to make sure I got it right, below is a short write-up of what that would look like.
It would be good to work out a few examples with other external vocabularies to make sure this approach is good for them as well.
Instead of using PropertyValue, we can directly embed properties from other vocabularies. RDF / JSON-LD supports that. This approach is more direct, but provides a bit less structure on the Croissant side, which may make it harder to validate.
Let's revisit the previous examples to see what they would look like:
Simple dataset with prov:wasDerivedFrom:
```
{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    ...
    "prov": "http://www.w3.org/ns/prov#",
    ...
  },
  "@type": "sc:Dataset",
  "name": "simple-pass",
  "conformsTo": "http://mlcommons.org/croissant/1.1",
  "prov:wasDerivedFrom": "https://huggingface.co/datasets/timm/imagenet-w21-wds"
}
```
More structured properties (omitting context):
```
{
  "@type": "sc:Dataset",
  "name": "simple-pass",
  "conformsTo": "http://mlcommons.org/croissant/1.1",
  "prov:wasGeneratedBy": {
    "@type": "prov:Activity",
    "prov:label": "Reformulation and Multimodal Extension",
    "prov:used": { "@id": "urn:dataset:MMLU" },
    "prov:endedAtTime": "2023-05-01T00:00:00Z"
  },
  "prov:wasAttributedTo": {
    "@type": "prov:Agent",
    "prov:label": "MMLU/MMMU Curation Team"
  }
}
```
This definitely looks more readable, and very natural for adding provenance information. There is no "grouping" of provenance under an umbrella attribute, but that's not a problem in my opinion.
In terms of where such properties can be used, there are no specific constraints, but the Croissant spec can provide recommendations on when/where to use specific vocabularies.
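For instance, the same kind of property could sit on a FileObject to describe where a particular file came from (a sketch; the file name and source URL are made up):

```
{
  "@id": "data.csv",
  "@type": "cr:FileObject",
  "name": "data.csv",
  "contentUrl": "https://example.org/data.csv",
  "encodingFormat": "text/csv",
  "prov:wasDerivedFrom": { "@id": "https://example.org/raw/source-dump.csv" }
}
```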
Using external vocabularies to describe Croissant data would not be affected by this change, and would continue to work as described above.
> So to clarify, the `provenance` and `useRestrictions` are properties on the types, describing Croissant metadata (not existing properties).

Yes, in this proposal they would be new added properties.

> `Field` is currently defined as a subclass of https://schema.org/Intangible. What is the relationship to the `PropertyType` schema properties being added? Is it similar to the `FileObject`-`CreativeWork` class, where it uses properties but is not a hard-linked subclass?

I'm not sure I follow... Where is `PropertyType` defined?

> Is there a way to determine if a `Field.name` is a context value vs. just a name? Colon presence?

`Field.name` is just a name; `@id` is the identifier. (There may be some remnants of an earlier version of the spec where `name` played the role of an identifier.)

> Does the `Field.equivalentProperty ONE` property point to the same type of thing (vocab context) as the `Field.dataType MANY` property? (It is good and should be kept, just checking datatypes.)

No, `equivalentProperty` points to a property, while `dataType` points to a class.

> Is schema.org's `PropertyValue` described much in the spec? Should info go under the id-and-reference-mechanism section?

Not at this point. It's described in schema.org. If we end up using this mechanism, we should provide some description in the Croissant spec and point to the schema.org documentation.
Also, I agree on using the vocabulary directly. This is the approach we have been using to capture provenance information.
On the other hand, high-level attributes may still be needed. For instance, in the case of provenance, with PROV-O we are capturing the data lineage, the activities that generated the data, and the actors involved. However, we might fall short if, in the future, we want to adopt other types of provenance information, such as the content attribution mechanism proposed by C2PA, which involves provenance information that is cryptographically bound to the data point. Additionally, for the data access conditions use case, a high-level attribute can make the Croissant description more readable.
Note that this is not a strong opinion; it's just to have proper context to make the decision.
To provide another example of the provenance use case, here is the provenance of the WildChat-1M dataset using PROV-O. The diagram gives a quick hint of what we are describing in the Croissant description.
Note that the high-level "provenance" attribute could be removed.
```
{
  "@context": {
    "prov": "http://www.w3.org/ns/prov#"
  },
  "@type": "sc:Dataset",
  "provenance": {
    "prov:wasDerivedFrom": {
      "@type": "prov:Entity",
      "@id": "https://huggingface.co/datasets/allenai/WildChat-1M-Full",
      "url": "https://huggingface.co/datasets/allenai/WildChat-1M-Full",
      "name": "WildChat-1M-Full Dataset",
      "prov:generatedAtTime": "2024-05-01"
    },
    "prov:wasGeneratedBy": [
      {
        "@type": "prov:Activity",
        "@id": "activity:journalismPIIremoval",
        "prov:description": "Conversations flagged by Niloofar Mireshghallah and her collaborators in 'Breaking News: Case Studies of Generative AI's Use in Journalism' for containing PII or sensitive information have been removed from this version of the dataset.",
        "prov:startedAtTime": "2024-10-17",
        "prov:wasAssociatedWith": [
          {
            "@type": "prov:Entity",
            "prov:name": "Case Studies of Generative AI's Use in Journalism",
            "prov:url": "https://arxiv.org/abs/2405.01470"
          },
          {
            "@type": "prov:Person",
            "prov:name": "Niloofar Mireshghallah"
          }
        ]
      },
      {
        "@type": "prov:Activity",
        "@id": "activity:toxicContentRemoval",
        "prov:description": "All toxic conversations identified by the OpenAI Moderation API or Detoxify have been removed from this version of the dataset.",
        "prov:startedAtTime": "2024-07-22",
        "prov:wasAssociatedWith": [
          {
            "@type": "prov:SoftwareAgent",
            "name": "OpenAI Moderation API"
          }
        ]
      },
      {
        "@type": "prov:Activity",
        "@id": "activity:PIIremoval",
        "prov:description": "The data has been de-identified with Microsoft Presidio and hand-written rules by the authors.",
        "prov:wasAssociatedWith": {
          "@type": "prov:SoftwareAgent",
          "name": "Microsoft Presidio"
        }
      }
    ]
  },
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "name": "WildChat-1M"
}
```
This makes sense to me. I'm not completely sure we need to add the cr:provenance container property, but I can be convinced otherwise.
Merged https://github.com/mlcommons/croissant/pull/955 to add this functionality to the Croissant specification.
The next step is to add example datasets with external vocabularies and make sure the mlcroissant implementation supports doing that.