croissant Proposal to add more legal information to Croissant datasets

Hi guys,

I am currently working on an Austrian research project called FAIRMedia, which is concerned with managing datasets in accordance with European law and making them accessible. We believe that datasets need to be described in sufficient detail to be truly accessible, so we looked for a description format and decided to use Croissant. In the course of the project, however, we noticed that some properties that are important for describing a dataset are still missing from Croissant. For this reason, I would like to propose adding the following dataset properties to Croissant and, if there is interest, start a discussion about them:

| Property | Expected Type | Cardinality | Description |
| --- | --- | --- | --- |
| `dataUsageTerms` | sc:Text | ONE | Specifies the conditions under which the dataset may be processed, including permitted commercial and non-commercial uses, text and data mining, etc. |
| `userRights` | sc:Text | ONE | Specific rights of users of the dataset, including permission to modify or redistribute the dataset or the obligation to cite the source. |
| `dataProcessingTerms` | sc:Text | ONE | Description of the legal basis for the collection of personal data; definition of the legal basis for the provision and further processing of personal data. |
| `controllership` | "sole controllership" or "joint controllership" | ONE | Determination of controllership. |
| `jointControllerAgreementConcluded` | sc:Boolean | ONE | Has a Joint Controller Agreement (JCA) been concluded? |
| `liabilityClauses` | sc:Text | ONE | Description of the disclaimers and limitations of liability in connection with the use of the data in order to minimize legal risks for the data provider. |
| `indemnityClauses` | sc:Text | ONE | Conditions under which the user must indemnify the data provider, including a third-party provider in the event of legal disputes. |
| `copyright` | sc:Text | ONE | Detailed description of the copyright of the dataset's content. |
| `dataAnonymizationProtocol` | sc:Text | ONE | Description of the anonymization procedures used to protect the identity of individuals in the data. |
| `dataSecurityProtocol` | sc:Text | ONE | Description of the security measures taken to protect the data, such as encryption, access controls and measures to ensure data integrity, as well as specific safeguards for different types of personal data, including sensitive data (e.g. health data, sexual orientation). |
| `dataProtectionType` | "anonymized" or "personal" | ONE | Determines whether the data is anonymous/anonymized or personal data. |
| `personalData` | sc:Text | ONE | Description of the categories and types of personal data. |

These properties primarily cover legal information about the dataset and contain more detailed information about the anonymization of the data and which personal data is included.
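
To make the proposal a bit more concrete, here is a rough sketch of how a few of these properties could appear in a JSON-LD dataset description. The `fm` prefix and its namespace URL are placeholders invented for this example (nothing like that exists in Croissant today), and the checks only mirror the "Expected Type" and "Cardinality" columns above.

```python
import json

# Hypothetical prefix and namespace for the proposed properties (not part of Croissant).
FM_PREFIX = "fm"
FM_NAMESPACE = "https://example.org/fairmedia#"

# Closed value sets taken from the "Expected Type" column of the proposal.
ALLOWED_VALUES = {
    "controllership": {"sole controllership", "joint controllership"},
    "dataProtectionType": {"anonymized", "personal"},
}


def build_legal_metadata(name, **properties):
    """Return a JSON-LD-style dict for a dataset carrying the proposed properties."""
    doc = {
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            FM_PREFIX: FM_NAMESPACE,
        },
        "@type": "sc:Dataset",
        "name": name,
    }
    for key, value in properties.items():
        # Cardinality ONE: a single value, never a collection.
        if isinstance(value, (list, tuple, set)):
            raise ValueError(f"{key} has cardinality ONE, got multiple values")
        allowed = ALLOWED_VALUES.get(key)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}, got {value!r}")
        doc[f"{FM_PREFIX}:{key}"] = value
    return doc


if __name__ == "__main__":
    example = build_legal_metadata(
        "demo-dataset",
        controllership="joint controllership",
        jointControllerAgreementConcluded=True,
        dataProtectionType="personal",
        personalData="Names and e-mail addresses of interviewees.",
    )
    print(json.dumps(example, indent=2))
```

The free-text properties pass through unchanged; only the two properties with a closed value set are validated.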

I would also suggest highlighting or recommending the `releaseNotes` property of schema.org for describing what changed in a given version of the dataset and why.

What do you think about this?

Feb 17 '25 15:02 dominik-kuhn

These are some comparable fields from Croissant RAI 1.0 and the SPDX 3.0 Dataset Profile:

| FAIRMedia Property | Croissant RAI 1.0 Property | SPDX 3.0 Dataset Profile Property |
| --- | --- | --- |
| dataUsageTerms | rai:dataUseCases | intendedUse |
| userRights | | |
| dataProcessingTerms | | |
| controllership | | |
| jointControllerAgreementConcluded | | |
| liabilityClauses | | |
| indemnityClauses | | |
| copyright | | |
| dataAnonymizationProtocol | | anonymizationMethodUsed |
| dataSecurityProtocol | | |
| dataProtectionType | | |
| personalData | rai:personalSensitiveInformation | |
| | rai:dataBiases | knownBias |
| | rai:dataCollection | dataCollectionProcess |
| | rai:dataPreprocessingProtocol | dataPreprocessing |
| | rai:dataReleaseMaintenancePlan | datasetUpdateMechanism |

Looks like FAIRMedia is currently more extensive in the area of legal terms and conditions, which corresponds directly to "Use case 6: Regulatory compliance" of Croissant RAI.

(It's not that Croissant RAI and SPDX cannot capture these at all, but some of them are probably combined into a single free-text field, which makes them harder to extract and work with.)
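
For anyone who wants to work with the comparison programmatically, the table can be written down as a small crosswalk. This is only a sketch: the rows are copied from the table above, `None` stands for an empty cell, and the helper name is made up for this example.

```python
# Each row: (FAIRMedia, Croissant RAI 1.0, SPDX 3.0 Dataset Profile).
# None marks an empty cell in the comparison table.
CROSSWALK = [
    ("dataUsageTerms", "rai:dataUseCases", "intendedUse"),
    ("userRights", None, None),
    ("dataProcessingTerms", None, None),
    ("controllership", None, None),
    ("jointControllerAgreementConcluded", None, None),
    ("liabilityClauses", None, None),
    ("indemnityClauses", None, None),
    ("copyright", None, None),
    ("dataAnonymizationProtocol", None, "anonymizationMethodUsed"),
    ("dataSecurityProtocol", None, None),
    ("dataProtectionType", None, None),
    ("personalData", "rai:personalSensitiveInformation", None),
    (None, "rai:dataBiases", "knownBias"),
    (None, "rai:dataCollection", "dataCollectionProcess"),
    (None, "rai:dataPreprocessingProtocol", "dataPreprocessing"),
    (None, "rai:dataReleaseMaintenancePlan", "datasetUpdateMechanism"),
]


def fairmedia_without_counterpart(rows):
    """List FAIRMedia properties that have no comparable RAI or SPDX field."""
    return [fm for fm, rai, spdx in rows if fm and rai is None and spdx is None]


if __name__ == "__main__":
    for name in fairmedia_without_counterpart(CROSSWALK):
        print(name)
```

The properties it lists are exactly the legal terms-and-conditions fields that neither of the other two vocabularies has a dedicated slot for.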

Feb 22 '25 12:02 bact

Thanks for the comparison! We have already considered Croissant RAI and felt that we need to add the above properties. dataUsageTerms is different from dataUseCases because it describes the conditions under which the dataset may be used. So it carries legal information, whereas dataUseCases only describes the recommended uses. Our personalData property is a more detailed description of the personal data, not just a specification of categories of personal sensitive information, and it is only filled in if dataProtectionType is set to "personal".
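
The dependency between the two properties could be checked roughly like this. It is a minimal sketch: the function name is made up, the metadata is assumed to be a plain dict keyed by the unprefixed property names, and only the direction stated above is enforced (personalData present implies dataProtectionType "personal").

```python
def check_personal_data_rule(metadata):
    """Return a list of problems; an empty list means the rule is satisfied."""
    problems = []
    protection_type = metadata.get("dataProtectionType")
    personal_data = metadata.get("personalData")

    # personalData may only be filled in when the dataset is declared "personal".
    if personal_data and protection_type != "personal":
        problems.append(
            "personalData is filled in, but dataProtectionType is "
            f"{protection_type!r} instead of 'personal'"
        )
    return problems


if __name__ == "__main__":
    print(check_personal_data_rule(
        {"dataProtectionType": "anonymized", "personalData": "Names of interviewees."}
    ))
```

Whether a "personal" dataset must always carry a personalData description is a separate question that this check deliberately leaves open.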

Feb 24 '25 08:02 dominik-kuhn

I wonder whether the -Terms and -Clauses properties, at least in your domain, have common terms/clauses that appear often enough that we could model them as an enum (to make them more machine-readable).
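
Something along these lines is what I have in mind. The three values are just the examples mentioned in the proposal's description of dataUsageTerms; the class and function names are made up, and a real list would have to come from the domain.

```python
from enum import Enum


class UsageTerm(Enum):
    """Illustrative values only, lifted from the dataUsageTerms description."""

    COMMERCIAL_USE = "commercial use"
    NON_COMMERCIAL_USE = "non-commercial use"
    TEXT_AND_DATA_MINING = "text and data mining"


def parse_usage_terms(values):
    """Turn a list of strings into a set of UsageTerm members.

    Raises ValueError for any string that is not a known term.
    """
    return {UsageTerm(value) for value in values}


if __name__ == "__main__":
    terms = parse_usage_terms(["non-commercial use", "text and data mining"])
    print(UsageTerm.COMMERCIAL_USE in terms)
```

With a closed list like this, a consumer can test for a permitted use directly instead of parsing free text.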

Feb 25 '25 13:02 bact

People who are new to datasets are probably going to be confused by the copyright/license fields. The vocabulary names we have now don't really get across that the underlying data in the dataset might have a wildly different license from the dataset license.

In my case, Common Crawl's Terms of Use talk about both in one document, so the exact wording doesn't matter. But I suspect some Croissant builders are going to leave some Croissant users confused.

Mar 16 '25 18:03 wumpus

I support this; we have similar requirements in BioCroissant - https://github.com/mlcommons/croissant/issues/833

Mar 21 '25 13:03 susheel

Thanks @bact for alerting about this work. @all for modelling legally relevant information, I suggest looking at DPV where we have created detailed vocabularies and taxonomies in a way that provides a general jurisdiction-agnostic core which is extended with jurisdiction- and regulation-specific extensions like EU and GDPR. This means you can say dpv:InformedConsent for general consent and eu-gdpr:A6-1-a for consent as defined in GDPR Art. 6(1)(a). We also have controller, processor, notices, and more stuff that would be relevant here but hasn't been discussed yet. In addition, we also have an extensive vocabulary for technical measures like encryption, and organisational measures like impact assessments. Currently, we're adding more concepts to represent stuff from AI standards and the AI Act. So no need to reinvent the wheel! And we're very much open and welcoming to extend our vocabularies to support work like this if something isn't covered. See https://w3id.org/dpv/primer for a generic overview and https://w3id.org/dpv for the latest spec.
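
As a rough illustration of how the generic and the GDPR-specific term can sit side by side in a dataset description, here is a small sketch. The two namespace IRIs and the use of `dpv:hasLegalBasis` as the linking property are written from memory and should be treated as assumptions to verify against the spec; the helper names are made up.

```python
# Assumed namespace IRIs; please verify against the current DPV specification.
PREFIXES = {
    "dpv": "https://w3id.org/dpv#",
    "eu-gdpr": "https://w3id.org/dpv/legal/eu/gdpr#",
}


def expand(curie, prefixes=PREFIXES):
    """Expand a compact term such as 'dpv:InformedConsent' into a full IRI."""
    prefix, separator, local_name = curie.partition(":")
    if not separator or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {curie!r}")
    return prefixes[prefix] + local_name


def legal_basis_statement(*terms):
    """Build a JSON-LD-style fragment listing one or more legal-basis terms."""
    return {
        "@context": dict(PREFIXES),
        # Assumption: dpv:hasLegalBasis is the property used to attach the terms.
        "dpv:hasLegalBasis": [{"@id": term} for term in terms],
    }


if __name__ == "__main__":
    print(expand("dpv:InformedConsent"))
    print(legal_basis_statement("dpv:InformedConsent", "eu-gdpr:A6-1-a"))
```

A consumer that only understands the core vocabulary can still read the generic term, while a GDPR-aware one can use the more specific reference.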

Mar 22 '25 10:03 coolharsh55