Proposal to add more legal information to Croissant datasets

Open dominik-kuhn opened this issue 10 months ago • 6 comments

Hi guys,

I am currently working on an Austrian research project called FAIRMedia, which is concerned with managing datasets in accordance with European law and making them accessible. We consider it important to describe datasets thoroughly so that they are easier to find and use. We came across Croissant and decided to use it as the description format. In the course of the project, however, we noticed that some properties that are important for describing a dataset are still missing from Croissant. For this reason, I would like to propose adding the following dataset properties to Croissant and to start a discussion about them:

| Property | Expected Type | Cardinality | Description |
| --- | --- | --- | --- |
| dataUsageTerms | sc:Text | ONE | Specifies the conditions under which the dataset may be processed, including permitted commercial and non-commercial uses, text and data mining, etc. |
| userRights | sc:Text | ONE | Specific rights of users using the dataset, including permission to modify or redistribute datasets or the obligation to cite the source. |
| dataProcessingTerms | sc:Text | ONE | Description of the legal basis for the collection of personal data; definition of the legal basis for the provision and further processing of personal data. |
| controllership | "sole controllership" or "joint controllership" | ONE | Determination of controllership. |
| jointControllerAgreementConcluded | sc:Boolean | ONE | Has a Joint Controller Agreement (JCA) been concluded? |
| liabilityClauses | sc:Text | ONE | Description of the disclaimers and limitations of liability in connection with the use of the data in order to minimize legal risks for the data provider. |
| indemnityClauses | sc:Text | ONE | Conditions under which the user must indemnify the data provider, including a third-party provider in the event of legal disputes. |
| copyright | sc:Text | ONE | Detailed description of the copyright of the dataset's content. |
| dataAnonymizationProtocol | sc:Text | ONE | Description of the anonymization procedures used to protect the identity of individuals in the data. |
| dataSecurityProtocol | sc:Text | ONE | Description of the security measures taken to protect the data, including encryption, access controls and measures to ensure data integrity, including specific safeguards for different types of personal data including sensitive data (e.g. health data, sexual orientation). |
| dataProtectionType | "anonymized" or "personal" | ONE | Determines whether the data is anonymous/anonymized or personal data. |
| personalData | sc:Text | ONE | Description of the categories and types of personal data. |

These properties primarily cover legal information about the dataset and contain more detailed information about the anonymization of the data and which personal data is included.
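
To make the table more concrete, here is a minimal Python sketch of how a tool might check these properties on a dataset description. It is only an illustration under assumptions: the properties are written as bare keys without a namespace (the proposal does not define one), the function name `validate_legal_metadata` and the example values are hypothetical, and nothing here is part of Croissant or its tooling.

```python
# Enumerated values copied from the proposal table above.
ALLOWED_VALUES = {
    "controllership": {"sole controllership", "joint controllership"},
    "dataProtectionType": {"anonymized", "personal"},
}
BOOLEAN_PROPERTIES = {"jointControllerAgreementConcluded"}
TEXT_PROPERTIES = {
    "dataUsageTerms", "userRights", "dataProcessingTerms", "liabilityClauses",
    "indemnityClauses", "copyright", "dataAnonymizationProtocol",
    "dataSecurityProtocol", "personalData",
}


def validate_legal_metadata(metadata: dict) -> list[str]:
    """Return a list of problems found in the proposed legal properties.

    Keys that are not part of the proposal (e.g. "@type", "name") are ignored.
    Cardinality ONE is interpreted as "a single value, not a list".
    """
    errors = []
    for name, value in metadata.items():
        if name in ALLOWED_VALUES:
            if not isinstance(value, str) or value not in ALLOWED_VALUES[name]:
                allowed = ", ".join(sorted(ALLOWED_VALUES[name]))
                errors.append(f"{name}: expected one of: {allowed}")
        elif name in BOOLEAN_PROPERTIES:
            if not isinstance(value, bool):
                errors.append(f"{name}: expected a single boolean")
        elif name in TEXT_PROPERTIES:
            if not isinstance(value, str):
                errors.append(f"{name}: expected a single text value")
    return errors


# Hypothetical dataset description using some of the proposed properties.
example = {
    "@type": "sc:Dataset",
    "name": "example-dataset",
    "dataProtectionType": "personal",
    "controllership": "joint controllership",
    "jointControllerAgreementConcluded": True,
    "personalData": "Names and voice recordings of interviewees.",
}

print(validate_legal_metadata(example))
```

With free-text fields, such a check can only verify the shape of the value, not its legal content.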

I would also suggest highlighting or recommending the releaseNotes property of schema.org to describe what changed in this version of the dataset with the reasons for the changes.

What do you think about this?

dominik-kuhn, Feb 17 '25 15:02

These are some comparable fields from Croissant RAI 1.0 and the SPDX 3.0 Dataset Profile:

| FAIRMedia Property | Croissant RAI 1.0 Property | SPDX 3.0 Dataset Profile Property |
| --- | --- | --- |
| dataUsageTerms | rai:dataUseCases | intendedUse |
| userRights | | |
| dataProcessingTerms | | |
| controllership | | |
| jointControllerAgreementConcluded | | |
| liabilityClauses | | |
| indemnityClauses | | |
| copyright | | |
| dataAnonymizationProtocol | | anonymizationMethodUsed |
| dataSecurityProtocol | | |
| dataProtectionType | | |
| personalData | rai:personalSensitiveInformation | |
| | rai:dataBiases | knownBias |
| | rai:dataCollection | dataCollectionProcess |
| | rai:dataPreprocessingProtocol | dataPreprocessing |
| | rai:dataReleaseMaintenancePlan | datasetUpdateMechanism |

Looks like FAIRMedia is currently more extensive in the area of legal terms and conditions, which maps directly to "Use case 6: Regulatory compliance" of Croissant RAI.

(It's not that Croissant RAI and SPDX cannot capture these at all, but some of them are probably combined into a single free-text field, which makes them harder to extract and work with.)

bact, Feb 22 '25 12:02

Thanks for the comparison! We have already considered Croissant RAI and felt that we need to add the above properties. dataUsageTerms differs from dataUseCases: it describes the conditions under which the dataset may be used, so it is legal information, whereas dataUseCases only describes the recommended uses. Our personalData property is a more detailed description of the personal data, not just a specification of the categories of personal sensitive information, and it is only filled in if dataProtectionType is set to "personal".
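
The dependency between personalData and dataProtectionType described above could be checked roughly as follows. This is a sketch only; the function name `check_personal_data` and the plain-dict representation are assumptions, not part of Croissant or of the FAIRMedia proposal.

```python
def check_personal_data(metadata: dict) -> list[str]:
    """Flag personalData when dataProtectionType is not "personal".

    Only the direction stated in the thread is enforced: personalData is
    filled in only if dataProtectionType is "personal". Whether personalData
    is mandatory in that case is left open here.
    """
    errors = []
    if "personalData" in metadata and metadata.get("dataProtectionType") != "personal":
        errors.append(
            'personalData is only expected when dataProtectionType is "personal"'
        )
    return errors


# Hypothetical descriptions: the first is consistent, the second is not.
consistent = {"dataProtectionType": "personal", "personalData": "Names, e-mail addresses"}
inconsistent = {"dataProtectionType": "anonymized", "personalData": "Names, e-mail addresses"}

print(check_personal_data(consistent))
print(check_personal_data(inconsistent))
```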

dominik-kuhn, Feb 24 '25 08:02

I wonder if the -Terms and -Clauses properties, at least in your domain, have some common terms/clauses that appear often enough that we could model them as an enum (to make them more machine-readable)?
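
As a rough illustration of the enum idea, the uses named in the dataUsageTerms description (commercial use, non-commercial use, text and data mining) could become enumerated values. The class name `PermittedUse`, the value strings, and the helper `parse_permitted_uses` are hypothetical and not an agreed vocabulary.

```python
from enum import Enum


class PermittedUse(Enum):
    """Hypothetical enumeration derived from the dataUsageTerms description."""

    COMMERCIAL = "commercial use"
    NON_COMMERCIAL = "non-commercial use"
    TEXT_AND_DATA_MINING = "text and data mining"


def parse_permitted_uses(values: list[str]) -> set[PermittedUse]:
    """Convert strings to enum members; unknown strings raise ValueError."""
    return {PermittedUse(value) for value in values}


uses = parse_permitted_uses(["non-commercial use", "text and data mining"])
print(PermittedUse.COMMERCIAL in uses)
```

A consumer can then test for a specific permission directly instead of parsing free text.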

bact, Feb 25 '25 13:02

People who are new to datasets are probably going to be confused by the copyright/license fields. The vocabulary names we have now don't really get across that the underlying data in the dataset might have a wildly different license from the dataset license.

In my case, Common Crawl's Terms of Use talk about both in one document, so the exact wording doesn't matter. But I suspect some Croissant builders are going to leave some Croissant users confused.

wumpus, Mar 16 '25 18:03

I support this; we have similar requirements in BioCroissant - https://github.com/mlcommons/croissant/issues/833

susheel, Mar 21 '25 13:03

Thanks @bact for alerting about this work. @all for modelling legally relevant information, I suggest looking at DPV where we have created detailed vocabularies and taxonomies in a way that provides a general jurisdiction-agnostic core which is extended with jurisdiction- and regulation-specific extensions like EU and GDPR. This means you can say dpv:InformedConsent for general consent and eu-gdpr:A6-1-a for GDPR Art.6(1a) defined consent. We also have controller, processor, notices, and more stuff that would be relevant here but hasn't been discussed yet. In addition, we also have an extensive vocabulary for technical measures like encryption, and org measures like impact assessments. Currently, we're adding more concepts to represent stuff from AI standards and AI Act. So no need to reinvent the wheel! And we're very much open and welcoming to extend our vocabularies to support work like this if something isn't covered. See https://w3id.org/dpv/primer for a generic overview and https://w3id.org/dpv for the latest spec.
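
For illustration, a JSON-LD style annotation using the two concepts mentioned above might look like the following Python sketch. The namespace IRIs in `@context`, the use of `dpv:hasLegalBasis` as the linking property, and the `https://example.org/...` dataset identifier are assumptions that should be checked against the DPV specification; only `dpv:InformedConsent` and `eu-gdpr:A6-1-a` come from the comment itself.

```python
import json

annotation = {
    "@context": {
        # Assumed namespace IRIs; verify against https://w3id.org/dpv.
        "dpv": "https://w3id.org/dpv#",
        "eu-gdpr": "https://w3id.org/dpv/legal/eu/gdpr#",
    },
    # Hypothetical dataset identifier.
    "@id": "https://example.org/datasets/example",
    # Assumed property for attaching a legal basis to the dataset.
    "dpv:hasLegalBasis": [
        {"@id": "dpv:InformedConsent"},  # jurisdiction-agnostic consent
        {"@id": "eu-gdpr:A6-1-a"},       # consent as defined in GDPR Art. 6(1)(a)
    ],
}

print(json.dumps(annotation, indent=2))
```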

coolharsh55, Mar 22 '25 10:03