Proposal to add more legal information to Croissant datasets
Hi guys,
I am currently working on an Austrian research project called FAIRMedia, which is concerned with managing datasets in accordance with European law and making them accessible. We concluded that datasets need to be described in sufficient detail to be accessible, came across Croissant, and decided to use it as the description format. In the course of the project, however, we noticed that some properties that are important for describing a dataset are still missing from Croissant. For this reason, I would like to propose adding the following dataset properties to Croissant and possibly start a discussion about them:
| Property | Expected Type | Cardinality | Description |
|---|---|---|---|
| dataUsageTerms | sc:Text | ONE | Specifies the conditions under which the dataset may be processed, including permitted commercial and non-commercial uses, text and data mining, etc. |
| userRights | sc:Text | ONE | Specific rights of users using the dataset, including permission to modify or redistribute datasets or the obligation to cite the source. |
| dataProcessingTerms | sc:Text | ONE | Description of the legal basis for the collection of personal data; definition of the legal basis for the provision and further processing of personal data. |
| controllership | "sole controllership" or "joint controllership" | ONE | Determination of controllership. |
| jointControllerAgreementConcluded | sc:Boolean | ONE | Has a Joint Controller Agreement (JCA) been concluded? |
| liabilityClauses | sc:Text | ONE | Description of the disclaimers and limitations of liability in connection with the use of the data in order to minimize legal risks for the data provider. |
| indemnityClauses | sc:Text | ONE | Conditions under which the user must indemnify the data provider, including a third-party provider in the event of legal disputes. |
| copyright | sc:Text | ONE | A detailed description of the copyright of the dataset's content. |
| dataAnonymizationProtocol | sc:Text | ONE | Description of the anonymization procedures used to protect the identity of individuals in the data. |
| dataSecurityProtocol | sc:Text | ONE | Description of the security measures taken to protect the data, including encryption, access controls and measures to ensure data integrity, as well as specific safeguards for different types of personal data, including sensitive data (e.g. health data, sexual orientation). |
| dataProtectionType | "anonymized" or "personal" | ONE | Determines whether the data is anonymous/anonymized or personal data. |
| personalData | sc:Text | ONE | Description of the categories and types of personal data. |
These properties primarily cover legal information about the dataset and contain more detailed information about the anonymization of the data and which personal data is included.
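To make the proposal more concrete, here is a minimal sketch of how a few of these properties might appear in a JSON-LD description. The `fm:` prefix, its namespace IRI, the helper name `build_dataset_description`, and all example values are hypothetical; none of them are part of Croissant, and the context shown is not a complete Croissant context.

```python
import json

# Hypothetical namespace for the proposed properties (an assumption, not part of Croissant).
FAIRMEDIA_NS = "https://example.org/fairmedia#"


def build_dataset_description(name: str, legal: dict) -> dict:
    """Return a JSON-LD-style dict carrying the proposed legal properties."""
    doc = {
        "@context": {
            "sc": "https://schema.org/",
            "fm": FAIRMEDIA_NS,
        },
        "@type": "sc:Dataset",
        "sc:name": name,
    }
    # Each proposed property is stored under the hypothetical fm: prefix.
    for key, value in legal.items():
        doc[f"fm:{key}"] = value
    return doc


# Illustrative values only; they do not describe a real dataset.
example = build_dataset_description(
    "example-media-corpus",
    {
        "dataUsageTerms": "Non-commercial research use only; text and data mining permitted.",
        "controllership": "joint controllership",
        "jointControllerAgreementConcluded": True,
        "dataProtectionType": "personal",
        "personalData": "Names and voice recordings of interviewees.",
    },
)

print(json.dumps(example, indent=2))
```

If the properties were adopted, they would presumably live under a Croissant namespace rather than a separate prefix.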
I would also suggest highlighting or recommending the releaseNotes property of schema.org to describe what changed in this version of the dataset with the reasons for the changes.
What do you think about this?
These are some comparable fields from Croissant RAI 1.0 and the SPDX 3.0 Dataset Profile:
| FAIRMedia Property | Croissant RAI 1.0 Property | SPDX 3.0 Dataset Profile Property |
|---|---|---|
| dataUsageTerms | rai:dataUseCases | intendedUse |
| userRights | | |
| dataProcessingTerms | | |
| controllership | | |
| jointControllerAgreementConcluded | | |
| liabilityClauses | | |
| indemnityClauses | | |
| copyright | | |
| dataAnonymizationProtocol | | anonymizationMethodUsed |
| dataSecurityProtocol | | |
| dataProtectionType | | |
| personalData | rai:personalSensitiveInformation | |
| | rai:dataBiases | knownBias |
| | rai:dataCollection | dataCollectionProcess |
| | rai:dataPreprocessingProtocol | dataPreprocessing |
| | rai:dataReleaseMaintenancePlan | datasetUpdateMechanism |
Looks like FAIRMedia is currently more extensive in the area of legal terms and conditions, which falls directly under "Use case 6: Regulatory compliance" of Croissant RAI.
(It's not that Croissant RAI and SPDX cannot capture these at all, but some of them are probably combined into a single free-text field, which makes them harder to extract and work with.)
Thanks for the comparison!
We have already considered Croissant RAI and felt that we still need to add the above properties. dataUsageTerms is different from dataUseCases because it describes the conditions under which the dataset may be used. It is therefore legal information, whereas dataUseCases only describes the recommended uses. Our personalData property is a more detailed description of the personal data, not just a specification of categories of personal sensitive information, and it is only filled in if dataProtectionType is set to "personal".
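The dependency between personalData and dataProtectionType could be checked mechanically. The following is only a sketch: the function name `check_legal_fields` and the plain-dict input are assumptions for illustration, and the allowed values are taken from the property table in the proposal.

```python
# Allowed values for dataProtectionType, as listed in the proposal.
ALLOWED_PROTECTION_TYPES = ("anonymized", "personal")


def check_legal_fields(meta: dict) -> list[str]:
    """Return a list of problems found in the proposed legal fields (hypothetical helper)."""
    problems = []
    protection = meta.get("dataProtectionType")

    if protection not in ALLOWED_PROTECTION_TYPES:
        problems.append(
            f"dataProtectionType must be one of {ALLOWED_PROTECTION_TYPES}, got {protection!r}"
        )

    # personalData is only meant to be filled in for personal data.
    if protection != "personal" and meta.get("personalData"):
        problems.append(
            "personalData should only be set when dataProtectionType is 'personal'"
        )

    return problems


ok = check_legal_fields(
    {"dataProtectionType": "personal", "personalData": "Names of interviewees."}
)
bad = check_legal_fields(
    {"dataProtectionType": "anonymized", "personalData": "Names of interviewees."}
)
print(ok)
print(bad)
```

The sketch deliberately does not require personalData when the type is "personal", since the proposal only states when the field may be filled in.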
I wonder if the -Terms and -Clauses properties, at least in your domain, have some common terms/clauses that appear often enough that we could model them as an enum (to make them more machine-readable)?
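As a rough illustration of the enum idea, the sketch below models the three kinds of use that the dataUsageTerms description already mentions. The class name `UsageTerm`, the helper `parse_usage_terms`, and the string values are hypothetical; an actual controlled vocabulary would need to be agreed on.

```python
from enum import Enum


class UsageTerm(Enum):
    """Hypothetical controlled vocabulary for dataUsageTerms."""

    COMMERCIAL_USE = "commercial use"
    NON_COMMERCIAL_USE = "non-commercial use"
    TEXT_AND_DATA_MINING = "text and data mining"


def parse_usage_terms(values: list[str]) -> set[UsageTerm]:
    """Convert raw strings to enum members; unknown strings raise ValueError."""
    return {UsageTerm(value) for value in values}


permitted = parse_usage_terms(["non-commercial use", "text and data mining"])
print(sorted(term.name for term in permitted))
```

A free-text field could still sit alongside such an enum for conditions that do not fit any predefined value.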
People who are new to datasets are probably going to be confused by the copyright/license fields. The vocabulary names we have now don't really get across that the underlying data in the dataset might have a wildly different license from the dataset license.
In my case, Common Crawl's Terms of Use talk about both in one document, so the exact wording doesn't matter. But I suspect some Croissant builders are going to leave some Croissant users confused.
I support this; we have similar requirements in BioCroissant - https://github.com/mlcommons/croissant/issues/833
Thanks @bact for alerting about this work. @all: for modelling legally relevant information, I suggest looking at DPV, where we have created detailed vocabularies and taxonomies in a way that provides a general jurisdiction-agnostic core, which is extended with jurisdiction- and regulation-specific extensions like EU and GDPR. This means you can say dpv:InformedConsent for general consent and eu-gdpr:A6-1-a for consent as defined in GDPR Art. 6(1)(a). We also have controller, processor, notices, and more stuff that would be relevant here but hasn't been discussed yet. In addition, we have an extensive vocabulary for technical measures like encryption, and organisational measures like impact assessments. Currently, we're adding more concepts to represent stuff from AI standards and the AI Act. So no need to reinvent the wheel! And we're very open to extending our vocabularies to support work like this if something isn't covered. See https://w3id.org/dpv/primer for a generic overview and https://w3id.org/dpv for the latest spec.
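For illustration, the two identifiers mentioned above can be expanded from their prefixed form into full IRIs. This is only a sketch: the namespace IRIs in `PREFIXES` are assumptions that should be verified against the DPV spec, and the `expand` helper is hypothetical.

```python
# Assumed namespace IRIs; verify against the DPV specification before relying on them.
PREFIXES = {
    "dpv": "https://w3id.org/dpv#",
    "eu-gdpr": "https://w3id.org/dpv/legal/eu/gdpr#",
}


def expand(curie: str) -> str:
    """Expand a prefixed name such as 'dpv:InformedConsent' into a full IRI."""
    prefix, _, local = curie.partition(":")
    return PREFIXES[prefix] + local


# General consent versus consent as defined by the GDPR.
legal_basis = {
    "general": expand("dpv:InformedConsent"),
    "gdpr": expand("eu-gdpr:A6-1-a"),
}

for label, iri in legal_basis.items():
    print(label, iri)
```

A dataset description could then reference such IRIs instead of free text, which is what makes the legal basis machine-readable.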