Standardize Annotation Requirements for Ontology and Terms
Is there a set of required annotations for named individuals, classes, ontologies, and relations for common core, against which the files are reviewed prior to release?
For instance, I believe the following are universally required:
- rdfs:label
- cco:definition
- cco:definition_source (with the exception of the modal relations ontology, where annotation properties declared in IAO are used)
For measurement units (both classes and named individuals), the following are required (where applicable):
- cco:SI_unit_label
- cco:SI_unit_symbol
The following are optional:
- rdfs:comment
- cco:example_of_usage
- cco:alternative_label
- cco:definition_source
- cco:elucidation
- cco:acronym
For classes and relations, the following is required:
- cco:is_curated_in_ontology
And the following are required for each ontology:
- cco:content_license
- cco:code_license
- cco:copyright
- owl:versionInfo
- rdfs:label
If the above is correct:
- Around 500 resources lack definitions.
- The ROImport file lacks license and copyright statements, and the versionInfo statement for both it and the modal relations ontology has a different form from that of the other files.
- A handful of units of measurement that should have an associated SI unit symbol appear to lack one.
- Some annotations are declared in the file, but don't appear to be used (e.g. term creator, query text).
- Around 1,000 resources lack definition sources (and the format of those presently within the files isn't uniform and doesn't follow a standard citation format (e.g. MLA, CMS, APA)).
It'd be useful to have a list like this maintained somewhere. Even better would be to (automatically) test completeness each time a commit is pushed to github.
A small edit: Individuals, as well as classes and relations, have a cco:is_curated_in_ontology annotation.
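The term-level part of the list above could be checked with a query along the following lines. This is only a minimal sketch, not an existing release check: it assumes all files are loaded into a single graph, it treats rdfs:label, cco:definition, and cco:is_curated_in_ontology as the required set, and it limits results to IRIs in the CCO namespace. The VALUES lists are the place to adjust if the agreed requirements differ.

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX cco: <http://www.ontologyrepository.com/CommonCoreOntologies/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# One row per (term, required annotation) pair that is missing.
SELECT DISTINCT ?term ?required
WHERE {
  # Kinds of terms to check (assumed; individuals included here).
  VALUES ?type { owl:Class owl:ObjectProperty owl:DatatypeProperty owl:NamedIndividual }
  # Annotations assumed to be required on every term.
  VALUES ?required { rdfs:label cco:definition cco:is_curated_in_ontology }
  ?term a ?type .
  # Skip blank nodes and terms imported from other namespaces.
  FILTER (isIRI(?term) &&
          STRSTARTS(STR(?term), "http://www.ontologyrepository.com/CommonCoreOntologies/"))
  FILTER NOT EXISTS { ?term ?required ?value }
}
ORDER BY ?required ?term
```

An empty result would mean every CCO term carries all three annotations. Dropping owl:NamedIndividual from the first VALUES list gives the variant in which individuals are not checked.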
@neilotte Currently, we perform a small set of validations as part of a release. The logical ones include checking consistency and ensuring a valid OWL DL profile. For metadata, all terms need at minimum a label and a curated_in annotation. All classes and properties are further required to have a textual definition, with a few exceptions. We do not enforce that individuals have definitions. Including a definition source is optional. Ontologies need a version IRI, label (title), description, content and code license, version number, and copyright (I think the plan is to remove that in subsequent releases). See below for more.
> cco:definition_source (with the exception of the modal relations ontology, where annotation properties declared in IAO are used)
Not required, but good practice when possible. Many terms in CCO do not have source info. In some cases this is hard to do, e.g., if the term is a product of group development with no clear or citable source. And it would be practically impossible to accurately provide them for older terms given the time that has passed since their creation.
> For measurement units (both classes and named individuals),
I don't think the measurement unit classes are specific enough to get SI_unit annotations. Can you provide an example of one?
> For classes and relations, the following is required: cco:is_curated_in_ontology
Named Individuals too.
> Around 500 resources lack definitions.
By my calculation, according to the criteria I listed above, there are 85 Object Properties that do not have definitions. ALL of these are subproperties of cco:has_familial_relationship_to. Given the clear naming pattern and obvious intended usage, creating textual definitions for all of these seems redundant. It's a small glitch at best. Perhaps some elucidation and a comment for the top property are warranted.
There are also 8 Datatype Properties missing definitions. Again somewhat self-explanatory, but
- [ ] I'll task myself with correcting those. Although note Alan's comment in #95, possibly we may delete some of these.
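Since the counts in this thread differ depending on the criteria used, a grouped count may make them easier to compare. The following is a sketch under assumptions: it only looks at cco:definition, it reuses the "repository" namespace filter from elsewhere in this thread, and it has not been run against the release files.

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX cco: <http://www.ontologyrepository.com/CommonCoreOntologies/>

# Number of terms of each kind that have no cco:definition.
SELECT ?type (COUNT(DISTINCT ?x) AS ?missing)
WHERE {
  VALUES ?type { owl:Class owl:ObjectProperty owl:DatatypeProperty }
  ?x a ?type .
  FILTER NOT EXISTS { ?x cco:definition ?def }
  # Keep only IRIs containing "repository" (the CCO namespace).
  FILTER (isIRI(?x) && regex(str(?x), "repository"))
}
GROUP BY ?type
```

A kind with nothing missing simply does not appear in the result. The totals depend on which files are loaded, for example whether the modal relations ontology is included.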
> The ROImport file lacks license and copyright statements, and the versionInfo statement for both it and the modal relations ontology has a different form from that of the other files.
- [ ] I'll double check MRO. The RO subset is less clear, I don't think it should have a copyright and any license info probably would be inherited from the source.
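For the ontology-level annotations, a query of the following shape would list which ontology headers lack which annotation. It is a sketch only: the property list is taken from the list proposed at the top of this issue, and whether the RO subset should be exempt is left open, so it will be reported if it lacks them.

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX cco: <http://www.ontologyrepository.com/CommonCoreOntologies/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# One row per (ontology, missing header annotation) pair.
SELECT ?ontology ?required
WHERE {
  # Header annotations assumed to be required on every ontology.
  VALUES ?required {
    rdfs:label owl:versionInfo
    cco:content_license cco:code_license cco:copyright
  }
  ?ontology a owl:Ontology .
  FILTER NOT EXISTS { ?ontology ?required ?value }
}
ORDER BY ?ontology ?required
```

This checks presence only. It would not detect that a versionInfo value has a different form from the others; that would need a separate pattern check.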
> A handful of units of measurement that should have an associated SI unit symbol appear to lack one.
I'm sure some were missed. Do you have a list?
> Around 1,000 resources lack definition sources (and the format of those presently within the files isn't uniform and doesn't follow a standard citation format (e.g. MLA, CMS, APA)).
As noted above, providing a source is currently a best practice that cannot be required in all cases. I am skeptical using a citation standard is helpful, but I'll leave that to others to weigh in on.
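One rough way to see how uniform the existing sources are is to list the values that consist of nothing but a URL. This is a minimal sketch, assuming sources are stored as plain literals on cco:definition_source; the URL pattern is an approximation and not a project convention.

```sparql
PREFIX cco: <http://www.ontologyrepository.com/CommonCoreOntologies/>

# Definition sources whose entire value is a single http(s) URL.
SELECT ?term ?source
WHERE {
  ?term cco:definition_source ?source .
  # Optional surrounding whitespace, then a URL with no spaces in it.
  FILTER (regex(str(?source), "^\\s*https?://\\S+\\s*$"))
}
ORDER BY ?source
```

Values that mix a URL with other text are not matched, so the result is a lower bound on the number of link-only sources.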
@swartik
> Even better would be to (automatically) test completeness each time a commit is pushed to github.
As noted above, these tests are run automatically as part of a release. It may be helpful to include, for each release, a list of tests that failed but were not corrected, e.g., all the familial relations without definitions.
> It'd be useful to have a list like this maintained somewhere.
I agree.
- [ ] I will draft a summary of the release process, mostly the tests done and how we version the ontologies, and then post it to a wiki page. I think it would be helpful to have it documented so folks can discuss and revise it as part of the ongoing MLO development process.
@mark-jensen
- “All classes and properties are further required to have a textual definition, with a few exceptions.”
—I think all classes and properties should have a textual definition, without exception. If the definition is trivial to state, that’s a good reason to state it. One shouldn’t have to assume from the label how a resource is being used (I’m also now seeing that the use of hyphens in the labels is also inconsistent. See ‘has father in law’ vs. ‘is in-law of’). The top relation ‘has familial relationship to’ can hold either legally or by ancestry and so it’s unclear whether it is intended that these sub relations maintain that open range. For instance, ‘is descendent of’ sounds to me to only apply in the ancestral linkage case and not in the legal sense (e.g. Are children acquired through marriage one's descendants?).
- “Many terms in CCO do not have source info. In some cases this is hard to do, e.g., if the term is a product of group development with no clear or citable source. And it would be practically impossible to accurately provide them for older terms given the time passed since creation.”
—In such cases, I would advocate that the editor at CUBRC be the cited source of the term. For instance, “‘has father in-law’. Jensen, Mark. CUBRC, Inc. January 1, 2021.” For those there now, you could just do a quick review of them and insert a date prior to a release.
- “I am skeptical using a citation standard is helpful, but I'll leave that to others to weigh in on.”
—Right now, some of the citations appear to follow some sort of citation format, while others are mere hyperlinks, often to wikipedia. This leaves it unclear when the source was consulted. Adherence to a standard citation format would add an ‘accessed on’ date to correct for this. This could be added to the files with a SPARQL insert and then manually checked, so it’s not a huge ask. As an aside: I often hear the complaint that CCO only cites wikipedia for many of its classes. I think these complaints are misguided and that wikipedia is actually a very appropriate source for mid-level classes, but it would help to defuse these complaints if the format of the citation itself were less haphazard.
- Regarding the number of missing definitions, I count 203 object properties missing definitions (this number is high because it includes the corresponding relations in the modal relation ontology).
```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX cco: <http://www.ontologyrepository.com/CommonCoreOntologies/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Object properties that have no cco:definition.
SELECT ?x
WHERE {
  ?x a owl:ObjectProperty .
  FILTER NOT EXISTS { ?x cco:definition ?def }
  # Keep only IRIs containing "repository" (the ontologyrepository.com namespace).
  FILTER (regex(str(?x), "repository"))
}
```
I agree you don't want definitions for named individuals. I guess I’d ask then: What are the minimal required annotations for individuals? Right now, I see rdfs:comment is used on many of them. In some cases, there is an excellent description of the instance with a citation in the value of the comment. In other cases, there’s either nothing or a sentence fragment. I'd advocate something more like a short description of 1-4 complete sentences, maybe using a different annotation property since rdfs:comment has a pretty wide range.
- Regarding missing SI unit symbols, see for instance cco:SquareMeterMeasurementUnit and cco:CubicMeterMeasurementUnit. There are 7 base units and 22 derived units in the SI — I wonder too if all of these are among the measurement unit instance list.
https://physics.nist.gov/cuu/Units/units.html https://en.wikipedia.org/wiki/International_System_of_Units
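A candidate list of units without a symbol could be produced with something like the query below. It is a sketch with two loud assumptions: that the units are individuals typed under a class named cco:MeasurementUnit (that class name is assumed here, not checked against the files), and that subclass links are asserted directly so that no reasoner is needed.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cco: <http://www.ontologyrepository.com/CommonCoreOntologies/>

# Units of measurement that carry no cco:SI_unit_symbol.
SELECT DISTINCT ?unit ?label
WHERE {
  # cco:MeasurementUnit is an assumed superclass name.
  ?unit rdf:type/rdfs:subClassOf* cco:MeasurementUnit .
  OPTIONAL { ?unit rdfs:label ?label }
  FILTER NOT EXISTS { ?unit cco:SI_unit_symbol ?symbol }
}
ORDER BY ?unit
```

The result will also include units that are not SI units at all, so it is a starting point for manual review rather than a list of errors.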
> I think all classes and properties should have a textual definition, without exception.
I agree, especially for an MLO standard. So the question becomes: do these relations seem like reasonable candidates for inclusion in an MLO? If so, then we add definitions and clean up the labeling. If not, then somebody needs to take charge of making a proposal to the WG to have them removed.
> I would advocate that the editor at CUBRC be the cited source of the term
I am not against adopting a standard way of adding and formatting definition sources. However, offering a person as the source raises the question of the role of other annotations like term editor, creator, or collaborator. I think some of these decisions can be worked out here, but they will also require an advocate to take them to the WG as a proposal to standardize annotations, which I believe will be an important part of the MLO process.
> you could just do a quick review of them
As it sits, CCO has 2000+ terms. A quick review seems unlikely! ;-)
> I count 203 object properties missing definitions (this number is high because it includes the corresponding relations in the modal relation ontology).
The extra properties above the 85 familial ones without a cco:definition are all in MRO. This is because those few extra properties in MRO without definitions are the ones derived from RO that use the IAO_0000115 property (we just copy the term declarations into MRO, only changing the namespaces of object and datatype properties).
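To reconcile the two counts, the earlier query can be extended so that a property counts as defined if it has either annotation. This is a sketch, assuming the RO-derived properties in MRO carry the definition directly on obo:IAO_0000115 as described above.

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX cco: <http://www.ontologyrepository.com/CommonCoreOntologies/>
PREFIX obo: <http://purl.obolibrary.org/obo/>

# Object properties with neither a cco:definition nor an IAO definition.
SELECT ?x
WHERE {
  ?x a owl:ObjectProperty .
  FILTER NOT EXISTS { ?x cco:definition ?def }
  # IAO_0000115 is the IAO "definition" annotation property.
  FILTER NOT EXISTS { ?x obo:IAO_0000115 ?iaoDef }
  # Same namespace filter as the original query.
  FILTER (regex(str(?x), "repository"))
}
```

If the explanation above holds, the properties left over should be the familial ones and their counterparts in the modal relations ontology.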
@neilotte Thanks for raising these issues. I think addressing them will require some prioritization. Making sure the SI unit annotations are complete, or determining and implementing a standard for sources or editors, may be delayed in favor of addressing the lack of definitions or other content-specific issues. That will partly depend on interested people raising the concern and then making a formal proposal to the WG. Without a doubt, though, annotation standards for the ontology and terms, as well as the specifics of versioning and so on, will be an important part of refining the MLO standard. I'll follow up here once I've captured what we do now on a wiki page.