CommonCoreOntologies
In IRIs, use opaque identifiers instead of english labels
OBO Policy was designed for good reasons.
First, by using interpretable labels you potentially alienate or confuse users in different communities where terms are known by different names.
Second, we want our ontologies to be used worldwide, and using English in IRIs is not welcoming to non-English speakers. The sanctioned mechanism for providing user-readable labels is to use rdfs:label or SKOS properties, with literals that carry language tags.
Third, there will inevitably be cases where words are spelled wrong, or disputed, which makes for pressure to "fix" the IRIs. Unfortunately, such fixes are typically breaking changes to users.
Strongly agree.
The main argument for keeping English-readable labels in the IRI is that it makes it easier for developers to recognize a URI on sight. However, it only requires a little re-orientation (e.g. an extra line in a SPARQL query to return labels) to get around this, and if this is really insufficient, then a local mapping could be applied to give all opaque identifiers readable URIs within a local environment. The utility of recognizing URIs at a glance is also undermined, to some degree, when the words in a URI are ambiguous (e.g. 'Document' referring both to a noun and a verb--documenting), and developers navigating the ontologies by URI alone may inadvertently introduce errors in this way.
There are different strategies for versioning ontologies, but I think this also greatly improves trust in each release. If the extension of a URI changes between versions, it is much easier for end users to have this URI deprecated and replaced, rather than require the end user to treat each URI as distinct from the last w.r.t. each release of the ontology with a new version IRI. However, if there are human readable labels in the URI, then there is considerable pressure to maintain the URI. This means IRIs in subsequent versions of the ontology can't really be trusted by default, since their extensions may have changed between versions while the URI remained the same.
I couldn't agree more.
We agree that by using an alphanumeric IRI similar to BFO's, we increase the potential for international adoption of the standard. This is important from a DoD perspective when encouraging NATO and other non-English-speaking allies to conform to the standard. We also agree that it reduces the potential for ambiguity (and endless debate) over term labels if we already agree on the concept being captured by the entity and its place in the taxonomy.
However, we are concerned that the alpha-numeric IRIs make it difficult for users to work with some tools that display the IRI only (e.g., ONTOP) in their user interfaces so we appreciate efforts to reduce the related burdens to adoption.
Once we get some lessons learned from this process, we will probably follow suit with the DICO.
I am on the fence as to the value of switching. I can see both sides, perhaps with some preference towards the use of numerics. There will be a fair bit of upfront work in making the switch, e.g., to workflows and development tools, visualizations based on IRIs vs labels, etc. If agreement is reached, I will advocate that enough time is provided to test and validate before we fully adopt the new IRI standard.
Some initial thoughts on consequences for existing user-applications:
Actively used applications may not need to be updated, provided their data models are stable, there is little or no ongoing development, and no integration is happening with external data sources that use the new IRIs. If such integration is happening, then assuming the external sources are not using new content from CCO (a dangerous assumption), perhaps a simpler and more cost-effective solution would be to create a separate mapping to integrate the new IRIs, rather than updating the legacy code to use them. I guess the cost/benefit would depend on the complexity of the application.
If an actively used system is in development, i.e., new data sources are being mapped or content updates are desired, then updating to use the new IRIs is necessary. However, a potential solution that doesn’t require updating code to use the new IRIs is to take new releases of CCO with updated IRIs and convert them back to IRIs that use natural language, mirroring the old IRI style and thus matching its use in the legacy code. This goes a step further than the mapping noted above: it uses a mapping to transform the ontology before it is embedded into code for development, essentially creating something akin to an IRI-normalized version of CCO, one that allows new releases to merge with unmodified code that used the old IRIs.
cco:CCO_00002021 [Act of Communication] >>> cco:ActOfCommunication
Terms introduced since the update, i.e., those for which no old IRI existed, could also be straightforwardly converted.
cco:CCO_0023828 [Act of Fostering] >>> cco:ActOfFostering
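As a rough illustration of the conversion shown in the two examples above, here is a minimal Python sketch that derives a readable local name from a term's label. It assumes the old-style class names were simply the label's words capitalized and run together, which matches these two examples but has not been checked against the whole of CCO; `label_to_local_name` and `readable_curie` are hypothetical helper names, not part of any CCO tooling.

```python
import re


def label_to_local_name(label: str) -> str:
    """Join the words of a label, capitalizing the first letter of each.

    'Act of Communication' becomes 'ActOfCommunication'.
    """
    words = re.findall(r"[A-Za-z0-9]+", label)
    return "".join(word[:1].upper() + word[1:] for word in words)


def readable_curie(label: str, prefix: str = "cco") -> str:
    """Build a prefixed name in the old, natural-language style."""
    return f"{prefix}:{label_to_local_name(label)}"


if __name__ == "__main__":
    print(readable_curie("Act of Fostering"))  # prints cco:ActOfFostering
```

Two labels that normalize to the same string would collide, so a real conversion would need to check the generated names for uniqueness.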
Thinking about the actual transform of existing code to use the new IRIs, I am trying to find edge cases that could create problems. I currently don’t see any. It seems that a simple script would suffice: search each file line by line for two IRI patterns (full IRI and prefixed), look up each match in a key/value mapping of old to new term names/IDs, swap, and write the result to a new output file. I suppose the actual prefix used in some files may differ (we can’t assume it will always be “cco”), but that’s still fairly simple to work around. Can anyone think of more complex cases where a general script like this fails?
It seems the actual transforms will be easy. But the testing and validation, I think, will be more onerous and costly for users. And of course, for users who rely on the name/ID being readable, e.g., those writing lots of SPARQL or using visualization tools such as KARMA, the change will induce a steep learning curve and perhaps require new tooling. That’s the biggest reason I can see for not making the change.
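The script described above could be sketched roughly as follows in Python. The namespace is the one that appears elsewhere in this thread; the mapping entry reuses the example given earlier; the assumption that new IRIs keep the same namespace, and the function names `build_pattern` and `rewrite_text`, are illustrative only.

```python
import re

# Namespace taken from the Document example in this thread.
OLD_BASE = "http://www.ontologyrepository.com/CommonCoreOntologies/"


def build_pattern(old_names, prefixes):
    """Match an old local name after either the full base IRI or a known prefix."""
    # Longest names first, so a short name never shadows a longer one.
    names = "|".join(re.escape(n) for n in sorted(old_names, key=len, reverse=True))
    heads = "|".join([re.escape(OLD_BASE)] + [re.escape(p + ":") for p in prefixes])
    # The lookarounds stop matches in the middle of a longer token.
    return re.compile(rf"(?<![\w:/])({heads})({names})(?![\w-])")


def rewrite_text(text, mapping, prefixes=("cco",)):
    """Replace old local names with new IDs, keeping the IRI head unchanged."""
    if not mapping:
        return text
    pattern = build_pattern(mapping, prefixes)
    return pattern.sub(lambda m: m.group(1) + mapping[m.group(2)], text)


if __name__ == "__main__":
    mapping = {"ActOfCommunication": "CCO_00002021"}
    print(rewrite_text("?x a cco:ActOfCommunication .", mapping))
```

Cases a sketch like this would likely miss or mishandle include relative IRIs resolved against a base declaration, files that bind the namespace to the empty prefix, and old names that appear inside string literals or comments, which would be rewritten along with everything else.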
I agree that the Common Core Ontologies should convert to use opaque identifiers as IRIs, on the condition that before such a version of the CCO is released, some effort is made on 1) testing conversion scripts on files generated and used by a variety of different applications and 2) reaching out to developers, both open source (e.g. OnTop, KARMA) and commercial (e.g. ontotext), to ask for the ability to switch from viewing IRIs to viewing labels in a chosen language.
Agree both with the goal of switching to alphanumeric IRIs and the need to test/validate/manage the change so as not to unduly break things for CCO consumers.
But this could be done in stages, right? Would the team consider publishing a v2.0 (with alphanumeric IRIs) to run in parallel with the current v1.3? CUBRC could provide the info necessary to map from one to the other (whatever that might look like... e.g. could be an annotation in the ontology itself). But CCO consumers would then have a specified timeframe (e.g. a year) where they could migrate, test/validate, etc, but without anyone forcing them to convert to v2.0 immediately. Just food for thought.
@rorudn What would be included in "the variety of different applications"? And are there known issues with opaque IDs that have been encountered with OnTop, Karma, or Ontotext? Given how common the practice is of using opaque IDs in URIs, I guess I'd be a little surprised if one of the widely used tools really required human readable IDs presently.
@mark-jensen Regarding "However, a potential solution that doesn’t require updating code to use the new IRIs, is to take new releases of CCO with updated IRIs and covert them back to IRIs that use natural language, mirroring the old IRI style and thus matching its use in the legacy code. " I think that makes sense. If you published a table of such a mapping in the initial release switching to opaque ids, this should be sufficient for anyone requiring human readable URIs to continue doing so for the time being. Sounds like a good solution. (Just a thought too: since the user base is growing, it might be nice to have a mailing-list for CCO that users could hop on or off of. This would allow you to survey your user base, understand their needs, and make announcements like this).
The DICO team likes the approach outlined above by Mark and Brian, as well as Neil's recommendation for how the community might be able to better share lessons learned and tips for dealing with the opaque IRIs as they are being implemented. We will undoubtedly run into issues as simple as reading a .ttl file directly (which is a useful way to show people how straightforward the modeling is).
I'm not specifically opposed to changing to alphanumeric IRIs; however, I think that some of the arguments against continuing to use natural language IRIs are not as strong as they may initially appear.
Alan's first and second points are versions of the same claim, namely: human-readable English IRIs are not helpful to some ontology users. While true, the solution in both cases is the same as for alphanumeric IRIs -- use the rdfs:label. For example,
http://www.ontologyrepository.com/CommonCoreOntologies/Document
currently has the annotation:
rdfs:label [language: en] "Document"
Any ontology that will be actively used by a group of non-English speakers should also have a complete set of rdfs:label annotations with values from the language in question, e.g.:
rdfs:label [language: es] "el documento"
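To make the language-tag mechanism concrete, here is a small Python sketch of how a tool might choose which label to display. It is plain Python with no RDF library: labels are represented as (text, language tag) pairs, and `pick_label` is a hypothetical helper written for this illustration.

```python
def pick_label(labels, preferred_langs=("en",)):
    """Return the first label whose language tag matches the preference order.

    `labels` is a list of (text, lang) pairs; lang may be None for untagged
    literals. Falls back to the first label, or None if there are no labels.
    """
    for lang in preferred_langs:
        for text, tag in labels:
            if (tag or "").lower() == lang.lower():
                return text
    return labels[0][0] if labels else None


if __name__ == "__main__":
    document_labels = [("Document", "en"), ("el documento", "es")]
    print(pick_label(document_labels, ("es", "en")))  # prints el documento
```

The same lookup works whether the IRI behind the labels is `Document` or an opaque identifier, which is the point being made here.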
With the exception of users whose primary language uses a non-Latin alphabet, I would contend that the use of English-based IRIs is not specifically less friendly to non-English speakers than alphanumeric IRIs are to everyone. That being said, I would NOT want to try to type out IRIs in Cyrillic, Arabic, or Chinese characters. Additionally, while it is typically easier for English speakers to remember an English word or phrase than to remember a quasi-random 7-digit number, I grant that extra-long IRIs can be cumbersome in their own right simply because of their length in comparison to a standardized 11 character local name (e.g. CCO_0123456).
Regarding different communities using different terms differently, if the local IRI isn't specific enough, hopefully the rdfs:label is. If, however, that also fails to satisfy, that's what we have more specific annotation properties for. Specifically, 'alternative label' is used frequently in the CCO to help address this issue. For example, 'Combustion' includes 'Combustion Process' and 'Burning Process' as alternative labels. If that is still insufficient, users are free to create their own preferred label annotation property to use for their specific project. The point here is that, whichever solution is used to handle community-based terminology disagreements, it will be the same solution regardless of how the IRI is structured.
Alan's third point -- misspelled IRIs -- is a fair criticism that only applies to human-readable IRIs. However, given that every new CCO term must be vetted by a working group composed of highly motivated and detail-oriented volunteers, I doubt that this situation will arise frequently enough for it to outweigh the benefits of human-readable IRIs. Furthermore, depending on the situation, the term can either (a) be deprecated and replaced, or (b) forever remain misspelled (with a corrected rdfs:label if necessary).
Neil argues that ambiguity can undermine the utility of human-readable IRIs. Fair enough, but since this issue also applies to rdfs:labels, the solutions are the same in both cases. Namely, developers should design the IRI and label of each term to be sufficiently unambiguous and should provide quality human-readable definitions for each term. In my experience using OBO Foundry ontologies and reviewing projects that use them, there are inevitably errors caused by users frequently not looking at more than just the term label. For example, when one ontology uses 'Hospital' to represent the healthcare facility, another ontology uses 'Hospital' to represent the healthcare organization, and a user decides that both terms are equivalent in their application. This is a simple example that could be avoided by using more precise labels (e.g. 'Hospital Facility' and 'Hospital Organization'), but avoiding all such problems requires more than just a well-designed ontology.
The most compelling argument I've seen against using natural language IRIs is Neil's point about what should happen if we change the meaning of a term significantly enough that we decide to deprecate it. We could choose to keep the IRI and make a note of the change in the release notes, but Neil points out that doing so could cause users who don't check the release notes closely enough to start using the term incorrectly. If instead we deprecate the term, users will be forced to resolve the issue when their models, queries, etc. break due to the obsoleted IRI. This approach is arguably more user-friendly, but it could put the developers in an awkward situation if natural language IRIs are used because term 'X' is now unavailable (at least for the current release) to be used. This means that another, perhaps less than ideal, term must be used instead. I expect that this sort of scenario will occur very rarely for mature ontologies, however it is an awkward situation to be in and it does not affect ontologies that use alphanumeric IRIs.
As has been pointed out by at least Mark, Ron, and Forrest, the main reason for using natural language IRIs is to facilitate the use of the ontologies. Not every semantic tool is currently built to leverage rdfs:label annotations, programmers find it easier to work with meaningful IRIs, and writing queries, mappings, etc. using natural language IRIs is significantly faster/easier. Granted, there are workarounds for at least some of these use cases, but more needs to be done to increase support for developers and users alike.
One such workaround is the use of custom prefixes for individual terms in SPARQL queries. For example:
PREFIX has_part: <http://purl.obolibrary.org/obo/BFO_0000051>
and then we can write, e.g.:
?s has_part: ?p .
instead of:
?s obo:BFO_0000051 ?p .
in our query.
This solution works, but it adds more work for query writers because every new term used in a query means looking up and adding a new prefix. This burden can be partially mitigated by keeping a file of common prefixes handy to be copied and pasted into queries, but maintaining such a file is a burden in itself. Furthermore, in cases where a query (without individual term prefixes) is already 100+ lines, implementing the prefix solution only increases the length of the query and the complexity of maintaining and troubleshooting it.
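One way to reduce the bookkeeping described above is to add the per-term prefix declarations programmatically. The Python sketch below assumes terms are written in the query as a prefix with an empty local part (e.g. `has_part:`) and that a dictionary of term names to IRIs is maintained somewhere; `TERM_PREFIXES` and `add_term_prefixes` are hypothetical names used only for illustration.

```python
import re

# Hypothetical lookup table of readable names to opaque IRIs.
TERM_PREFIXES = {
    "has_part": "http://purl.obolibrary.org/obo/BFO_0000051",
}


def add_term_prefixes(query, term_prefixes=TERM_PREFIXES):
    """Prepend a PREFIX line for every known term used but not yet declared."""
    declared = set(
        re.findall(r"^\s*PREFIX\s+([\w.-]*):", query, flags=re.IGNORECASE | re.MULTILINE)
    )
    needed = [
        name
        for name in term_prefixes
        if name not in declared
        and re.search(rf"(?<![\w:.-]){re.escape(name)}:(?![\w/])", query)
    ]
    header = "\n".join(f"PREFIX {name}: <{term_prefixes[name]}>" for name in needed)
    return f"{header}\n{query}" if header else query


if __name__ == "__main__":
    print(add_term_prefixes("SELECT ?s ?p WHERE { ?s has_part: ?p . }"))
```

This does not shorten the final query, so the concern about very long queries still stands; it only removes the manual lookup step.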
@neilotte The KARMA mapping tool displays term IRIs to the end user in the process of mapping data. I understood @harefb to be saying that this is true also of the OnTop tool. Other applications that I think should be encouraged to facilitate ease of use with opaque IRIs are SPARQL query editors and programming IDEs. BTW, in my opinion you underestimate the difficulty users will experience with SPARQL query editors when using ontologies with opaque IRIs. In my estimate building queries using the usual workarounds when the number of terms enters the hundreds will end in a result that is difficult to comprehend, which hinders sharing and debugging, and which can break length constraints.
@rorudn There's a comment at the bottom of this thread (https://github.com/usc-isi-i2/Web-Karma/issues/217) indicating KARMA can display by rdfs:label. I'm not a regular KARMA user these days so can't verify this myself. Dave Lutz would be a good person to reach out to regarding label rendering in OnTop.
I could be underestimating the difficulty. I'd be interested in seeing an example of the sort of query that would be difficult to translate. Right now, the SPARQL interface in GraphDB allows for autopopulating prefix statements AND automatically recognizing resources specified within a prefix statement and populating a dropdown within the query interface. This makes for a fairly intuitive interface for query building, even with opaque identifiers.
Draft Motion: The working group will test and evaluate use of a version of the CCO having opaque identifiers. Testing will be end-user testing of the version in a limited number of applications known to be used by consumers of the CCO. Evaluation will be the generation of a sample of conversion scripts that update files created in the test applications using current or past versions of the CCO to the test version of CCO. Upon completion of the testing and evaluation the working group will consider ("vote on") the adoption of the new version.
@neilotte I was unaware of both the adoption of labels in KARMA and the capability to use labels in the SPARQL editor of GraphDB. It even allows the choice of language tags. Very cool, thanks.
The ONTOP mapping interface does pull from the IRIs (everything after the “/” or “#”). I do not think you can configure it to render labels in the mapping interface. It is fairly rudimentary since it is open source.
And, of course, there is the fact that it’s nice when you can read the line directly in the .ttl file with a text editor and not require some tool to parse the verbiage to understand the triples. I would like to think that was one of the principles that drove the development of the semantic web in the first place.
All that said, we still support the move to alphanumeric designators. The fact that I can understand an IRI by reading the words in it means that I can also get caught up in my own interpretation of what the word means, which might be different from the underlying intent of placing the entity in that “spot” in the ontology. Good semantics getting in the way of good semantics, semantically speaking…
V/R Forrest
Revised Draft Motion: We agree, in principle, to convert CCO to opaque identifiers, pending further testing.
I agree with the revised motion.
@neilotte @harefb @bdonohue29 I am sure a branch to start testing is coming soon. Re providing a mapping between old and new IRIs, as users, would you prefer a simple two-column .csv, or something RDF-based, such as a supplemental file containing equivalency axioms or use of an annotation prop on new terms (e.g., CCO_0000001 legacy_term_name "ActOfDating")?
DIA/SAIC team agrees with revised motion. If the terms all still have the English language labels, I think we could just make our own two-column CSV files for the cheat sheets. But a concatenated annotation property might be useful too.
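The two mapping formats under discussion are easy to convert between. As a hedged sketch, the Python below turns a two-column CSV (new ID, old name, no header row) into Turtle annotation triples. The `legacy_term_name` property is the example name floated above, not an existing CCO property, and placing both it and the new IDs in the current CCO namespace is an assumption made only for illustration.

```python
import csv
import io

# Assumed namespace for both the new IDs and the hypothetical annotation property.
CCO = "http://www.ontologyrepository.com/CommonCoreOntologies/"


def csv_to_turtle(csv_text, annotation="cco:legacy_term_name"):
    """Convert rows of 'new_id,old_name' into Turtle annotation triples."""
    lines = [f"@prefix cco: <{CCO}> .", ""]
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        new_id, old_name = (cell.strip() for cell in row)
        # Escape backslashes and quotes so the literal stays valid Turtle.
        literal = old_name.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'cco:{new_id} {annotation} "{literal}" .')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(csv_to_turtle("CCO_0000001,ActOfDating\n"))
```

Publishing the CSV and generating the RDF form from it (or the reverse) would let consumers pick whichever suits their tooling.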
Some further comments.
Tooling built for people to interact with can be engineered in a way that supports readable labels, alternative or community-chosen labels, and the ability to switch among them. Some of the objections to this proposal boil down to "yes, but that takes effort". Indeed it does, but not an inordinate amount. SPARQL queries are often written by hand and so are considered to be a pain point, but there are at least two reasonable workarounds. The first, which @neilotte mentions, is to add a line to the query to retrieve the IRI based on the label. The other, as @apcox suggests, is to use a convention of defining readable prefixes for URIs. An example can be seen in an appendix to a paper about one of my projects.
Such prefixes look like they would be extra work to add, but they don't need to be added manually. Many SPARQL front ends have facilities to autocomplete prefixes, and that can be taken advantage of. I've used YASGUI (https://github.com/TriplyDB/Yasgui) and did a prototype at one point that did this, automatically adding the prefix definition when the prefix was autocompleted within a query. IIRC there is a similar facility in GraphDB's query editor. In other cases I've written code that constructs SPARQL queries programmatically, and in that case the prefixes are added as needed. Similarly, for display of results, one can bind the labels in the queries or process the results (again within a SPARQL user interface) to automatically replace IRIs in results with hyperlinked labels.
In most other cases there really ought not be exposure of developers to raw IRIs in the first place. In most cases, one shouldn't hand edit RDF as it is too prone to errors. Instead APIs that generate the RDF, of which there are several and in different programming languages, will be preferable. So, I don't think it's nice to be able to read a ttl file. I suspect a rather small minority of the eventual audience for CCO will consider that a benefit. In order to achieve wide adoption tooling will, in any case, need to be developed in a way that is relatively easy for consumers to work with.
As @APCox notes, using labels in software, as opposed to IRIs, is still open to issues of labels changing over time. In the ideal case, software is written to establish the link between label and IRI at the time of authoring, maintain the IRI as the primary identifier when documents are stored, and dynamically add back labels on display. Here too there are choices. One can always display the same label as was originally used, with confidence that the correct IRIs will be maintained, or allow dynamic choice of label source and display the most current label from a given source.
Getting in the habit of tooling software to be friendly on entry or to display using labels is just good practice. It yields benefits as soon as one expands to a wider community of users and collaborators. Investments are relatively minor when viewed in a context of an ecosystem that is intended to last, or even compared to the number of person hours that will be spent on CCO's standardization.
Even if CCO keeps using labels in IRIs, tooling will still have to be developed to use opaque IRIs because other ontologies, including BFO, are using them. I've seen too many cases, now, where BFO terms are incorporated without labels, and tools that assume IRIs include labels end up displaying the opaque IRIs when there is a perfectly good label. With a uniform practice of using opaque IRIs, it won't be the case that ontologies that use them are at a disadvantage.
@mark-jensen proposes tooling that creates label-style IRIs from the opaque IRIs. I think that's a very bad idea. What we absolutely don't want is to have terms that mean the same thing but with different IRIs.
I support @bdonohue29's suggestion that for the purposes of having a manageable transition, the version number is bumped and the last label IRI version continues to be available, but is not further developed. However, it should be made clear that any work that needs to use a new term only available in the new version, or which depends on interchange of CCO structured data needs to adopt the opaque IRI version. OWL's owl:incompatibleWith can be used to make clear that the two versions are not compatible.
@rorudn's revised draft motion is a good way to start.
See also #108, #109 which also pertain to the form of IRIs.
During today's MLO meeting, Brian Haugh suggested that another suggestion I made - to use foaf or other external ontology terms in some cases - was inconsistent with my view on opaque IRIs, since, for example, foaf does use natural language in the IRIs.
I don't think these views are inconsistent. The suggestion is that we shift towards always using labels and hiding IRIs. There are several points I make, but one of them is that tooling be built to uniformly use labels. If we use external terms that use natural language and don't have a label, the idea would be to assert a label for them. Doing that means we can have a uniform policy that labels are available and should be used in tools.
In the near future, we shall be creating a branch with the numeric IRIs for testing. Feedback is requested before we do.
- Following OBO, the format is: CCO_0000000-CCO_9999999
- Considering the idea of adding meaningful ranges to the assignment of the numeric IDs, e.g., all Information Content Entities get CCO_0000001 - CCO_00005000, etc.: The more I think about it, the less enamored I am of the idea. It’s not clear what practical benefit this will actually bring in usage. It leads to bookkeeping overhead, may slow down adding new content if editors have to confirm which range to create new IRIs in, and will require additional validations before release. It’s less extensible than simply randomizing the process, for it could break as the ontologies grow or get refactored, e.g., as IRIs get deprecated. Other thoughts on pros/cons?
- CCO has named individuals in the ontologies, e.g., for measurement units. I assume these too should be made numeric?
- Same goes for annotation and data properties; presumably we should replace them as with object properties? CCO created its own annotation properties, roughly matching the IAO ones used in BFO, keeping only RDFS label and comment, mostly just because of the convenience of having readable IDs. I noticed BFO2020 uses SKOS for annotations now, which have language-based IDs. @alanruttenberg was that change due to a requirement of the ISO standardization process, or for some other reason? As an extension of BFO and an upcoming standard, what do you recommend for CCO? Should we switch to use of SKOS rather than numericise our annotation props?
I agree with your view on meaningful ranges. It seems to me to make more sense to assign ranges to editors that they can then use for auto-generation.
An easy way to convert could be hash(IRI) % (10 ** 7) in Python (although there might be collisions).
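One caveat with the one-liner above: Python's built-in `hash()` on strings is salted per interpreter process by default, so it would not give repeatable IDs across runs. A minimal sketch using a stable digest from `hashlib`, with simple linear probing to resolve the collisions mentioned, might look like this (the `CCO_` prefix and seven digits follow the format discussed in this thread; the function names are illustrative):

```python
import hashlib


def numeric_id(iri, digits=7):
    """Derive a repeatable number below 10**digits from an IRI."""
    digest = hashlib.sha256(iri.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (10 ** digits)


def assign_ids(iris, digits=7):
    """Assign a unique CCO_-style ID to each IRI, probing past collisions."""
    iris = sorted(set(iris))  # fixed order keeps the result repeatable
    space = 10 ** digits
    if len(iris) > space:
        raise ValueError("more IRIs than available identifiers")
    taken = set()
    assigned = {}
    for iri in iris:
        n = numeric_id(iri, digits)
        while n in taken:  # collision: try the next free number
            n = (n + 1) % space
        taken.add(n)
        assigned[iri] = f"CCO_{n:0{digits}d}"
    return assigned
```

Because collision handling depends on which other IRIs are present, the resulting table should be generated once and stored rather than recomputed as terms are added.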
@mark-jensen
- I suggest using hyphens in the local IDs rather than underscores, e.g. CCO-0000000 rather than CCO_0000000. This is a common best practice for URLs.
- Recommend maintaining a registry in your local dev environment where you can reserve a URI in the range CCO-0000000 to CCO-9999999. Every time you want a new one, you'd grab it from the registry, and then you can use it and no one else can. This would preclude the need to maintain different ranges for different domains, which could get messy.
- Maybe one structure for classes and properties and a different prefix for individuals? CCI-0000000?
- Definitely on board with using skos, dcterms, IAO, and other annotation properties wherever appropriate.
- The rationale for using a hyphen instead of an underscore is that hyphens are parsed as spaces by search engines but underscores are not. That works well for search in cases where the separated things are meaningful on their own, but in this case they are not. It is probably not desirable to return pages with the term CCO in one place and 0000000 in another place if you search for CCO-0000000.
- A registry is nice, but it has to be engineered to make sure there aren't race conditions that might result in two people getting the same id. It would be nice if this was something built in to Protege. People sometimes use GUIDs because then you can allocate, without coordination, ones that are highly improbable to collide. In OBO, IIRC, the ranges were allocated to groups maintaining different parts of the ontology. In that case it is easier to coordinate on who gets an id. Grouping by upper level type might make coordination harder.
- Numeric yes. Neutral on different prefixes. There's actually an argument that we shouldn't use meaningful prefixes because it results in a social cost if terms need to be moved to other ontologies where they can be better maintained. It's not rational, but people definitely get attached to the idea that all the terms in an ontology have the same prefix. I think that's why the handle system used numeric prefixes.
- +1
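The two allocation strategies raised in this exchange, a coordinated registry and uncoordinated GUIDs, can be sketched in a few lines of Python. This is only an in-process illustration: the lock guards against threads in one program, whereas a registry shared by several editors would need the same guarantee from a shared service or database. `IdRegistry` and `uncoordinated_id` are hypothetical names.

```python
import threading
import uuid


class IdRegistry:
    """Hands out sequential IDs; the lock ensures no ID is issued twice."""

    def __init__(self, prefix="CCO", start=0, limit=10 ** 7):
        self._prefix = prefix
        self._next = start
        self._limit = limit
        self._lock = threading.Lock()

    def reserve(self):
        # Read-and-increment must happen atomically to avoid a race.
        with self._lock:
            if self._next >= self._limit:
                raise RuntimeError("identifier space exhausted")
            value = self._next
            self._next += 1
        return f"{self._prefix}_{value:07d}"


def uncoordinated_id(prefix="CCO"):
    """GUID-style alternative: no registry needed, collisions highly improbable."""
    return f"{prefix}_{uuid.uuid4().hex}"
```

The trade-off is the one described above: the registry keeps IDs short and in the agreed seven-digit format but needs coordination, while GUID-style IDs need none but are much longer.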
@alanruttenberg @neilotte @bdonohue29 @eliasweatherfield @harefb
A version of CCO using numerics is now available for testing here
There is a mapping file here
We stuck with underscores and made no meaningful ranges of IDs to separate entities. One thing that did come up in discussion was the idea of making properties with inverses have IDs in sequence, which seems to be fairly common in OBO-land. I can see some benefit to grouping them like that, but only a small one to users that routinely interact with IRIs in certain ways.
Please follow back with any ideas for revision, or concerns, potential problems and so forth.
I find the proposal to use numeric URIs in CCO unreasonable for the following reasons:
- The rationale for using numeric URIs is completely undercut in OWL ontologies by the use of rdfs:label to provide a standard label for all elements of an ontology. Use of rdfs:label has the effect of establishing a common name for each element. Thus, if non-English language users have an objection to an English language IRI, they would have the same objection to a standard English language label. Using numeric URIs that are difficult to comprehend just moves the standard name issue to rdfs:label.
- It is useless in OWL to try and have multiple alternative values for rdfs:label since it would be difficult for any tools to distinguish between them and identify whatever preferred label any particular community would like to display to users.
- Communities that want to display different labels do well to use different label annotation properties for their community so that they are readily distinguished and easy to select for display. Protege, in particular, provides mechanisms for identifying preferred label annotation properties for display in Protege.
- When resolving errors or warnings from reasoners, it is frequently necessary to use a text editor to find the source of the problem in the ontology file. But it is very difficult to recognize classes and properties that use purely numeric URIs when using a text editor. Having an English language term that is readily recognizable greatly facilitates handling such issues in a text editor.
- There are also other tools that do not display labels in their user interface. Some are cited above. I have encountered this in a Natural Language Processing tool, which reported NLP extractions from text using the URIs from ontologies. These were difficult to follow when BFO classes were involved.
@BrianHaugh I find your arguments rather unpersuasive. To address each in turn:
- "Use of rdfs:label has the effect of establishing a common name for each element." If by "common" you mean "standardized," this is simply false. The rdfs:label property may be used in this way to try to establish a globally shared linguistic term (as in a glossary, lexicon, etc), but it needn't be, shouldn't be, and in reality, can't be, because there is no such thing as globally standardized terms. (Thanks a lot, Tower of Babel.) Practically speaking, an ontologist must select a default term for something, but this is by no means the best, clearest, or exclusive way to refer to a class or property in ordinary human language or even in technical jargon. This is okay, because the terminology is not the means by which we align disparate data. The URI is. The role of terminology is just to help a relatively restricted set of users who need to be able to interpret the intent of the ontology accurately. That's all.
- "It is useless in OWL to try and have multiple alternative values for rdfs:label since it would be difficult for any tools to distinguish between them and identify whatever preferred label any particular community would like to display to users." In argument 3, you mention alternative annotation properties. That's one viable way to do it. Even in Protege, you can specify which annotation property you want to use to render as the label. Another is to annotate the annotation (e.g., with the source or as terminologically preferred by a particular community). This is easy to add in Protege, and trivial for a tool to query. If a tool cannot do this simple operation, it is a deficient tool.
- No disagreement here: a community certainly can use a custom label annotation property. They could also do something like: define them as sub-properties of rdfs:label (or some other generic label annotation property) to allow an inference-based query of all labels. But this isn't an argument against opaque numeric IRIs. Protege, among other tools, tolerates different "views" on the same underlying ontology. That's what numeric IRIs are trying to promote as well: a shared representation of reality amid inescapably diverse conceptualizations and terminologies used to describe reality.
- Are annotations not queryable in a text editor?
- My advice would be to use better tools.
And additionally, you don't provide any arguments against the many benefits from using neutral numeric IRIs, e.g., that in the social world terminological preferences change rapidly over time, but IRIs never should.
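The label-selection options discussed above (language tags, community-specific annotation properties) can be sketched in a few lines. This is only an illustration under stated assumptions: the `http://example.org/ont/` namespace, the IRI `EX_0000001`, the `army_label` property, and the helper `pick_label` are all hypothetical, and labels are modeled as plain tuples rather than read from a real OWL file.

```python
from typing import NamedTuple, Optional, Sequence

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
# Hypothetical community-specific annotation property (not part of CCO).
ARMY_LABEL = "http://example.org/ont/army_label"
# Hypothetical opaque class IRI.
AIRCRAFT = "http://example.org/ont/EX_0000001"


class LabelTriple(NamedTuple):
    subject: str                # IRI of the labeled class or property
    prop: str                   # annotation property carrying the label
    text: str                   # the label literal
    lang: Optional[str] = None  # language tag, if any


TRIPLES = [
    LabelTriple(AIRCRAFT, RDFS_LABEL, "Aircraft", "en"),
    LabelTriple(AIRCRAFT, RDFS_LABEL, "Luftfahrzeug", "de"),
    LabelTriple(AIRCRAFT, ARMY_LABEL, "Air Platform"),
]


def pick_label(
    triples: Sequence[LabelTriple],
    iri: str,
    props: Sequence[str] = (RDFS_LABEL,),
    langs: Sequence[str] = ("en",),
) -> str:
    """Return the preferred label for iri, falling back to the IRI itself.

    Earlier entries in props and langs win; labels whose language tag is
    not listed rank after all listed languages.
    """
    candidates = [t for t in triples if t.subject == iri and t.prop in props]

    def rank(t: LabelTriple):
        prop_rank = props.index(t.prop)
        lang_rank = langs.index(t.lang) if t.lang in langs else len(langs)
        return (prop_rank, lang_rank)

    best = min(candidates, key=rank, default=None)
    return best.text if best is not None else iri


print(pick_label(TRIPLES, AIRCRAFT))                       # default: English rdfs:label
print(pick_label(TRIPLES, AIRCRAFT, langs=("de", "en")))   # German-first display
print(pick_label(TRIPLES, AIRCRAFT, props=(ARMY_LABEL, RDFS_LABEL)))
```

The same IRI is displayed under three different names depending on the caller's preferences, which is the "different views on the same underlying ontology" point; a real tool would read these annotations from the ontology rather than from a hard-coded list.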
@BrianHaugh I address your latest comments in order below. Please see my March 1 comment above for more of my thoughts on this matter.
- You claim that the semantics provided by the values of rdfs:label undercut the use of alphanumeric IRIs. This simply isn't true. Given that ontologies are semantic representations by nature and the label annotation is specifically designed to accommodate these semantics, a human-readable label ought to be included. But it is the IRI -- not the label -- that sets the "common name for each element". Furthermore, as you state in your third point, it is possible to use preferred labels or other custom annotation properties to capture terminological differences for the same type of entity. Ultimately, the label together with the textual definition, logical definition, and other annotations is what determines the sort of entity represented by that IRI. The label is a small (albeit highly visible) part of the semantics. Finally, as I mentioned in my comment above, you can in fact include and leverage multiple values for rdfs:label by using language tags for each value. There are examples of this publicly available on BioPortal, and the CCO has been extended in this way by some users.
- Including multiple values for rdfs:label will only be problematic (though I wouldn't say "useless") if care isn't taken in how they are handled. See my response to your first point.
- Yes. This is, in fact, the point that the pro-alphanumeric IRI side has been making all along and is the solution to the above 2 issues.
- Working with "raw" RDF in any number of scenarios is still my biggest complaint against switching from meaningful IRIs. I grant that it is often a pain due to the extra effort and lack of automated tools to make sense of the data; however, the question is whether this is sufficient reason to forego the benefits of switching to alphanumeric IRIs. The general consensus is that it is not. Hopefully, the community can develop simple effective tools to alleviate this burden.
- The fact that many tools currently lack or only provide minimal support for leveraging rdfs:label or other annotation properties is unfortunate and is my other concern about switching to alphanumeric IRIs. However, some tools do exist and there are workarounds for a number of application scenarios. Additionally, by committing to this change, tool developers will be increasingly incentivized to improve existing tools or develop new ones -- and we certainly need more and better tools in this domain. Ultimately, I'm confident that this concern will be resolved in due time.
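As a rough idea of the kind of "simple effective tool" that could ease reading raw files, the sketch below appends each opaque identifier's label as a trailing Turtle comment. It is a minimal illustration, not an existing utility: it assumes one statement per line, single-line string literals, and a hypothetical ID pattern `ex:EX_` followed by seven digits, and the names `collect_labels` and `annotate` are made up for this example. A robust version would use a real RDF parser.

```python
import re

# Matches lines of the form:  <subject> rdfs:label "<text>" ...
LABEL_RE = re.compile(r'^\s*(\S+)\s+rdfs:label\s+"([^"]*)"')
# Hypothetical opaque ID pattern; adjust to the ontology's actual scheme.
OPAQUE_RE = re.compile(r'\bex:EX_\d{7}\b')

SAMPLE = """\
@prefix ex: <http://example.org/ont/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
ex:EX_0000001 rdfs:label "Aircraft"@en .
ex:EX_0000002 rdfs:label "Vehicle"@en .
ex:EX_0000001 rdfs:subClassOf ex:EX_0000002 .
"""


def collect_labels(turtle_text: str) -> dict:
    """Map each subject to the first rdfs:label found for it."""
    labels = {}
    for line in turtle_text.splitlines():
        match = LABEL_RE.match(line)
        if match:
            labels.setdefault(match.group(1), match.group(2))
    return labels


def annotate(turtle_text: str) -> str:
    """Append '# id = label' comments to lines that mention opaque IDs.

    Label lines and comment lines are left unchanged, since they are
    already readable or not part of the data.
    """
    labels = collect_labels(turtle_text)
    out = []
    for line in turtle_text.splitlines():
        skip = line.lstrip().startswith("#") or LABEL_RE.match(line)
        # dict.fromkeys removes duplicate IDs while keeping their order.
        ids = [i for i in dict.fromkeys(OPAQUE_RE.findall(line)) if i in labels]
        if ids and not skip:
            note = ", ".join(f"{i} = {labels[i]}" for i in ids)
            line = f"{line}  # {note}"
        out.append(line)
    return "\n".join(out)


print(annotate(SAMPLE))
```

Because the additions are ordinary Turtle comments, the annotated copy should still parse as Turtle; it is meant as a throwaway reading aid, with the unannotated file remaining the source of truth.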
Let me elaborate more on my objections to using opaque URIs in response to some of the replies:
- "Use of rdfs:label has the effect of establishing a common name for each element." The de facto practice in BFO and many of its derivatives, such as prior versions of the CCO, the Cyber Ontology, and the U.S. Army Operational Environment ontology is to provide a single value for rdfs:label annotation properties, which is used as a "standard name," for the corresponding element. This name is cited in the included definition of the class/property. Citing the names/labels of superclasses in definitions is a recommended practice by BFO. These labels and definitions are parts of the standard (if/when it is made a standard). Hence, it seems appropriate to acknowledge them as "standard" names.
  Although one can distinguish different uses of rdfs:label via language tags or other annotation property annotations, that has not been done in those ontologies derived from BFO with which I am familiar (though I understand that some of the OBO foundry ontologies do this). If different language versions are used, do we expect all such variants to be incorporated into future versions of a standard or will only the "standard" English names and definitions be promulgated in a standard such as the proposed CCO? In any case, the proposed CCO even with opaque URIs does not have multi-lingual labels and definitions. So, such a standard will have "standard" human-comprehensible names and definitions in English, at least in the initial release.
- Granted that it is possible to distinguish different labels formulated using rdfs:label by using annotations, such as the language tags. But, not all ontology tools and applications support displaying labels/names based on such tags. And, the language tags will not suffice to distinguish variations in same-language usage among different communities (e.g., different terms used for the same class concept by different armed services).
- Different communities are free to add whatever alternative labels (using different annotation properties) that they would like to an ontology, regardless of whether or not it uses opaque URIs. Opaque URIs are not needed to support alternative labels/names for different communities. There is no great benefit to using opaque URIs in this regard. Such opaque URIs are not needed for any practical purpose, but only serve to address the feelings of some communities that might not like a "standard" English language term for concepts that they refer to differently. Some such communities might also object to the widespread use of English in international journals. Should we start using numeric URIs for concepts cited in journal articles - I don't think so :-).
- When resolving errors or warnings from OWL reasoners, it is frequently necessary to use a text editor to find the source of the problem in the ontology file. But, it can be very difficult to recognize classes and properties that are using numeric URIs when using a text editor. Having an English language term that is readily recognizable greatly facilitates developers' recognition of the content of these files when resolving errors.
  No text editors that I know of will automatically replace URIs with labels and then back again when you save the file :-). It would not really be a text editor if it did that. There is often a need for human developers to be able to read the OWL files in their native format (e.g., RDF/XML or Turtle) in order to find errors and correct them.
- There are also other tools that do not display labels in their user interface. One may have limited or no choice in what tools are used in applications of ontologies. Some projects specify the use of certain tools and some applications are specified as parts of programs which developers have to use. I cited one NLP software tool, which was part of an information extraction project using an ontology based on BFO. It displayed the URIs in its interface, with no option to display labels. Developers and reviewers had no other option but to view the BFO numeric identifiers in this case. Not so bad with the limited number of BFO classes, but would be an incredible pain with all the CCO classes being opaque.
- A community that prefers a different language over English would likely take offense at the BFO use of English throughout for labels, definitions, elucidations, editor's notes, and "axioms". The OBO Foundry even has a principle that "Labels and synonyms should be written in English". English has already been established as the standard language for BFO and many of its descendants (mid-level/domain ontologies). Just making the URIs opaque does very little to address this bias in BFO and related ontologies. Nor is there any need to address this "bias" since English has been recognized as the language of choice for international communications (e.g., in professional international journals).
- I believe that non-opaque, human-readable, identifiers are most widely used in other ontologies, such as Cyc, SUMO, Dublin Core, and FOAF. The Open Biological and Biomedical Ontologies (OBO) Foundry is the only effort with which I am familiar that has actively promoted the use of such opaque URIs for ontology classes and properties. There is no need to follow their approach, which makes raw ontology files practically illegible to humans.
I will attempt to simplify this issue tremendously.
- An international standard should be language agnostic. Imagine if the CCO had been developed by the Ethiopians. Whether it were written in Amharic or in alpha-numerics, like BFO, the effect on me would be the same: I, personally, wouldn’t be able to discern the meaning of the script. So I think language neutrality is a requirement that should be a no-brainer for international standards.
Given the above point, I offer the following additional considerations:
- I totally agree with Brian that the idea of “just use better tools” to address the challenges it will present for us is very naïve. Trying to deal with the BFO alpha-numerics is painful enough. We are NOT looking forward to having to deal with CCO alpha-numerics as well (but we will if they are international standards). Brian already provided a strong argument showing the difficulties so I won’t repeat. I will just summarize with the fact that the suggested conventions add even more complexity to a field that is already too complex for the average person to absorb. Dealing with that complexity takes resources that ultimately cost our user base time and money (everyone reading this is already expensive). Why extend the winter even longer if we don’t have to?
- I think this issue is yet another reason to make the set of terms codified as an international standard as small as possible or at least practical.
- For those who are interested, we don’t plan to make DICO an international standard. If there is a term in there that we think should be a standard, we will recommend it to CUBRC to add to CCO to standardize if they want. Therefore, we will maintain English entity names and labels.
Regards, Forrest
From: Brian A Haugh Sent: Wednesday, October 6, 2021 1:00 AM To: CommonCoreOntology/CommonCoreOntologies Cc: Hare, Forrest B.; Mention Subject: Re: [CommonCoreOntology/CommonCoreOntologies] In IRIs, use opaque identifiers instead of english labels (#105)
A few notes to respond to @BrianHaugh's concerns:
- Terms being used in definitions typically reflect the "ontologist-preferred terminology," which needn't be taken as a "standardized name," much less the only name, and certainly not as the most user-friendly term for human consumption (e.g. "Ratio Measurement Information Content Entity").
- The significance of textual definitions can be misunderstood. A textual definition should express the essence of the entity, but it is still just a linguistic artifact. Accordingly, there can be different valid ways of linguistically expressing the same essence.
- Terminological preferences evolve over time, very rapidly. URIs should not. In fact, they should be permanent, immutable. Accordingly, tying URIs to (current) terminological preferences poses the risk of needing to update the URI later, which is always a breaking change.
- A URI is not for human consumption. It is for machine consumption.
- The fact that other ontology efforts -- e.g., Cyc, SUMO, Dublin Core, FOAF -- use natural language URIs obviously does not tell us that this is in fact best practice. If anything, BFO, CCO, and the OBO Foundry are predicated on the belief that a lot of people are doing this wrong (for example, by conflating ontology and terminology).
- There has not yet been much need for providing annotations in different languages. But this discussion is about the right principle of design, not necessarily immediate pragmatic demands.
- International journals may tend to employ a common natural language, but they obviously do not enforce uniform terminology. Moreover, machines do not use ontologies the same way humans use actual language to communicate, not even in professional or academic settings. So I don't find the cases all that analogous.