LSIDs for taxonomic names live again

Open rdmpage opened this issue 3 years ago • 42 comments

TL;DR You can resolve LSIDs for taxonomic names here: https://lsid.herokuapp.com

Sorry for gatecrashing, but this might be of interest. Given that there are millions of taxonomic names with LSIDs, most of which no longer resolve using the LSID protocol, it's always bothered me that we've let LSIDs die. So, I've made a website Life Science Identifier (LSID) Resolver that serves up the metadata for each LSID for names from three datasets (IPNI, Index Fungorum, and ION). These are all sources that used to support LSIDs, still display LSIDs, and in some cases still make the metadata available using the TDWG LSID vocabulary (if not via the LSID protocol).

The metadata is cached so the LSIDs resolve regardless of whether the source database supports the LSID protocol. Might be fun to compare the metadata from these LSIDs with what any new TNC comes up with. Note that there are some issues with the metadata, including mistakes and/or inconsistencies in the namespaces, and how the XML was constructed. I suspect these occurred because nobody ever actually used it.

I hope to add other LSIDs as time permits, and also depending on whether the database still provides metadata for LSIDs in TDWG LSID RDF.

rdmpage avatar Mar 10 '21 11:03 rdmpage

This is marvelous, @rdmpage ! Do you have any plans to add ZooBank to the list of supported sources? Perhaps there is already a resolver for it somewhere else, but if so, it isn't apparent from the ZooBank website.

baskaufs avatar Mar 10 '21 13:03 baskaufs

@baskaufs Glad you like it! By default I'm concentrating on sources that have RDF XML currently (or recently) available. I'm also biased towards integer identifiers (makes storing the data in chunks a bit easier, the whole thing data and all is in GitHub https://github.com/rdmpage/lsid-cache).

ZooBank stopped resolving LSIDs a long time ago :( If @deepreef restores that feature (even if just the RDF XML output) I could add ZooBank to the list, alternatively I'd have to make my own mapping between the JSON currently served by ZooBank and the TDWG LSID vocabulary, which is possible but slightly undermines the notion that I'm caching authoritative LSID metadata.

Personally I'm still baffled how our community decided that (a) the best identifier for a taxonomic name is an LSID and yet (b) made no attempt to persist either the identifiers or their associated metadata...

rdmpage avatar Mar 10 '21 13:03 rdmpage

FYI I've managed to find a copy of a ZooBank LSID record in XML:

<?xml version="1.0"?>
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:tto="http://rs.tdwg.org/ontology/voc/Specimen#" xmlns:tc="http://rs.tdwg.org/ontology/voc/TaxonConcept#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:tn="http://rs.tdwg.org/ontology/voc/TaxonName#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:tpub="http://rs.tdwg.org/ontology/voc/PublicationCitation#" xmlns:trank="http://rs.tdwg.org/ontology/voc/TaxonRank#" xmlns:tcom="http://rs.tdwg.org/ontology/voc/Common#">
  <tn:TaxonName rdf:about="urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B">
    <dc:title>Ectenopsis mackerrasi</dc:title>
    <owl:versionInfo>1.1.2.1</owl:versionInfo>
    <tn:nameComplete>Ectenopsis mackerrasi</tn:nameComplete>
    <tn:genusPart>Ectenopsis</tn:genusPart>
    <tn:specificEpithet>mackerrasi</tn:specificEpithet>
    <tn:year>1996</tn:year>
    <tn:publication>
      <tpub:PublicationCitation>
        <tpub:publicationType rdf:resource="JournalArticle" />
        <tpub:parentPublication rdf:resource="urn:lsid:zoobank.org:pub:2B273330-A0BE-4BA7-8D41-5F49A5099DFC" />
        <dc:identifier>urn:lsid:zoobank.org:pub:1A71CBE3-0D39-471A-8F05-A5D87573591D</dc:identifier>
        <tpub:authorship>Burger, John F.</tpub:authorship>
        <tpub:year>1996</tpub:year>
        <tpub:title>A new species of Ectenopsis (Paranopsis) (Diptera: Tabanidae) from New Zealand and a key to species of the subgenus Paranopsis</tpub:title>
        <tpub:parentPublicationString>Proceedings of the Entomological Society of Washington</tpub:parentPublicationString>
        <tpub:volume>98</tpub:volume>
        <tpub:number>2</tpub:number>
        <tpub:pages>264-266</tpub:pages>
      </tpub:PublicationCitation>
    </tn:publication>
    <tn:rank rdf:resource="http://rs.tdwg.org/ontology/voc/TaxonRank#Species" />
    <tn:rankString>Species</tn:rankString>
    <tn:nomenclaturalCode rdf:resource="http://rs.tdwg.org/ontology/voc/TaxonName#ICZN" />
  </tn:TaxonName>
  <tpub:PublicationTypeTerm rdf:about="JournalArticle" />
  <tpub:PublicationCitation rdf:about="urn:lsid:zoobank.org:pub:2B273330-A0BE-4BA7-8D41-5F49A5099DFC" />
  <trank:TaxonRankTerm rdf:about="http://rs.tdwg.org/ontology/voc/TaxonRank#Species" />
  <tn:NomenclaturalCodeTerm rdf:about="http://rs.tdwg.org/ontology/voc/TaxonName#ICZN" />
</rdf:RDF>

The current ZooBank JSON API returns this for A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B:

[
  {
    "tnuuuid": "a1ae7a00-32c6-4510-a1d6-6ddda9129d8b",
    "OriginalReferenceUUID": "1a71cbe3-0d39-471a-8f05-a5d87573591d",
    "protonymuuid": "a1ae7a00-32c6-4510-a1d6-6ddda9129d8b",
    "label": "mackerrasi Burger 1996",
    "value": "mackerrasi Burger 1996",
    "lsid": "urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B",
    "parentname": "Ectenopsis",
    "namestring": "mackerrasi",
    "rankgroup": "Species",
    "usageauthors": "Burger",
    "taxonnamerankid": "70",
    "parentusageuuid": "de501acd-28db-42b9-9ed8-f1ff0926bda5",
    "cleanprotonym": "Ectenopsis mackerrasi Burger, 1996",
    "NomenclaturalCode": "ICZN"
  }
]

So, the mapping is less than straightforward :(
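To make the mismatch concrete, here is a minimal sketch in Python of what such a mapping might look like. It is not anything ZooBank or the resolver actually runs: the function name `zoobank_json_to_tdwg` and the choice of which JSON key feeds which TDWG term are assumptions made purely for illustration, and it only handles species-rank names like the example.

```python
# Namespace taken from the RDF/XML record above.
TN_NS = "http://rs.tdwg.org/ontology/voc/TaxonName#"


def zoobank_json_to_tdwg(record):
    """Map one ZooBank JSON record to TDWG TaxonName terms (illustrative only)."""
    genus = record["parentname"]    # assumption: the parent is the genus
    epithet = record["namestring"]
    # "label" looks like "<epithet> <author> <year>"; assume the year is the last token.
    year = record["label"].rsplit(" ", 1)[-1]
    return {
        "rdf:about": record["lsid"],
        "tn:nameComplete": f"{genus} {epithet}",
        "tn:genusPart": genus,
        "tn:specificEpithet": epithet,
        "tn:year": year,
        "tn:rankString": record["rankgroup"],
        "tn:nomenclaturalCode": TN_NS + record["NomenclaturalCode"],
        # The JSON only carries the publication UUID, so the LSID is rebuilt from it.
        "tn:publication": "urn:lsid:zoobank.org:pub:"
        + record["OriginalReferenceUUID"].upper(),
    }


# Subset of the JSON record shown above.
record = {
    "OriginalReferenceUUID": "1a71cbe3-0d39-471a-8f05-a5d87573591d",
    "label": "mackerrasi Burger 1996",
    "lsid": "urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B",
    "parentname": "Ectenopsis",
    "namestring": "mackerrasi",
    "rankgroup": "Species",
    "NomenclaturalCode": "ICZN",
}
mapped = zoobank_json_to_tdwg(record)
```

Even in this toy version, several things in the XML record (owl:versionInfo, the article title, journal, volume, number and pages) have no counterpart in the JSON at all, so they could not be reconstructed from the API response alone.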

rdmpage avatar Mar 10 '21 14:03 rdmpage

Thanks for tagging me on this! Like @rdmpage , I have been bothered by the state of LSIDs -- but from the opposite direction. I am bothered that we still mint them as though the community uses them (in the way they were intended to be used). The reason I haven't bothered to maintain the LSID resolver for ZooBank is that there was basically only one person who ever accessed them using the LSID resolution protocol (hint: it's the same person who started this thread). I'm more than happy to get it working again, if there is some desire to actually use that protocol for resolving content. @rdmpage : you used to have a WONDERFUL LSID resolution testing service -- is that still functional? If you can point me to that service (which I'll need to test the ZooBank LSID resolver service), I'll get the ZooBank LSID resolver working again.

The last time we discussed this, we ended up agreeing to abandon the LSID protocol, and instead create an RSS feed: http://zoobank.org/rss/rss.xml That continued to work up until last July, when we moved ZooBank to a new server and I forgot to update the login credentials for the service, so it stopped working. And the outcry from the user community in response to it no longer functioning can best be described as "deafening silence". But it was super easy to update the login credentials for the service just now, so it's now working again. I won't know for 24 hours whether it is correctly refreshing every 24 hours. But if it's not, I expect the outcry from the community to be the same as it was last July.

Snarky commentary aside, this is actually PERFECT timing. I had a long chat this morning with the COL ISG and one of the key topics was mobilizing ZooBank and GNUB to be more tightly integrated with COL/GBIF ChecklistBank. We had done a lot of work on that before, which stalled on November 18 2019 when our server system was hit by ransomware. After we solved that issue, we found ourselves in the middle of a global pandemic and re-adjusted priorities. As it happens, the cycle of priorities has looped around again such that ZooBank is back near the top.

The key next step is to get these two datasets live again: https://www.gbif.org/dataset/c8227bb4-4143-443f-8cb2-51f9576aff14 https://www.gbif.org/dataset/34a96ebe-e51c-4222-9d08-5c2043c39dec

IPT is already up and running on our server for both, and the last step needed to flip the switch is to find a moment for @mdoering and me to connect and hack the config file and make them live again (perhaps next week). ZooBank was last refreshed on the day of the ransomware attack, and GNUB has been down since May 2015 (the outcry from that was the same as the RSS feed going down). Both should be up and live again by next week.

So, after getting the RSS feed and the two IPT datasets up and running again, this is my question to @rdmpage and @baskaufs and anyone else who is interested: What would you like next?

  • Re-establish the LSID resolver?
  • Harmonize the JSON in the ZooBank API with the LSID record?
  • Harmonize the RSS content feed with the API and/or LSID record?
  • Normalize them all to the DWC template published through the IPT?
  • Something else?
  • All of the above?

Tell me what you want, and I'll get 'er done. I don't even mind doing it for a client base of one (or two or three) -- I just want to make it easy to access and use the content. The reason the JSON API looks so clunky is that it never got past the "proof of concept" phase. If you give me the exact JSON output template you want, I can have that up and running for you. Of course, if I change the existing API, it will break any code that was built around the existing structure. But I have no doubt what the outcry to that will be (more deafening silence), so I'm ready to completely change the output template of all of these data access services (IPT, RSS, JSON API, and even LSID -- if people really want that) so they provide exactly the same content.

Let's do this.

deepreef avatar Mar 10 '21 19:03 deepreef

By the way, another reason this is perfect timing is that we're planning the next-generation ZooBank in the context of the 5th Edition of the ICZN Code. One of the items on that list of improvements was to (once and for all) abandon the LSID protocol for identifiers. We're still committed to maintaining the ones we've already minted into perpetuity (at least as identifiers; if not as an LSID resolution service); but after a certain date in the year 202X, the plan is to only issue the UUIDs. This approach was based on the assumption that LSIDs were dead in our community. But based on this thread, I'm wondering if news of the demise of LSIDs has been greatly exaggerated... Do we want to recommit to them? Or should we drive the wooden stake through its heart once and for all and embrace something else (my preference: UUIDs for everything, wrapped within the DOI dereferencing infrastructure)?

deepreef avatar Mar 10 '21 19:03 deepreef

One final thing:

Personally I'm still baffled how our community decided that (a) the best identifier for a taxonomic name is an LSID and yet (b) made no attempt to persist either the identifiers or their associated metadata...

You and me both! We all got together for two different workshops to discuss it. @rdmpage gave a great presentation on why LSIDs suck, DOIs suck and PURLs suck (I think those were the three -- I just remember that they all definitely sucked). At the end of those workshops, we all decided that LSIDs sucked the least, so we decided to go for it (largely because they were developed and backed by "IBM" -- so clearly were going to be around for the long haul -- yeah, right...). Lee Belbin convinced Paul Kirk, Nicky Nicholson and me to embrace LSIDs as a way to kick-start the community interest and understanding in them by showcasing them in IF, IPNI and ZooBank (respectively). COL wasn't far behind in adopting them. It all seemed so promising at the time.

Sigh

deepreef avatar Mar 10 '21 19:03 deepreef

@deepreef Hi Rich, from my perspective it would be great to have the LSID XML available, even if just via an API call rather than full blown LSID resolution. That way I could cache it and have essentially instantaneous access to the four main LSID providers. The LSID tester you mention is long dead, but some of its code lives on in http://www.lsid.info/resolver/ which could be used to help debug LSIDs.

rdmpage avatar Mar 10 '21 19:03 rdmpage

On the other things it seems to me inevitable that any serious attempt to issue identifiers for taxonomic names should use DOIs. I have never liked UUIDs, I think they are anti-user and send exactly the wrong message if you want to encourage adoption (identifiers are ugly, for computers only, and disposable), but I know @deepreef and I will never agree on this ;)

rdmpage avatar Mar 10 '21 19:03 rdmpage

@deepreef As far as I'm concerned LSIDs are dead and it does not seem like it is worth maintaining an infrastructure that mints any more of them. I'm mostly concerned about them as a sort of archival issue. In other words, is there ANY way to recover the information they were supposed to provide if someone were to read an old paper that used them and wanted to get whatever information they were supposed to provide. That is what @rdmpage's tool does, subject to actually having access to the underlying data.

baskaufs avatar Mar 10 '21 19:03 baskaufs

OK, thanks @rdmpage -- so it's not so much about the LSID resolution protocol as it is to get the content in XML format similar to the LSID template? That should be a lot easier, I imagine.

Above you gave two examples of output, the LSID template and the JSON template. Again, the latter was just a proof of concept that we never finished (mostly Rob Whitton wanting to get his head around how to implement JSON). After we built it, we put out the call for feedback on how to modify the structure to represent it in ways that people would find useful. Again, the response was deafening silence, so we never followed up with it.

So... let's assume that nobody is using the LSID resolution protocol, so we don't need to resurrect that. Let's also assume that nobody is using the ZooBank APIs, so I can re-develop those without breaking anyone's existing code (or I can keep a legacy version if people really want and use that crappy JSON template). And finally, let's assume I will commit to doing the necessary work to make it happen (like I said, the timing is good as I'm mucking around with the IPT now anyway).

If we assume all of those things, then it makes a lot of sense to me to harmonize at least the output content for IPT, XML and JSON. IPT is the only one following a real, active standard (DwC), so let's use that as the "core" content. DwC lacks a literature standard (something we've always wanted) so maybe I can just use the terms as they are in the LSID template. My thinking is that IPT will continue to do its thing (via http://ipt.zoobank.org:8080/ipt -- not quite functional yet, but it will be after I sync with @mdoering). Then I'll base two APIs off the same content, one that outputs in XML, and one that outputs in JSON.

Here's what I need help with:

  1. Someone needs to provide me with templates for both XML and JSON with a sample record showing me exactly what you want the output to look like. I have a rough idea, but I need the actual consumers of this stuff to tell me what they want -- rather than me trying to guess and hoping what I build meets your needs.

  2. Help me decide the endpoints to access the content. I'm a bit out of my depth on best practices here, but I think it's important to understand that ZooBank content is a subset of GNUB content. As discussed in another issue on GitHub, limiting the content to ZooBank is artificial, as links to parentUsageID don't work unless the genus and the species are both established within the same publication. I think a much better approach is to simply present all the content as GNUB, including both the ZooBank stuff and non-ZooBank stuff. So by my primitive thinking, the endpoints for these services would be something like:

http://gnub.org/a1ae7a00-32c6-4510-a1d6-6dddA9129d8b.xml and http://gnub.org/a1ae7a00-32c6-4510-a1d6-6dddA9129d8b.json [UUIDs would be case-insensitive for the service, so it wouldn't matter if you used the uppercase or lowercase versions of the UUIDs].

Is that the best way to do it? Or would it be better to go with something like: http://gnub.org/xml/a1ae7a00-32c6-4510-a1d6-6dddA9129d8b and http://gnub.org/json/a1ae7a00-32c6-4510-a1d6-6dddA9129d8b

Or maybe:

http://gnub.org/tnu/a1ae7a00-32c6-4510-a1d6-6dddA9129d8b.xml and http://gnub.org/tnu/a1ae7a00-32c6-4510-a1d6-6dddA9129d8b.json [e.g., if we want to have different services for tnus vs. pubs vs. authors]

Maybe it doesn't matter (in which case I'll go with the first option, because it seems clean to me); or maybe it does matter (in which case someone needs to tell me what it should be).

I know it's bad GitHub etiquette to write such long posts, but you all know me well enough to know that I don't care about GitHub etiquette (I'm going to spell this stuff out explicitly no matter what, so get over it). But I'm serious about rebuilding this stuff right -- meaning in a way that is useful enough that the user base may eventually expand beyond two or three clients.

deepreef avatar Mar 10 '21 20:03 deepreef

@baskaufs : Yes! As I said, we're committed to maintaining the "identity" part of LSIDs into perpetuity (if not the resolution protocol part). That was one of the things I wanted to achieve through http://bioguid.org Taking the example from @rdmpage : http://bioguid.org/searchIdentifier?q=a1ae7a00-32c6-4510-a1d6-6dddA9129d8b&format=html

BTW, that is another @rdmpage - inspired service that almost got off the ground, then went into hibernation for a few years, but sometime within the next year or two I plan to bring it back to life again (with gusto!)

But that's a topic for another thread...

deepreef avatar Mar 10 '21 20:03 deepreef

@deepreef From the perspective of the LSID archive ideally XML like the example I showed above https://github.com/tdwg/tnc/issues/117#issuecomment-795527382 (which was actually retrieved from ZooBank when its LSID service was live). The structure and vocabulary of that file closely match IPNI, Index Fungorum, and ION, which makes integrating all sources of data much easier.

If nothing else, if we get ZooBank added it means that the millions of LSIDs for names in the wild, including those which presumably have some nomenclatural significance would all be "resolvable".

So, would it be possible to serve XML like https://github.com/tdwg/tnc/issues/117#issuecomment-795527382 for each taxon name? Maybe the original code for this still exists in the ZooBank source code? I have no preference for API interface, presumably something like http://zoobank.org/NomenclaturalActs.xml/6EA8BB2A-A57B-47C1-953E-042D8CD8E0E2 would be consistent with the current API?

rdmpage avatar Mar 10 '21 21:03 rdmpage

OK, I'll use that as a starting point. You said it "closely matches" IPNI, IF and ION. Can we bump that up to "exactly matches" to make it even easier? If I'm going to need to build it anyway, I might as well add any additional tweaks to improve it in any way you wish. I'll start with the template as you presented it above, but I assume it won't break anything if I add additional properties (as long as I don't change the existing ones) -- is that a safe assumption?

Framing it as ZooBank is artificially constrictive, and will lead to broken links from parentUsageID (assuming I add that property to the XML output). Why not apply it to the entire GNUB content? Here are some comparisons of numbers:

Class               ZooBank    GNUB
Protonym TNUs       279,245    385,967
Non-Protonym TNUs   0          831,532
References          121,742    156,341

All of the ZooBank content is included among the GNUB content. The only difference is that the ZooBank records have both a UUID and an LSID (and also a little bit of metadata, such as when the content was registered), whereas the GNUB records only have the UUIDs. If we could just add one property to indicate that a given record was registered in ZooBank, then it seems to me that the GNUB content would make the most sense to scope the service for.

I guess it wouldn't hurt to do both as separate services (one at zoobank.org, and one at gnub.org), but that seems pretty redundant when the gnub version already includes everything in the zoobank version.

Yes, the original code does already exist, so it won't be too hard to resurrect it exactly as is.

One last thing, though: we're talking about "resolving" the LSIDs, but your example uses the UUID. My assumption is that both will work, but my question is whether the UUID should be presented in the output as a separate identifier, or whether to just leave it to the end-user to harvest it from the LSID.

So, just to be clear: my current plan is to implement an interface that returns the exact same XML as you listed above, but directly (rather than through the LSID protocol). I'll make it so you get the same results for any of these: http://zoobank.org/NomenclaturalActs.xml/6ea8bb2a-a57b-47c1-953e-042d8cd8e0e2 http://zoobank.org/NomenclaturalActs.xml/6EA8BB2A-A57B-47C1-953E-042D8CD8E0E2 http://zoobank.org/NomenclaturalActs.xml/urn:lsid:zoobank.org:act:6EA8BB2A-A57B-47C1-953E-042D8CD8E0E2
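One way to get identical results for all three URL forms is to reduce whatever comes after the last slash to a single canonical key before looking anything up. The following is a minimal sketch in Python under that assumption; `normalize_identifier` is a hypothetical helper name, and nothing here reflects how the ZooBank service is actually implemented.

```python
import re
import uuid

# Optional ZooBank LSID prefix (act, pub or author), matched case-insensitively.
LSID_PREFIX = re.compile(r"^urn:lsid:zoobank\.org:(?:act|pub|author):", re.IGNORECASE)


def normalize_identifier(raw):
    """Return the canonical lowercase UUID for a bare UUID or a ZooBank LSID."""
    bare = LSID_PREFIX.sub("", raw.strip())
    # uuid.UUID raises ValueError for anything that is not a well-formed UUID.
    return str(uuid.UUID(bare))


forms = [
    "6ea8bb2a-a57b-47c1-953e-042d8cd8e0e2",
    "6EA8BB2A-A57B-47C1-953E-042D8CD8E0E2",
    "urn:lsid:zoobank.org:act:6EA8BB2A-A57B-47C1-953E-042D8CD8E0E2",
]
keys = {normalize_identifier(form) for form in forms}
```

All three forms collapse to the same key, so the lookup code only ever has to deal with one representation, and malformed input fails early with a `ValueError` instead of reaching the database.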

Once I get that working, then we can move on to the next questions:

  1. Should I make the JSON API provide the same content and terms?
  2. Should we create similar services for all of GNUB content?
  3. What's the best interface system to use for the new (GNUB) APIs?

deepreef avatar Mar 11 '21 00:03 deepreef

@deepreef From my perspective I'd just like the ZooBank LSIDs (I think of my services as a "wayback machine" for LSIDs). So my preference is not to include additional links to GNUB, but obviously that's up to you. I'd only harvest ZooBank LSIDs as they are the only ones that are likely to be in the wild (e.g., in publications or referred to in external databases such as Wikidata).

Regarding XML, the original example I gave above could be tweaked as it has some issues. In particular, it doesn't link the publication to the name (except indirectly via a bnode). At the moment you have something like this:

<?xml version="1.0"?>
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:tto="http://rs.tdwg.org/ontology/voc/Specimen#" xmlns:tc="http://rs.tdwg.org/ontology/voc/TaxonConcept#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:tn="http://rs.tdwg.org/ontology/voc/TaxonName#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:tpub="http://rs.tdwg.org/ontology/voc/PublicationCitation#" xmlns:trank="http://rs.tdwg.org/ontology/voc/TaxonRank#" xmlns:tcom="http://rs.tdwg.org/ontology/voc/Common#">
  <tn:TaxonName rdf:about="urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B">
    <dc:title>Ectenopsis mackerrasi</dc:title>
    <tn:publication>
      <tpub:PublicationCitation>
        <tpub:publicationType rdf:resource="JournalArticle" />
        <tpub:parentPublication rdf:resource="urn:lsid:zoobank.org:pub:2B273330-A0BE-4BA7-8D41-5F49A5099DFC" />
        <dc:identifier>urn:lsid:zoobank.org:pub:1A71CBE3-0D39-471A-8F05-A5D87573591D</dc:identifier>
        <tpub:title>A new species of Ectenopsis (Paranopsis) (Diptera: Tabanidae) from New Zealand and a key to species of the subgenus Paranopsis</tpub:title>
      </tpub:PublicationCitation>
    </tn:publication>
  </tn:TaxonName>
  <tpub:PublicationCitation rdf:about="urn:lsid:zoobank.org:pub:2B273330-A0BE-4BA7-8D41-5F49A5099DFC" />
</rdf:RDF>

whereas I think you want something like this:

<?xml version="1.0"?>
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:tto="http://rs.tdwg.org/ontology/voc/Specimen#" xmlns:tc="http://rs.tdwg.org/ontology/voc/TaxonConcept#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:tn="http://rs.tdwg.org/ontology/voc/TaxonName#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:tpub="http://rs.tdwg.org/ontology/voc/PublicationCitation#" xmlns:trank="http://rs.tdwg.org/ontology/voc/TaxonRank#" xmlns:tcom="http://rs.tdwg.org/ontology/voc/Common#">
  <tn:TaxonName rdf:about="urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B">
    <dc:title>Ectenopsis mackerrasi</dc:title>
    <tn:publication>
      <rdf:Description rdf:about="urn:lsid:zoobank.org:pub:1A71CBE3-0D39-471A-8F05-A5D87573591D">
        <rdf:type rdf:resource="http://rs.tdwg.org/ontology/voc/PublicationCitation#PublicationCitation"/>
        <dc:identifier>urn:lsid:zoobank.org:pub:1A71CBE3-0D39-471A-8F05-A5D87573591D</dc:identifier>
        <tpub:title>A new species of Ectenopsis (Paranopsis) (Diptera: Tabanidae) from New Zealand and a key to species of the subgenus Paranopsis</tpub:title>
      </rdf:Description>
    </tn:publication>
  </tn:TaxonName>
</rdf:RDF>

The difference is that now we are explicitly making the link between the taxon name and publication LSIDs. RDF XML is horrible, the W3C validator is useful for figuring out if you're doing it right (it took me a few goes).
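For anyone who wants a quick local check in addition to the W3C validator, here is a rough Python sketch that lists which publication each name points at. It is plain XML parsing with the standard library, not a real RDF parser, and the function name `name_publication_links` plus the trimmed-down sample document are my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TN = "http://rs.tdwg.org/ontology/voc/TaxonName#"

# Trimmed version of the second example above (only the linking parts).
xml_text = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:tn="http://rs.tdwg.org/ontology/voc/TaxonName#">
  <tn:TaxonName rdf:about="urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B">
    <tn:publication>
      <rdf:Description rdf:about="urn:lsid:zoobank.org:pub:1A71CBE3-0D39-471A-8F05-A5D87573591D"/>
    </tn:publication>
  </tn:TaxonName>
</rdf:RDF>"""


def name_publication_links(text):
    """Return (name identifier, publication identifier) pairs found in the document."""
    root = ET.fromstring(text)
    links = []
    for name in root.findall(f"{{{TN}}}TaxonName"):
        subject = name.get(f"{{{RDF}}}about")
        for prop in name.findall(f"{{{TN}}}publication"):
            for node in prop:
                # A child without rdf:about is a blank node, so this yields None for it.
                links.append((subject, node.get(f"{{{RDF}}}about")))
    return links


links = name_publication_links(xml_text)
```

Run against the first example, the second element of the pair comes back as `None`, which is exactly the bnode problem: the publication LSID is only present as a dc:identifier literal, not as the identifier of the node itself.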

The sooner we all move to JSON-LD and Bioschemas the better ;)

rdmpage avatar Mar 11 '21 08:03 rdmpage

two different workshops to discuss it. @rdmpage gave a great presentation on why LSIDs suck, DOIs suck and PURLs suck (I think those were the three -- I just remember that they all definitely sucked).

Are the presentations from these workshops or the presentation by @rdmpage still available somewhere, by any chance?

cboelling avatar Mar 11 '21 09:03 cboelling

@cboelling This was 2005-2006 as I recall, and whatever I said then is probably stuck somewhere on a ZIP file or a DVD! My recollection at the time was that we looked at DOIs, Handles, PURLs, and LSIDs.

The discussion was heavily driven by costs, so DOIs were seen as problematic as they were expensive. Ironically, DOIs were already in use at the time by NamesforLife (N4L) a company set up by George M. Garrity (who was at the meeting) to manage bacterial names and taxonomy. For example, doi:10.1601/nm.3093 is the name Escherichia coli, and doi:10.1601/tx.3093 is the corresponding taxon. Imagine if we'd gone down this route and had DOIs for every Eukaryote taxonomic name... oh well.

Handles are DOIs without the branding and with minimal costs, but you have to manage them using clunky software. PURLs just move managing persistence somewhere else using someone else's brand and worse tools. LSIDs had the advantage of being free, they keep your organisation brand, and by serving RDF they forced nomenclators to standardise on a data format (the TDWG LSID vocabulary). But their dependency on messing with DNS and using SOAP made them beyond the reach of many biodiversity developers.

As is typical in these discussions when the participants have no money, the free solution won. If you don't value the solution (i.e., won't spend money on it) then why would anyone else value it?

Personally I think we missed the absolutely key challenges, which are to:

  1. provide value for users (what do you get if you resolve the identifier?)
  2. provide services on top of the identifier (what else does the identifier give me?)
  3. engineer network effects to encourage identifier adoption (so that people feel left out if they don't use the identifier).

DOIs are the shiny example of doing this right, LSIDs, not so much. The challenge is to make sure you have 1-3, once you have that then the actual identifier technology doesn't matter so much (but of course, some have brand recognition, which is why DOIs are taking over the world).

rdmpage avatar Mar 11 '21 12:03 rdmpage

@rdmpage : EXCELLENT! This is exactly the sort of feedback I was hoping for.

OK, I decided sleep wasn't necessary tonight, so I went ahead and built version 1 of the service, incorporating your requested tweaks. I made a couple of other minor changes:

  • authorship is now included in the dc:title for the name.
  • I added dc:identifier for the name as well as the pub
  • I added your suggested publication structure, but for now I also left the original <tpub:PublicationCitation>...</tpub:PublicationCitation> content in place so that the parsed reference citation data are included. I can obviously remove these if they represent a problem -- which I think they do, based on the results from the W3C Validator.

In any case, have a look and let me know if this works for your needs: http://zoobank.org/NomenclaturalActs.xml/A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B The heavy lifting is done, so modifications are super easy to make from here.

I have NOT tested this extensively! I tried to trap for ampersands and html tags and whatnot, but I might have missed some, so there may be errors. Please let me know if you find problematic records.
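For reference, the kind of trapping meant here can be sketched in a few lines of Python using the standard library's `xml.sax.saxutils.escape`, which replaces `&`, `<` and `>` with their entities. The helper name `title_element` and the sample title are made up for illustration, and this is not the code the service uses.

```python
from xml.sax.saxutils import escape


def title_element(title):
    """Wrap a raw title string in a tpub:title element, escaping XML-special characters."""
    return f"<tpub:title>{escape(title)}</tpub:title>"


# Hypothetical title containing an ampersand and embedded HTML italics.
element = title_element("Notes on A & B <i>Aus bus</i>")
```

Escaping everything this way keeps the output well-formed, at the cost of the italics showing up as literal `<i>` text to whoever consumes the record; stripping the tags before escaping would be the alternative.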

Questions:

  1. I'm assuming that I should only include tags for actual content, correct? If there is no volume (for example), then I should not include an empty <tpub:volume></tpub:volume> pair of tags -- correct?

  2. Do I want to include additional dc:identifiers when I have them? E.g. include the uuid separately from the LSID? Include DOIs when I have them for the pubs? Include other identifiers when I have them for the other stuff?
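On the first question, leaving out empty elements is easy to do at serialization time. Here is a minimal Python sketch of that approach, assuming the citation fields arrive as a simple dictionary; the function name `citation_element` and the input shape are hypothetical, and only the tpub namespace URI is taken from the records in this thread.

```python
import xml.etree.ElementTree as ET

TPUB = "http://rs.tdwg.org/ontology/voc/PublicationCitation#"
ET.register_namespace("tpub", TPUB)


def citation_element(fields):
    """Build a tpub:PublicationCitation element, skipping fields with no content."""
    citation = ET.Element(f"{{{TPUB}}}PublicationCitation")
    for term, value in fields.items():
        if value is None or not str(value).strip():
            continue  # no content, so emit no element at all
        ET.SubElement(citation, f"{{{TPUB}}}{term}").text = str(value)
    return citation


# The issue number is deliberately left empty here to show it being dropped.
xml_out = ET.tostring(
    citation_element({"volume": "98", "number": "", "pages": "264-266"}),
    encoding="unicode",
)
```

The serialized result contains tpub:volume and tpub:pages but no tpub:number, so consumers never have to distinguish "empty string" from "not known".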

On a final note: Within the next couple years (coinciding with Code-5), ZooBank will likely stop wrapping the uuids within the cumbersome and unnecessary LSID prefixes. From that point forward, the plain uuids will be in the wild (they already are -- they just happen to be prefixed by the LSID stuff).

OK, probably time for some sleep now.

deepreef avatar Mar 11 '21 12:03 deepreef

Imagine if we'd gone down this route and had DOIs for every Eukaryote taxonomic name... oh well.

It's not too late! I can always replace the urn:lsid:zoobank.org:[pub|act|author]: prefix with a 10.xxxxx/ prefix. All I need to do is get a xxxxx for ZooBank. Right?

deepreef avatar Mar 11 '21 12:03 deepreef

@deepreef Cool, I will take a look. From my perspective, in a ZooBank LSID it's not the LSID prefix that is cumbersome... it's the UUID. I think if you (a) adopt DOIs for names and (b) drop the UUID and have a nice short user-friendly string (can be opaque) you would do wonders for the adoption of persistent identifiers for zoological names.

rdmpage avatar Mar 11 '21 12:03 rdmpage

I agree. Even though it feels silly I have the same reservation for UUIDs. That's why we decided in COL to use short alphanumerical strings that try not to resemble real words and avoid easily confused char pairs: https://github.com/CatalogueOfLife/backend/issues/491

They can also be converted to ints for a more memory or db friendly incarnation.
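As a rough illustration of the idea (not the actual COL scheme, which is described in the linked issue), here is a Python sketch of converting between integers and short strings over a restricted alphabet. The alphabet below is a made-up example that drops vowels and the easily confused 0/O and 1/I pairs.

```python
# Hypothetical alphabet: digits 2-9 plus consonants, so no vowels and no 0/O or 1/I.
ALPHABET = "23456789BCDFGHJKLMNPQRSTVWXYZ"
BASE = len(ALPHABET)
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(n):
    """Encode a non-negative integer as a short string over ALPHABET."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, remainder = divmod(n, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode(text):
    """Decode a string produced by encode() back to its integer."""
    n = 0
    for ch in text:
        n = n * BASE + INDEX[ch]  # KeyError for characters outside ALPHABET
    return n


short_id = encode(123456)
```

Because vowels are excluded, the generated strings cannot spell ordinary words, and the integer form remains available for compact storage or indexing.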

mdoering avatar Mar 11 '21 12:03 mdoering

OK @deepreef I'm hoping that you're getting some sleep now ;)

Here is my version of what ZooBank XML should look like, with comments to explain why I've made the changes.

<?xml version="1.0"  encoding="UTF-8"?>
<rdf:RDF 
    xmlns:dc="http://purl.org/dc/elements/1.1/" 
    xmlns:owl="http://www.w3.org/2002/07/owl#" 
    xmlns:tto="http://rs.tdwg.org/ontology/voc/Specimen#" 
    xmlns:tc="http://rs.tdwg.org/ontology/voc/TaxonConcept#" 
    xmlns:dcterms="http://purl.org/dc/terms/" 
    xmlns:tn="http://rs.tdwg.org/ontology/voc/TaxonName#" 
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
    xmlns:tpub="http://rs.tdwg.org/ontology/voc/PublicationCitation#" 
    xmlns:trank="http://rs.tdwg.org/ontology/voc/TaxonRank#" 
    xmlns:tcom="http://rs.tdwg.org/ontology/voc/Common#">
    <tn:TaxonName rdf:about="urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B">
        <dc:title>Ectenopsis mackerrasi Burger, 1996</dc:title>
        <dc:identifier>urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B</dc:identifier>
        <owl:versionInfo>1.1.2.1</owl:versionInfo>
        <tn:nameComplete>Ectenopsis mackerrasi</tn:nameComplete>
        <tn:genusPart>Ectenopsis</tn:genusPart>
        <tn:specificEpithet>mackerrasi</tn:specificEpithet>
        <tn:year>1996</tn:year>
        <!-- there isn't any such term as tn:publication, even though Index Fungorum uses it, it should be tcom:publishedInCitation -->
        <!-- <tn:publication> -->
        <tcom:publishedInCitation>
            <rdf:Description rdf:about="urn:lsid:zoobank.org:pub:1A71CBE3-0D39-471A-8F05-A5D87573591D">
                <rdf:type rdf:resource="http://rs.tdwg.org/ontology/voc/PublicationCitation#PublicationCitation"/>
                <dc:identifier>urn:lsid:zoobank.org:pub:1A71CBE3-0D39-471A-8F05-A5D87573591D</dc:identifier>
                <tpub:title>A new species of Ectenopsis (Paranopsis) (Diptera: Tabanidae) from New Zealand and a key to species of the subgenus Paranopsis</tpub:title>
            <!-- the rdf:Description tag encloses everything about the publication, and already says it is of type tpub:PublicationCitation -->     
            <!-- </rdf:Description>
            <tpub:PublicationCitation> -->
                <!-- need to add namespace for publication type -->
                <tpub:publicationType rdf:resource="http://rs.tdwg.org/ontology/voc/PublicationCitation#Journal Article" />
                <tpub:parentPublication rdf:resource="urn:lsid:zoobank.org:pub:2B273330-A0BE-4BA7-8D41-5F49A5099DFC" />
                <tpub:authorship>Burger, John F.</tpub:authorship>
                <tpub:year>1996</tpub:year>
                <tpub:title>A new species of Ectenopsis (Paranopsis) (Diptera: Tabanidae) from New Zealand and a key to species of the subgenus Paranopsis</tpub:title>
                <tpub:parentPublicationString>Proceedings of the Entomological Society of Washington,  (Proc. Ent. Soc. Wash.)</tpub:parentPublicationString>
                <tpub:volume>98</tpub:volume>
                <tpub:number>2</tpub:number>
                <tpub:pages>264-266</tpub:pages>
            <!-- </tpub:PublicationCitation> -->
                </rdf:Description>
        <!-- </tn:publication> -->
        </tcom:publishedInCitation>
        <tn:rank rdf:resource="http://rs.tdwg.org/ontology/voc/TaxonRank#Species" />
        <tn:rankString>Species</tn:rankString>
        <tn:nomenclaturalCode rdf:resource="http://rs.tdwg.org/ontology/voc/TaxonName#ICZN" />
    </tn:TaxonName>
    <!-- These are all superfluous and are outside the scope of the document (i.e., they don't refer to the tn:TaxonName) -->
    <!--
    <tpub:PublicationTypeTerm rdf:about="Journal Article" />
    <tpub:PublicationCitation rdf:about="urn:lsid:zoobank.org:pub:2B273330-A0BE-4BA7-8D41-5F49A5099DFC" />
    <trank:TaxonRankTerm rdf:about="http://rs.tdwg.org/ontology/voc/TaxonRank#Species" />
    <tn:NomenclaturalCodeTerm rdf:about="http://rs.tdwg.org/ontology/voc/TaxonName#ICZN" />
    -->
</rdf:RDF>

I've also made it into a gist https://gist.github.com/rdmpage/ea25baf487a17af4a2184f0ca5bef98b and you can look at the revisions to see the steps I took to change it. The RDF now validates.

Biggest change was to tidy up the publication, and use the correct TDWG term tcom:publishedInCitation ("tn:publication" isn't a thing, even though Index Fungorum uses it). There was also some stuff at the end of the document that needed to go. I'd forgotten just how awful RDF/XML is to work with.
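
To show what a consumer might do with metadata shaped like the template above, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The embedded document is a trimmed copy of the template, and the `summarize` helper and its output keys are made up for this example. ElementTree is a plain XML parser, not an RDF parser, so the sketch assumes this exact nesting; a proper RDF library would cope better with other serializations of the same data.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map, copied from the rdf:RDF element in the template.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "tn": "http://rs.tdwg.org/ontology/voc/TaxonName#",
    "tcom": "http://rs.tdwg.org/ontology/voc/Common#",
    "tpub": "http://rs.tdwg.org/ontology/voc/PublicationCitation#",
}

# Trimmed copy of the template, for illustration only.
RDF_XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:tn="http://rs.tdwg.org/ontology/voc/TaxonName#"
    xmlns:tpub="http://rs.tdwg.org/ontology/voc/PublicationCitation#"
    xmlns:tcom="http://rs.tdwg.org/ontology/voc/Common#">
  <tn:TaxonName rdf:about="urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B">
    <tn:nameComplete>Ectenopsis mackerrasi</tn:nameComplete>
    <tn:year>1996</tn:year>
    <tcom:publishedInCitation>
      <rdf:Description rdf:about="urn:lsid:zoobank.org:pub:1A71CBE3-0D39-471A-8F05-A5D87573591D">
        <tpub:year>1996</tpub:year>
        <tpub:volume>98</tpub:volume>
      </rdf:Description>
    </tcom:publishedInCitation>
  </tn:TaxonName>
</rdf:RDF>"""

RDF_ABOUT = "{%s}about" % NS["rdf"]


def summarize(rdf_xml: str) -> dict:
    """Pull the name, its LSID and the publication LSID out of the document."""
    root = ET.fromstring(rdf_xml)
    name = root.find("tn:TaxonName", NS)
    if name is None:
        raise ValueError("no tn:TaxonName element found")
    pub = name.find("tcom:publishedInCitation/rdf:Description", NS)
    return {
        "lsid": name.get(RDF_ABOUT),
        "name": name.findtext("tn:nameComplete", namespaces=NS),
        "year": name.findtext("tn:year", namespaces=NS),
        "publication_lsid": pub.get(RDF_ABOUT) if pub is not None else None,
        # A tag that is absent simply comes back as None.
        "pages": pub.findtext("tpub:pages", namespaces=NS) if pub is not None else None,
    }


if __name__ == "__main__":
    for key, value in summarize(RDF_XML).items():
        print(key, "=", value)
```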

rdmpage avatar Mar 11 '21 13:03 rdmpage

@deepreef Oops, forgot your other questions. If there's no info then I would simply not include the corresponding tag, so no volume, no tag.
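
A minimal sketch of the "no info, no tag" rule, assuming the citation fields arrive as a simple dictionary. The `add_if_present` helper and the sample record are hypothetical; only the `tpub` namespace URI and the field names come from the template.

```python
import xml.etree.ElementTree as ET

TPUB = "http://rs.tdwg.org/ontology/voc/PublicationCitation#"
ET.register_namespace("tpub", TPUB)  # serialize with the tpub: prefix


def add_if_present(parent, local_name, value):
    """Append <tpub:local_name> only when there is something to put in it."""
    if value is None or str(value).strip() == "":
        return None  # no volume, no tag
    child = ET.SubElement(parent, "{%s}%s" % (TPUB, local_name))
    child.text = str(value).strip()
    return child


# Hypothetical record: 'number' is empty and 'pages' is unknown.
record = {"year": "1996", "volume": "98", "number": "", "pages": None}

citation = ET.Element("{%s}PublicationCitation" % TPUB)
for field, value in record.items():
    add_if_present(citation, field, value)

xml_text = ET.tostring(citation, encoding="unicode")

if __name__ == "__main__":
    print(xml_text)
```

Consumers can then treat a missing element as "unknown" instead of having to special-case empty strings.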

Other identifiers, yes please, especially DOIs (elsewhere I'm harvesting ZooBank's DWCA to add DOIs and other identifiers, but it would be nice to have the ones ZooBank already knows about).

rdmpage avatar Mar 11 '21 13:03 rdmpage

The challenge is to make sure you have 1-3, once you have that then the actual identifier technology doesn't matter so much (but of course, some have brand recognition, which is why DOIs are taking over the world).

Thanks for this context @rdmpage. I agree, and it's vital to keep this in mind in current decision-making on identifier systems. Of course, the governance structures to actually achieve persistence of data and services are an essential part of any relevant solution.

cboelling avatar Mar 11 '21 14:03 cboelling

@cboelling Yes, governance matters, but I would argue providing value to users should be the primary driver. If something isn't useful and doesn't help people do what they want to do, then all the governance in the world won't help.

rdmpage avatar Mar 11 '21 15:03 rdmpage

@rdmpage @mdoering : On the UUID thing; well... we're just going to have to agree to disagree. Especially in taxonomy, we already have the "identifier" that is human-friendly (it's the scientific name itself). From the perspective of humans, these identifiers have worked spectacularly well (otherwise they wouldn't still be in use a quarter-millennium after they were launched). Humans have no problem accommodating things like misspellings, alternate genus combinations, homonyms and the like. Computers, of course, have different needs in identifiers. They need to be globally unique and explicitly attached to the associated metadata, and above all, they should never change. Sure, integers work great for things like foreign keys and such -- which is why every database I create (including GNUB/ZooBank) uses integer fields for primary and foreign keys. I even have a system that unambiguously links each integer primary key to its corresponding UUID. But there's a reason it's a very (VERY) bad idea to use the value of a primary key field as your globally unique identifier. We could debate this indefinitely (as we have for years already, and as we no doubt will for years to come); but I'm much more interested in focusing this discussion on this:

Yes, governance matters, but I would argue providing value to users should be the primary driver. If something isn't useful and doesn't help people do what they want to do, then all the governance in the world won't help.

YES! YES! YES! Let's make stuff that people actually find useful! That's exactly why I was up until 2am this morning tweaking the XML service -- because someone might find it useful. It's also why I want to get the IPT up and running again, and why I'm eager to create a JSON-LD service and leverage Bioschemas. I'm going to need a bit of hand-holding to get those up and running, asking lots of rookie-level questions like "should I include the tags if the content is empty" and such.

@rdmpage : THANK YOU -- that's EXACTLY what I needed: an explicit template to implement. I'll stop typing this post and start coding now. Back in a bit.

deepreef avatar Mar 11 '21 19:03 deepreef

Of course, the moment after I posted that last note, I realized I was late for my first (of many) Zoom meetings for the day, so coding got delayed. However, I just now had my first break, and went straight to the coding.

I followed your template, and the result seems to pass the W3C Validator, so thank you for correcting the errors: http://zoobank.org/NomenclaturalActs.xml/A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B

I also added additional identifiers, when I have them. I can display the identifiers either with the dereferencing metadata, or without. In some cases, it's obvious that I should include the dereferencing metadata, for example:

Without: <dc:identifier>8831844</dc:identifier>
With: <dc:identifier>http://www.gbif.org/species/8831844</dc:identifier>

In the case of LSIDs, the dereferencing metadata is built into the identifier itself (i.e., the urn:lsid:zoobank.org:act: part)
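
To make the point about LSIDs concrete: the string has the form `urn:lsid:<authority>:<namespace>:<object>` with an optional trailing revision, so the authority and namespace can be read straight off the identifier. A minimal sketch of splitting one apart follows; the `LSID` tuple and `parse_lsid` function are names made up for this example, and the validation is deliberately loose.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str   # e.g. zoobank.org
    namespace: str   # e.g. act, pub, author
    object_id: str   # e.g. the UUID
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError("not an LSID: %r" % text)
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: %r" % text)
    if any(part == "" for part in parts[2:5]):
        raise ValueError("LSID has an empty component: %r" % text)
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(parts[2], parts[3], parts[4], revision)


if __name__ == "__main__":
    print(parse_lsid("urn:lsid:zoobank.org:act:A1AE7A00-32C6-4510-A1D6-6DDDA9129D8B"))
```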

But what about DOIs? Should I include the dereferencing metadata, or not:

Without: <dc:identifier>10.3897/zookeys.641.11500</dc:identifier>
With: <dc:identifier>https://doi.org/10.3897/zookeys.641.11500</dc:identifier>

For now, I'm including it: http://zoobank.org/NomenclaturalActs.xml/18c72d73-00c3-40e4-b27f-fa7748a1251e But I can very easily remove it.

One Rookie question: Among the declared references in the opening RDF tag, some of the URLs have a hash at the end, and some don't. Is that a thing? Should I strip the ending hash characters? Add them to the ones that lack them? Leave them as is? Probably not important, but I'm just letting my OCD run wild on this.

Awaiting further instructions to do even more stuff that people will find useful....

deepreef avatar Mar 11 '21 22:03 deepreef

One other note: there are some data quality issues because users enter data inconsistently. For example, the proper stored form of the DOI is 10.3897/zookeys.641.11500, but people will sometimes enter it as "https://doi.org/10.3897/zookeys.641.11500" or "doi: 10.3897/zookeys.641.11500". It's on my to-do list to clean all these up in the master database, but for now there is a lot of noise in there, so you'll get things that look like these: https://doi.org/https://doi.org/10.3897/zookeys.641.11500 or https://doi.org/doi: 10.3897/zookeys.641.11500

If this is a problem, I'll bump the clean-up task up higher in the priority list.
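
Until the master database is cleaned, a consumer could normalize these on the way in. The following is a minimal sketch, assuming the only noise is the kind shown above (repeated `https://doi.org/` or `doi:` prefixes). The function names are made up for this example and nothing here reflects how ZooBank itself handles DOIs.

```python
import re

# Matches one leading resolver URL or "doi:" label, case-insensitively.
_DOI_PREFIX = re.compile(
    r"^\s*(?:(?:https?://)?(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE
)


def bare_doi(raw: str) -> str:
    """Strip any number of stacked prefixes, leaving e.g. 10.3897/zookeys.641.11500."""
    value = raw.strip()
    while True:
        stripped = _DOI_PREFIX.sub("", value, count=1)
        if stripped == value:
            return value
        value = stripped


def doi_url(raw: str) -> str:
    """Return the DOI as a URL with exactly one https://doi.org/ prefix."""
    return "https://doi.org/" + bare_doi(raw)


if __name__ == "__main__":
    samples = [
        "10.3897/zookeys.641.11500",
        "https://doi.org/https://doi.org/10.3897/zookeys.641.11500",
        "https://doi.org/doi: 10.3897/zookeys.641.11500",
    ]
    for sample in samples:
        print(doi_url(sample))
```

Stripping in a loop rather than once is what handles the doubled prefixes.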

deepreef avatar Mar 11 '21 22:03 deepreef

@deepreef Regarding the namespaces in the rdf:RDF tag, they can end in either a forward slash / or a hash #, depending on the choice made by whoever created that vocabulary. Given that this is the delimiter between the namespace name and the property you need to keep them, for example, http://purl.org/dc/elements/1.1/identifier (= dc:identifier) and http://www.w3.org/1999/02/22-rdf-syntax-ns#Description (= rdf:Description). See HashVsSlash for background.
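
A small sketch of why the final character matters: a prefixed name is expanded by plain string concatenation of the declared namespace and the local name, so the trailing `/` or `#` in the declaration is the delimiter itself. The `expand` helper is made up for this example; the namespace URIs are the ones declared in the template.

```python
# Namespace declarations as they appear on the rdf:RDF element.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "tn": "http://rs.tdwg.org/ontology/voc/TaxonName#",
}


def expand(qname: str, namespaces=NAMESPACES) -> str:
    """Expand prefix:local into a full URI by simple concatenation."""
    prefix, local = qname.split(":", 1)
    return namespaces[prefix] + local


if __name__ == "__main__":
    print(expand("dc:identifier"))
    print(expand("rdf:Description"))
    print(expand("tn:nameComplete"))

    # If the trailing '#' were stripped from the declaration, the same
    # concatenation would produce a different (and wrong) URI.
    broken = {"tn": "http://rs.tdwg.org/ontology/voc/TaxonName"}
    print(expand("tn:nameComplete", broken))
```

The last line prints a URI ending in `TaxonNamenameComplete`, which no longer identifies the intended term.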

rdmpage avatar Mar 11 '21 23:03 rdmpage

@deepreef Regarding identifiers there are a bunch of ways to include and represent DOIs (that there are so many ways to do things is yet another reason RDF is hard work).

If you are going to use dc:identifier then my suggestion is to store it as a URL with the prefix https://doi.org/, so <dc:identifier>https://doi.org/10.3897/zookeys.641.11500</dc:identifier>.

rdmpage avatar Mar 11 '21 23:03 rdmpage

Thanks, @rdmpage

Given that this is the delimiter between the namespace name and the property you need to keep them, for example, http://purl.org/dc/elements/1.1/identifier (= dc:identifier) and http://www.w3.org/1999/02/22-rdf-syntax-ns#Description (= rdf:Description).

I get that part (when used as a delimiter). I was talking about the terminal character in the URL; e.g.: "http://rs.tdwg.org/ontology/voc/TaxonConcept#" vs. "http://rs.tdwg.org/ontology/voc/TaxonConcept"

I'll assume they're there for a reason.

RE: DOIs: OK, I'll leave them with the https://doi.org/ prefix (dereferencing metadata)

deepreef avatar Mar 12 '21 00:03 deepreef