Australian Plant Names Index (APNI), Australian Plant Census (APC) and Australian National Species List (NSL)

Open dfalster opened this issue 4 years ago • 30 comments

Thanks for the great package. Wondering about the possibility of connecting taxize to these Australian taxonomic resources, which are all accessible via good APIs, as part of the Australian National Species List (NSL) infrastructure.

There are two main services, available via https://biodiversity.org.au/nsl/services/,

  • The Australian Plant Census (APC) provides a nationally-accepted taxonomy
  • The Australian Plant Name Index (APNI) provides names and bibliographic information.

As described at the above link "this section of the National Species List infrastructure delivers names and taxonomies for flowering plants, ferns, gymnosperms, hornworts, and liverworts. The data comprise names, bibliographic information, and taxonomic concepts for plants that are either native to or naturalised in Australia. ..... The taxonomy and nomenclature adopted for the APC are endorsed by the Council of Heads of Australasian Herbaria (CHAH)." There is also a tree available at https://biodiversity.org.au/nsl/services/rest/tree/apni/51209179

The API is described at https://biodiversity.org.au/nsl/docs/main.html

Having a programmatic interface in R to these resources would be a big deal for Australian research. If it's possible to add to taxize, this seems preferable to developing a separate package.

Can you let us know whether you think this would be possible @sckott ?

dfalster avatar Apr 06 '20 00:04 dfalster

thanks @dfalster !

I'll have a look into the docs. At first glance at the docs I think it will work, but i'll get back to you soon with further thoughts

Are there equivalent data sources for Australian animals?

sckott avatar Apr 06 '20 20:04 sckott

Actually, I played with the API a little bit but I don't see any real search capability. For example, you can search on the website for APNI names here https://biodiversity.org.au/nsl/services/APNI but with the API I don't see any way to do the same thing. There's this https://biodiversity.org.au/nsl/docs/main.html#taxon-search API route, but it only appears to return one name

curl -L -H "Accept: application/json" 'https://biodiversity.org.au/nsl/services/api/name/taxon-search?q=Acacia' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1695    0  1695    0     0    652      0 --:--:--  0:00:02 --:--:--   652
{
  "records": {
    "synonyms": [
      {
        "taxonID": "https://id.biodiversity.org.au/taxon/apni/51311124",
        "nameType": "scientific",
        "acceptedNameUsageID": "https://id.biodiversity.org.au/taxon/apni/51311124",
        "acceptedNameUsage": "Acacia Mill.",
        "nomenclaturalStatus": null,
        "taxonomicStatus": "accepted",
        "proParte": "false",
        "scientificName": "Acacia Mill.",
        "scientificNameID": "https://id.biodiversity.org.au/name/apni/56859",
        "canonicalName": "Acacia",
        "scientificNameAuthorship": "Mill.",
        "parentNameUsageID": "https://id.biodiversity.org.au/taxon/apni/51351217",
        "taxonRank": "Genus",
        "taxonRankSortOrder": "120",
        "kingdom": "Plantae",
        "class": "Equisetopsida",
        "subclass": "Magnoliidae",
        "family": "Fabaceae",
        "created": "2009-12-15 11:08:09.0",
        "modified": "2009-12-15 11:08:09.0",
        "datasetName": "APC",
        "taxonConceptID": "https://id.biodiversity.org.au/instance/apni/603762",
        "nameAccordingTo": "CHAH (2006), Australian Plant Census",
        "nameAccordingToID": "https://id.biodiversity.org.au/reference/apni/42942",
        "taxonRemarks": null,
        "taxonDistribution": "WA (native and naturalised), NT (native and naturalised), SA (native and naturalised), Qld (native and naturalised), NSW (native and naturalised), NI (naturalised), ACT (native and naturalised), Vic (native and naturalised)",
        "higherClassification": "Plantae|Charophyta|Equisetopsida|Magnoliidae|Rosanae|Fabales|Fabaceae|Acacia",
        "firstHybridParentName": null,
        "firstHybridParentNameID": null,
        "secondHybridParentName": null,
        "secondHybridParentNameID": null,
        "nomenclaturalCode": "ICN",
        "license": "http://creativecommons.org/licenses/by/3.0/",
        "ccAttributionIRI": "https://id.biodiversity.org.au/taxon/apni/51311124"
      }
    ],
    "acceptedNames": {}
  },
  "status": {
    "enumType": "org.springframework.http.HttpStatus",
    "name": "OK"
  }
}

whereas on the website you get many names in the results. I think we really need that fuzzy search capability to be able to make a get_apni() or get_apc() function - which then forms the basis for incorporating these data sources into other useful functions in taxize.

Another key thing I'd like to see is the ability to get children and parent taxa. From the output above it looks like we have parent information which is good, but not seeing a way to get taxonomic children. Do you see that in the docs?
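To make the parent information concrete, here is a small base R sketch that pulls the parent out of a record like the one above. The field values are copied from that response; the variable names (`record`, `lineage`, `parent_name`) are just mine for illustration.

```r
# Values copied from the taxon-search response shown above.
record <- list(
  canonicalName = "Acacia",
  taxonRank = "Genus",
  parentNameUsageID = "https://id.biodiversity.org.au/taxon/apni/51351217",
  higherClassification = "Plantae|Charophyta|Equisetopsida|Magnoliidae|Rosanae|Fabales|Fabaceae|Acacia"
)

# The lineage is pipe-delimited; fixed = TRUE stops "|" being read as regex alternation.
lineage <- strsplit(record$higherClassification, "|", fixed = TRUE)[[1]]

# The immediate parent is the second-to-last element of the lineage.
parent_name <- lineage[length(lineage) - 1]

# The parent identifier is already a full link that can be followed as-is.
parent_link <- record$parentNameUsageID
```

So going up the tree looks doable from this response alone; going down to children is the open question.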

sckott avatar Apr 06 '20 20:04 sckott

Hi @sckott

Thanks so much for taking a look so promptly. Really appreciate it. We're looking for a tool to query the APC and APNI programmatically. I suspect if we got this going it would be used widely within Australia, not only by my group :).

You're right, I can't see how to search a list of taxa in the docs. (But note, I don't really know what I'm looking for either, as APIs are not something I'm good at. What would such a query look like?)

As for the children, I can see that if you select a species, you can get the parent, and if you click on the parent you can get the children. From your example above, if you follow the link that is returned for Acacia, https://id.biodiversity.org.au/taxon/apni/51311124, you get a page that lists all the children.

So I wonder if it's a matter of first locating the id, then fetching the children?

If you can outline the interface needed, I can enquire whether it is possible with relevant people.

dfalster avatar Apr 07 '20 06:04 dfalster

thanks, having a look

sckott avatar Apr 07 '20 22:04 sckott

been working on some utility functions, remotes::install_github("ropensci/taxize@australian"), then see ?apni - Making progress.

  • the acceptable_names function seems to do a good job of fuzzy searching, which we need
  • the apni_classification allows us to get the taxonomic classification for a taxon id, which we need
  • I still haven't found a way to get children as you demonstrated in the browser flow above. We definitely need an API method to get children of a taxon to make this as useful as possible

sckott avatar Apr 07 '20 23:04 sckott

Fantastic. Just tried it out, looking useful already!

So if I understand right, the two main features missing from the API are

  1. Ability to search a list of names, e.g.
> apni_search(q = c("Acacia", "Eucalyptus"))
Error: Bad Request (HTTP 400)

Can we solve this one by vectorising on the taxize side?

  2. Ability to access children? If we take Acacia as an example, the id is 51311124, so https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124 is the web page with children. Adding .json to the end of this gives the data, which seems to give children?
x <- jsonlite::read_json("https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124.json") 
x$treeElement$children[[1]]

Also, couldn't see how to extract the apni_id for given taxa after retrieving search results.

Again, I don't know APIs so the above may be off track.
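On the first point, I imagine vectorising on the R side could be as simple as looping over the names and making one request each. A rough sketch of the idea, where `search_one()` is a hypothetical stand-in for whatever function makes a single API request (it is not a taxize function, and here it just echoes the query so the example runs offline):

```r
# Hypothetical single-name search; a stand-in for a function doing one API request.
search_one <- function(q) {
  stopifnot(is.character(q), length(q) == 1)
  # A real version would make an HTTP request here; this stub just echoes the query.
  list(query = q)
}

# Vectorised wrapper: one call per name, with results named by the input.
search_many <- function(q) {
  stats::setNames(lapply(q, search_one), q)
}

res <- search_many(c("Acacia", "Eucalyptus"))
names(res)
```

Naming the results by the input makes it easy to see which query produced which result.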

dfalster avatar Apr 08 '20 00:04 dfalster

yes, can vectorize the fxns, just hadn't gotten to that yet.

nice, I think that children solution should work, will try that

sckott avatar Apr 08 '20 01:04 sckott

Another question, how does taxize handle searches when there are spelling mistakes? I notice the APNI just returns "no results".

This is a very common issue with taxonomic name searches. In the past I have used Taxonstand, which included an argument max.distance: A number indicating the maximum distance allowed for a match in agrep when performing corrections of spelling errors in specific epithets. Guessing you need this on server side, so if not possible in the web interface, won't be possible via taxize.

dfalster avatar Apr 08 '20 01:04 dfalster

taxize doesn't do anything automatically regarding spelling mistakes on the R side. i consider that a separate step for sure, distinct from searching one of the data sources. data sources vary widely in how they handle spelling. some do fuzzy search in which they account for possible spelling mistakes and return the closest matches, while some data sources do not fuzzy match and so return nothing or similar when no results are found. when no results are found we typically give back NA or similar

there are specific functions in taxize to "resolve" names, e.g., gnr_resolve() and tnrs(). i'd suggest running names through a resolver function first if there's concern there may be spelling mistakes. i wish there was a better solution.

sckott avatar Apr 08 '20 19:04 sckott

Thanks for explaining. Makes sense. I can sort something for some fuzzy matching locally.
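For the record, a minimal sketch of what I mean by local fuzzy matching, using base R's `adist()` edit distance. The reference list and the `fuzzy_match()` helper are made up for illustration; a real reference list would come from a downloaded copy of the names.

```r
# A small reference list of names (illustrative only).
reference <- c("Eucalyptus regnans", "Acacia abbatiana", "Acacia aneura")

# Return the closest reference name within max.distance edits, or NA if none.
fuzzy_match <- function(name, reference, max.distance = 2) {
  # adist() returns a 1 x length(reference) matrix of edit distances.
  d <- utils::adist(name, reference, ignore.case = TRUE)[1, ]
  best <- which.min(d)
  if (d[best] <= max.distance) reference[best] else NA_character_
}

fuzzy_match("Eucalyptus regnant", reference)
fuzzy_match("Banksia serrata", reference)
```

The first call is one edit away from a reference name, so it resolves; the second has nothing close enough and gives NA. The corrected name could then be passed on to the search functions.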

I now have contact details for the folks behind the APC/APNI service, so can put you in contact or deliver questions there, as needed

dfalster avatar Apr 08 '20 21:04 dfalster

Good news about getting contacts. I'll work on this soon and see if there's any questions I have

sckott avatar Apr 09 '20 19:04 sckott

When asked whether we should link to APC, APNI or both, Anna Monro provided this description (pasted here with her permission):

it depends on what you're trying to achieve (sorry, isn't that always the answer?).

  • In APNI we endeavour to capture all the names ever applied to the Australian flora in the botanical literature, for both native and naturalised plants. This includes things like phrase names, hybrid formulae, and names and designations that are not actually acceptable under the Code of botanical nomenclature (like illegitimate names and invalidly published designations). My general idea of APNI is that if a name has appeared in a botanical taxonomic work (e.g. a flora, a census, a checklist, a book on eucalypts), you could throw it at APNI and get a result, even if it was a typo or a temporary placeholder.

  • APNI also endeavours to record all the significant published works in which the given name appeared. The main APNI entry for a name will record every place it was listed as an accepted name and the works in which that occurred. However, there are also cross-references built in, so if it was used as a synonym of another name or it was misapplied to another taxon, you'll see a work listed with the notation "synonym of: " or "misapplied to: ".

  • APC uses the APNI data as a basis to build an accepted taxonomy of the APNI native and naturalised flora. The aim is theoretically to account for every name listed in APNI, whether that be as an accepted name, a synonym or a misapplication. So the vision with APC would be that you stick a name in and APC either indicates it's currently accepted or it points you at the currently accepted name (or names; it's not always 1:1).

  • APC is largely complete, other than the Orchidaceae and a backlog of names published recently that are yet to be considered.

Anna can answer questions about overall diagnosis and usage. For more technical questions on the API, Anna has directed us to Anne Fuschs and her team

I can contact them both as needed.

dfalster avatar Apr 09 '20 23:04 dfalster

For fuzzy search you can use the suggestions API on APNI and APC. It is meant for suggestions as you type and is case insensitive. It does not help with spelling mistakes. https://biodiversity.org.au/nsl/docs/main.html#suggestions-api-v1-0

We are here on github https://github.com/bio-org-au BTW so you can contact us there too and see what the API code actually does. :-)

pmcneil avatar Apr 10 '20 04:04 pmcneil

@pmcneil regarding the results of this request https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124.json - I'm interested in pulling out data for each child of the target taxon, here's the first one as a list in R:

$displayHtml
[1] "<data><scientific><name data-id='165295'><scientific><name data-id='56859'><element>Acacia</element></name></scientific> <element>abbatiana</element> <authors><author data-id='1524' title='Pedley, L.'>Pedley</author></authors></name></scientific><name-status class=\"legitimate\">, legitimate</name-status> <citation><ref data-id='42942'><ref-section><author>CHAH</author> <year>(2006)</year>, <par-title><i>Australian Plant Census</i></par-title></ref-section></ref></citation></data>"

$elementLink
[1] "https://id.biodiversity.org.au/tree/51352295/51223378"

$nameLink
[1] "https://id.biodiversity.org.au/name/apni/165295"

$instanceLink
[1] "https://id.biodiversity.org.au/instance/apni/603763"

$excluded
[1] FALSE

$depth
[1] 9

$synonymsHtml
[1] "<synonyms><nom><scientific><name data-id='190638'><scientific><name data-id='103551'><element>Racosperma</element></name></scientific> <element>abbatianum</element> <authors>(<base data-id='1524' title='Pedley, L.'>Pedley</base>) <author data-id='1524' title='Pedley, L.'>Pedley</author></authors></name></scientific><name-status class=\"legitimate\">, legitimate</name-status> <year>(2003)</year> <type>nomenclatural synonym</type></nom><tax><scientific><name data-id='168777'><scientific><name data-id='56859'><element>Acacia</element></name></scientific> <element>sp. Mt Abbot (A.R.Bean 4873)</element></name></scientific><name-status class=\"[n/a]\">, [n/a]</name-status> <year>(1997)</year> <type>taxonomic synonym</type></tax></synonyms>"

Seems that I'd need to further parse those html strings to get names and other data out. Is there a different route or content type that I can request that has that data parsed already? The html in displayHtml doesn't seem to be organized in a way that I can figure out how to parse with xpath. If we look at the displayHtml from above

<html>
<body>
  <data>
    <scientific>
      <name data-id="165295">
        <scientific>
          <name data-id="56859">
            <element>Acacia</element>
          </name>
        </scientific>
        <element>abbatiana</element> 
        <authors>
          <author data-id="1524" title="Pedley, L.">Pedley</author>
        </authors>
      </name>
    </scientific>
    <name-status class="legitimate">, legitimate</name-status>
    <citation>
      <ref data-id="42942"><ref-section><author>CHAH</author><year>(2006)</year>, <par-title><i>Australian Plant Census</i></par-title></ref-section>
      </ref>
    </citation>
  </data>
</body>
</html>

The first <element> is nested within its own <scientific> tag, but then the 2nd <element> is not nested within its own <scientific> tag. Maybe I'm missing something here?

sckott avatar Apr 10 '20 23:04 sckott

This is linked data, so follow the links. The https://id.biodiversity.org.au/name/apni/165295 link will get you name data. If you ask for it in XML, JSON or just HTML you'll get that as a result. See https://biodiversity.org.au/nsl/docs/main.html#name for example. The display HTML is there to provide a quick way of displaying quite complex results. The embedded name HTML is marked up to a) make it parsable and b) make the display of the name in HTML configurable using CSS. Note the data-id attributes are ONLY for linking up name parts in Javascript etc. in browser, not to be stored as a reference to the object. ALWAYS use the ID (https://id....) as the reference. Once again, this is linked data.

On linked data, above you are using this 'https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124.json' which is fine for a question, but it is not the reference to the tree element that should be quoted or passed around, the element link is https://id.biodiversity.org.au/tree/51352295/51342774

Re the nesting of <scientific> or actually <name> elements: You're almost there, it is nested, i.e. there are two name parts in the name: the Acacia is a scientific name in itself, and the Acacia abbatiana is a name with two parts. Using xpath name.element = abbatiana, name.scientific.name.element = Acacia - looks counter intuitive in xpath, but the name in question here is abbatiana, and it has a parent part Acacia. (hope that makes sense :-) )
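As a rough illustration of that nesting in R, here is a sketch using the xml2 package (assumed to be installed). The string is the displayHtml from above with the citation part left out for brevity, and the variable names are only for illustration.

```r
library(xml2)

# displayHtml for the first child, copied from the response above
# (the citation part is omitted for brevity).
display_html <- paste0(
  "<data><scientific><name data-id='165295'>",
  "<scientific><name data-id='56859'><element>Acacia</element></name></scientific> ",
  "<element>abbatiana</element> ",
  "<authors><author data-id='1524' title='Pedley, L.'>Pedley</author></authors>",
  "</name></scientific>",
  "<name-status class=\"legitimate\">, legitimate</name-status></data>"
)

doc <- read_xml(display_html)

# The outer <name> holds the epithet directly; the genus sits in the nested <name>.
epithet <- xml_text(xml_find_first(doc, "/data/scientific/name/element"))
genus <- xml_text(xml_find_first(doc, "/data/scientific/name/scientific/name/element"))
author <- xml_text(xml_find_first(doc, "/data/scientific/name/authors/author"))

paste(genus, epithet, author)
```

This only shows that the markup is parsable; for anything you intend to store, follow the nameLink rather than scraping the display string.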

pmcneil avatar Apr 12 '20 23:04 pmcneil

Thanks @pmcneil for the explanations. I still don't quite grok all the different identifiers. Is there any documentation on the identifiers?

sckott avatar Apr 13 '20 21:04 sckott

All the existing documentation is at https://biodiversity.org.au/nsl/docs/main.html

There's probably not much to know about identifiers. Just remember, everything that starts with https://id.biodiversity.org.au is an identifier, identifiers are a "black box" that identifies something or other, and you can find what it identifies by going to that URL. (and if you add .json to the URL, or pass application/json as the contentType on the request, you'll find out what is behind the identifier in json format).
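In R terms, the ".json" route is just string manipulation on the identifier. A tiny sketch, where `id_to_json_url()` is a made-up helper name and the fetch is left commented out because it needs network access and the jsonlite package:

```r
# Build the JSON URL for an identifier by appending ".json"
# (any trailing slash is dropped first).
id_to_json_url <- function(id) {
  paste0(sub("/+$", "", id), ".json")
}

url <- id_to_json_url("https://id.biodiversity.org.au/name/apni/165295")
url

# Fetching needs network access, so it is not run here:
# x <- jsonlite::read_json(url)
# names(x)
```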

chrisbitmead avatar Apr 14 '20 05:04 chrisbitmead

@chrisbitmead The identifiers refer to specific objects/things e.g. name, reference, author, "instance" etc.

The instance is in many ways a taxon, though the taxon link is to where an accepted instance sits in the accepted classification (APC). While we do have documentation it may not be adequate. I believe Anne is going to respond to your queries too, but I hope the above makes some sense?

pmcneil avatar Apr 15 '20 02:04 pmcneil

Thanks @chrisbitmead and @pmcneil for the clarifications

I've not been able to work on this for a few weeks, i'll get back to this soon

sckott avatar Apr 24 '20 15:04 sckott

to do:

  • ~~is there pagination? https://github.com/bio-org-au/nsl-documentation/issues/1~~ there is no pagination
  • [x] get_apni() - main user facing fxn in taxize, not sure what to use; right now thinking to use the suggest API because it can do fuzzy search and is fast to parse
  • [x] add support to classification
  • [x] add support to children
  • [x] vectorize functions where possible

sckott avatar Sep 18 '20 00:09 sckott

@dfalster get_apni() working now and all apni utility fxns vectorized.

children and classification not working yet.

problem with children i'm hitting is I don't see a way to get to the id needed for the page that has children from a name id, e.g., above you shared the link https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124 that has taxonomic children for Acacia. however, the name id for Acacia is https://biodiversity.org.au/nsl/services/rest/name/apni/56859/api/apni-format and I don't see how to get that 51311124 id programmatically to be able to get children. Any ideas?

sckott avatar Nov 11 '20 03:11 sckott

@sckott Does this help?...

https://biodiversity.org.au/nsl/services/rest/name/apni/56859/api/apc.json

chrisbitmead avatar Nov 11 '20 05:11 chrisbitmead

@chrisbitmead ah thanks, that will probably do it
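rough sketch of the flow I have in mind: request apc.json for the name id, find the taxon link in the response, then add .json to that link to get the children. I haven't pinned down the field names in the apc.json response, so the helper below (a made-up name, `find_taxon_links()`) just scans every string in the parsed response, and `mock` is a hypothetical stand-in for that response so this runs offline.

```r
# Collect every string in a parsed JSON response that looks like a taxon link.
find_taxon_links <- function(x) {
  strings <- as.character(unlist(x, use.names = FALSE))
  unique(grep("/taxon/apni/[0-9]+$", strings, value = TRUE))
}

# Hypothetical stand-in for the parsed apc.json response; real field names may differ.
mock <- list(
  name = list(link = "https://id.biodiversity.org.au/name/apni/56859"),
  taxon = list(link = "https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124")
)

taxon_url <- find_taxon_links(mock)

# Adding ".json" to the taxon link gives the page that carries the children.
children_url <- paste0(taxon_url, ".json")
children_url
```

With the real response, the children would then be read from `x$treeElement$children` as in the earlier example.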

sckott avatar Nov 11 '20 19:11 sckott

Thanks for your continued work here @sckott !

dfalster avatar Nov 11 '20 19:11 dfalster

@dfalster when you get a chance to try this out: children and classification are now done as well.

remotes::install_github("ropensci/taxize@australian")
library(taxize)
# see man files for each fxn
?`apni-search`
?apni_classification
?apni_children
?apni_family
?apni_id
?children
?classification

sckott avatar Nov 11 '20 23:11 sckott

Hi @sckott. Cool! seems to be mostly working. I can confirm that get_apni, apni_classification, apni_children, apni_search all work well.

The only issue I encountered is that apni_family does not return sensible results. E.g. the following is a search for Eucalyptus regnans, id = 101747, which should return "Myrtaceae":

> apni_family(id = 101747)
[[1]]
[[1]]$name
[1] "regnans"

[[1]]$link
[1] "https://id.biodiversity.org.au/name/apni/101747"

[[1]]$instances
# A tibble: 30 x 7
   type      link              pages   name                                      protologue citation                                   auth_year          
   <chr>     <chr>             <chr>   <chr>                                     <lgl>      <chr>                                      <chr>              
 1 secondar… https://id.biodi… 181     <scientific><name data-id='54484'><eleme… FALSE      Bailey, F.M. (1913), Comprehensive Catalo… F.M.Bailey, 1913   
 2 secondar… https://id.biodi… 54

The other BIG point to consider is that the Australian taxonomic system has two components: The Australian Plant Names Index (APNI) & the Australian Plant Census (APC). So far we have linked against the first, but ideally we would be able to query both. I'm not an expert on the distinction, but my understanding is that the APC contains all the information about currently accepted species, including whether a name is a synonym or not.

dfalster avatar Nov 17 '20 00:11 dfalster

Thanks for having a look!

Okay, i'll

  • [ ] have a look at family
  • [ ] make a toggle for apc or apni

sckott avatar Nov 17 '20 02:11 sckott

@dfalster hmm, do we need the family function? I don't think you asked for it as far as I can remember. I think I added it just cause the route is there, but you can easily get family with classification(101747, db = "apni"). Okay if I remove the apni_family function?

sckott avatar Nov 20 '20 21:11 sckott

@dfalster For APC vs. APNI, it doesn't seem like a simple thing we can allow users to switch between. Let's look at the API routes used in the functions we have so far:

  • apni_search: uses nsl/services/api/name/taxon-search/. docs say it's for APC only right now.
  • apni_suggest: uses nsl/services/suggest/acceptableName/. docs have
      • apni-search - search APNI on full name as per the apni name search service,
      • apc-search - search APC on full name as per the search service

    whereas I'm using acceptableName (last part of route above). So looks like I could allow users to go between apni and apc for this fxn

  • apni_acceptable_names: uses nsl/services/api/name/acceptable-name/ - docs don't mention whether this is for APC or APNI or both or what
  • apni_classification: uses nsl/services/rest/name/apni/{id}/api/branch/ - docs say this route gets the APC branch, but there doesn't appear to be an APNI version of this
  • apni_children: uses nsl/services/rest/name/apni/{id}/api/apc/ - grabs a url within that response, e.g., https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124, and uses that to get children - does that have an APC equivalent? I don't know
  • apni_id: uses nsl/services/rest/name/apni/{id}/ - I don't see an APC equivalent for this one, do you?
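to make the comparison easier to follow, here are the routes above collected in one place as a sketch. This is not how the functions on the branch are written; `routes` and `route_url()` are names made up for this example, and "%s" marks where the name id goes.

```r
base_url <- "https://biodiversity.org.au/nsl/services"

# Route templates as listed above; "%s" marks where the name id goes.
routes <- c(
  search = "/api/name/taxon-search",
  suggest = "/suggest/acceptableName",
  acceptable_names = "/api/name/acceptable-name",
  classification = "/rest/name/apni/%s/api/branch",
  children = "/rest/name/apni/%s/api/apc",
  id = "/rest/name/apni/%s"
)

# Build a full URL for a route, filling in the id when the template needs one.
route_url <- function(route, id = NULL) {
  path <- routes[[route]]
  if (grepl("%s", path, fixed = TRUE)) {
    stopifnot(!is.null(id))
    path <- sprintf(path, id)
  }
  paste0(base_url, path)
}

route_url("classification", 56859)
route_url("search")
```

every id-based route has "apni" hard-coded in the path, which is why an APC toggle isn't a simple swap.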

sckott avatar Nov 20 '20 22:11 sckott

@dfalster ☝🏽 any thoughts?

sckott avatar Dec 09 '20 18:12 sckott