taxize
Australian Plant Names Index (APNI), Australian Plant Census (APC) and Australian National Species List (NSL)
Thanks for the great package. Wondering about the possibility of connecting taxize to these Australian taxonomic resources, which are all accessible via good APIs, as part of the Australian National Species List (NSL) infrastructure.
There are two main services, available via https://biodiversity.org.au/nsl/services/,
- The Australian Plant Census (APC) provides a nationally-accepted taxonomy
- The Australian Plant Name Index (APNI) provides names and bibliographic information.
As described at the above link "this section of the National Species List infrastructure delivers names and taxonomies for flowering plants, ferns, gymnosperms, hornworts, and liverworts. The data comprise names, bibliographic information, and taxonomic concepts for plants that are either native to or naturalised in Australia. ..... The taxonomy and nomenclature adopted for the APC are endorsed by the Council of Heads of Australasian Herbaria (CHAH)." There is also a tree available at https://biodiversity.org.au/nsl/services/rest/tree/apni/51209179
The API is described at https://biodiversity.org.au/nsl/docs/main.html
Having a programmatic interface in R to these resources would be a big deal for Australian research. If it's possible to add to taxize, this seems preferable to developing a separate package.
Can you let us know whether you think this would be possible @sckott ?
thanks @dfalster !
I'll have a look into the docs. At first glance at the docs I think it will work, but i'll get back to you soon with further thoughts
Are there equivalent data sources for Australian animals?
Actually, I played with the API a little bit but I don't see any real search capability. For example, you can search on the website for APNI names here https://biodiversity.org.au/nsl/services/APNI but with the API I don't see any way to do the same thing. There's this https://biodiversity.org.au/nsl/docs/main.html#taxon-search API route, but it appears to return only one name
curl -L -H "Accept: application/json" 'https://biodiversity.org.au/nsl/services/api/name/taxon-search?q=Acacia' | jq .
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1695 0 1695 0 0 652 0 --:--:-- 0:00:02 --:--:-- 652
{
"records": {
"synonyms": [
{
"taxonID": "https://id.biodiversity.org.au/taxon/apni/51311124",
"nameType": "scientific",
"acceptedNameUsageID": "https://id.biodiversity.org.au/taxon/apni/51311124",
"acceptedNameUsage": "Acacia Mill.",
"nomenclaturalStatus": null,
"taxonomicStatus": "accepted",
"proParte": "false",
"scientificName": "Acacia Mill.",
"scientificNameID": "https://id.biodiversity.org.au/name/apni/56859",
"canonicalName": "Acacia",
"scientificNameAuthorship": "Mill.",
"parentNameUsageID": "https://id.biodiversity.org.au/taxon/apni/51351217",
"taxonRank": "Genus",
"taxonRankSortOrder": "120",
"kingdom": "Plantae",
"class": "Equisetopsida",
"subclass": "Magnoliidae",
"family": "Fabaceae",
"created": "2009-12-15 11:08:09.0",
"modified": "2009-12-15 11:08:09.0",
"datasetName": "APC",
"taxonConceptID": "https://id.biodiversity.org.au/instance/apni/603762",
"nameAccordingTo": "CHAH (2006), Australian Plant Census",
"nameAccordingToID": "https://id.biodiversity.org.au/reference/apni/42942",
"taxonRemarks": null,
"taxonDistribution": "WA (native and naturalised), NT (native and naturalised), SA (native and naturalised), Qld (native and naturalised), NSW (native and naturalised), NI (naturalised), ACT (native and naturalised), Vic (native and naturalised)",
"higherClassification": "Plantae|Charophyta|Equisetopsida|Magnoliidae|Rosanae|Fabales|Fabaceae|Acacia",
"firstHybridParentName": null,
"firstHybridParentNameID": null,
"secondHybridParentName": null,
"secondHybridParentNameID": null,
"nomenclaturalCode": "ICN",
"license": "http://creativecommons.org/licenses/by/3.0/",
"ccAttributionIRI": "https://id.biodiversity.org.au/taxon/apni/51311124"
}
],
"acceptedNames": {}
},
"status": {
"enumType": "org.springframework.http.HttpStatus",
"name": "OK"
}
}
whereas on the website you get many names in the results. I think we really need that fuzzy search capability to be able to make a get_apni() or get_apc() function, which then forms the basis for incorporating these data sources into other useful functions in taxize.
Another key thing I'd like to see is the ability to get children and parent taxa. From the output above it looks like we have parent information which is good, but not seeing a way to get taxonomic children. Do you see that in the docs?
Hi @sckott
Thanks so much for taking a look so promptly. Really appreciate it. We're looking for a tool to query the APC and APNI programmatically. I suspect if we got this going it would be used widely within Australia, not only by my group :).
You're right, I can't see how to search a list of taxa in the docs. (But note, I don't really know what I'm looking for either, as APIs are not something I'm good at. What would such a query look like?)
As for the children, I can see that if you select a species, you can get the parent, and if you click on the parent you can get the children. From your example above, if you follow the link that is returned for Acacia, https://id.biodiversity.org.au/taxon/apni/51311124, you get a page that lists all the children.
So I wonder if it's a matter of first locating the id, then fetching the children?
If you can outline the interface needed, I can enquire whether it is possible with relevant people.
thanks, having a look
been working on some utility functions: remotes::install_github("ropensci/taxize@australian"), then see ?apni
- Making progress.
- the acceptable_names function seems to do a good job of fuzzy searching, which we need
- the apni_classification function allows us to get the taxonomic classification for a taxon id, which we need
- I still haven't found a way to get children as you demonstrated in the browser flow above. We definitely need an API method to get children of a taxon to make this as useful as possible
Fantastic. Just tried it out, looking useful already!
So if I understand right, the two main features missing from the API are
- Ability to search a list of names, e.g.
> apni_search(q = c("Acacia", "Eucalyptus"))
Error: Bad Request (HTTP 400)
Can we solve this one by vectorising on the taxize side?
- Ability to access children?
If we take Acacia as an example, the id is 51311124, so https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124 is the web page with children. Adding .json to the end of this gives the data, which seems to give children?
x <- jsonlite::read_json("https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124.json")
x$treeElement$children[[1]]
Also, couldn't see how to extract the apni_id for given taxa after retrieving search results.
Again, I don't know APIs so the above may be off track.
yes, can vectorize the fxns, just hadn't gotten to that yet.
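A minimal sketch of what vectorising on the R side could look like, assuming there is a function that searches one name at a time. `search_many()` and `fake_search()` are hypothetical names used only for illustration and are not part of taxize. Errors for one name are caught so the remaining names still return.

```r
# Hypothetical helper: apply a single-name search function to each name,
# so a failure for one name does not abort the others.
search_many <- function(names, search_one) {
  stopifnot(is.character(names), is.function(search_one))
  out <- lapply(names, function(nm) {
    tryCatch(search_one(nm), error = function(e) NA)
  })
  stats::setNames(out, names)
}

# Stand-in for a real single-name API call (assumption, no network used):
# it rejects anything that is not exactly one string, like the HTTP 400 above.
fake_search <- function(q) {
  if (length(q) != 1L) stop("Bad Request (HTTP 400)")
  paste("result for", q)
}

res <- search_many(c("Acacia", "Eucalyptus"), fake_search)
```

The result is a list named by the input names, which keeps each answer tied to the query that produced it.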
nice, I think that children solution should work, will try that
Another question, how does taxize handle searches when there are spelling mistakes? I notice the APNI just returns "no results".
This is a very common issue with taxonomic name searches. In the past I have used Taxonstand, which included an argument max.distance: A number indicating the maximum distance allowed for a match in agrep when performing corrections of spelling errors in specific epithets. Guessing you need this on server side, so if not possible in the web interface, won't be possible via taxize.
taxize doesn't do anything automatically regarding spelling mistakes on the R side. i consider that a separate step for sure, distinct from searching one of the data sources. data sources vary widely in how they handle spelling. some do fuzzy search in which they account for possible spelling mistakes and return the closest matches, while some data sources do not fuzzy match and so return nothing or similar when no results are found. when no results are found we typically give back NA or similar
there are specific functions in taxize to "resolve" names, e.g., gnr_resolve() and tnrs(). i'd suggest running names through a resolver function first if there's concern there may be spelling mistakes. i wish there was a better solution.
Thanks for explaining. Makes sense. I can sort something for some fuzzy matching locally.
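One way local fuzzy matching might be sketched, assuming you already hold a reference vector of accepted names. This uses base R's `utils::adist()` (edit distance) rather than the `agrep`-based approach in Taxonstand. `fuzzy_match()` and its `max_distance` argument are hypothetical, and the reference names are illustrative only.

```r
# Hypothetical helper: for each query, return the closest reference name
# within `max_distance` edits, or NA when nothing is close enough.
fuzzy_match <- function(queries, reference, max_distance = 2) {
  # Rows are queries, columns are reference names.
  d <- utils::adist(queries, reference, ignore.case = TRUE)
  vapply(seq_along(queries), function(i) {
    j <- which.min(d[i, ])
    if (d[i, j] <= max_distance) reference[j] else NA_character_
  }, character(1))
}

reference <- c("Eucalyptus regnans", "Acacia abbatiana")
matched <- fuzzy_match(
  c("Eucalyptus regnant", "Acasia abbatiana", "Banksia"),
  reference
)
```

Corrected names could then be passed to the search functions, with the `NA` entries flagged for manual checking.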
I now have contact details for the folks behind the APC/APNI service, so can put you in contact or deliver questions there, as needed
Good news about getting contacts. I'll work on this soon and see if there's any questions I have
When asked whether we should link to APC, APNI or both, Anna Monro provided this description (pasted here with her permission):
it depends on what you're trying to achieve (sorry, isn't that always the answer?).
- In APNI we endeavour to capture all the names ever applied to the Australian flora in the botanical literature, for both native and naturalised plants. This includes things like phrase names, hybrid formulae, and names and designations that are not actually acceptable under the Code of botanical nomenclature (like illegitimate names and invalidly published designations). My general idea of APNI is that if a name has appeared in a botanical taxonomic work (e.g. a flora, a census, a checklist, a book on eucalypts) you could throw it at APNI and get a result, even if it was a typo or a temporary placeholder.
- APNI also endeavours to record all the significant published works in which the given name appeared. The main APNI entry for a name will record every place it was listed as an accepted name and the works in which that occurred. However, there are also cross-references built in, so if it was used as a synonym of another name or it was misapplied to another taxon, you'll see a work listed with the notation "synonym of: " or "misapplied to: ".
- APC uses the APNI data as a basis to build an accepted taxonomy of the APNI native and naturalised flora. The aim is theoretically to account for every name listed in APNI, whether that be as an accepted name, a synonym or a misapplication. So the vision with APC would be that you stick a name in and APC either indicates it's currently accepted or it points you at the currently accepted name (or names; it's not always 1:1).
- APC is largely complete, other than the Orchidaceae and a backlog of names published recently that are yet to be considered.
Anna can answer questions about overall diagnosis and usage. For more technical questions on the API, Anna has directed us to Anne Fuschs and her team
I can contact them both as needed.
For fuzzy search you can use the suggestions API on APNI and APC. It is meant for suggestions as you type and is case insensitive. It does not help with spelling mistakes. https://biodiversity.org.au/nsl/docs/main.html#suggestions-api-v1-0
We are here on github https://github.com/bio-org-au BTW so you can contact us there too and see what the API code actually does. :-)
@pmcneil regarding the results of this request https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124.json - I'm interested in pulling out data for each child of the target taxon, here's the first one as a list in R:
$displayHtml
[1] "<data><scientific><name data-id='165295'><scientific><name data-id='56859'><element>Acacia</element></name></scientific> <element>abbatiana</element> <authors><author data-id='1524' title='Pedley, L.'>Pedley</author></authors></name></scientific><name-status class=\"legitimate\">, legitimate</name-status> <citation><ref data-id='42942'><ref-section><author>CHAH</author> <year>(2006)</year>, <par-title><i>Australian Plant Census</i></par-title></ref-section></ref></citation></data>"
$elementLink
[1] "https://id.biodiversity.org.au/tree/51352295/51223378"
$nameLink
[1] "https://id.biodiversity.org.au/name/apni/165295"
$instanceLink
[1] "https://id.biodiversity.org.au/instance/apni/603763"
$excluded
[1] FALSE
$depth
[1] 9
$synonymsHtml
[1] "<synonyms><nom><scientific><name data-id='190638'><scientific><name data-id='103551'><element>Racosperma</element></name></scientific> <element>abbatianum</element> <authors>(<base data-id='1524' title='Pedley, L.'>Pedley</base>) <author data-id='1524' title='Pedley, L.'>Pedley</author></authors></name></scientific><name-status class=\"legitimate\">, legitimate</name-status> <year>(2003)</year> <type>nomenclatural synonym</type></nom><tax><scientific><name data-id='168777'><scientific><name data-id='56859'><element>Acacia</element></name></scientific> <element>sp. Mt Abbot (A.R.Bean 4873)</element></name></scientific><name-status class=\"[n/a]\">, [n/a]</name-status> <year>(1997)</year> <type>taxonomic synonym</type></tax></synonyms>"
Seems that I'd need to further parse those html strings to get names and other data out. Is there a different route or content type that I can request that has that data parsed already? The html in displayHtml doesn't seem to be organized in a way that I can figure out how to parse with xpath. If we look at the displayHtml from above
<html>
<body>
<data>
<scientific>
<name data-id="165295">
<scientific>
<name data-id="56859">
<element>Acacia</element>
</name>
</scientific>
<element>abbatiana</element>
<authors>
<author data-id="1524" title="Pedley, L.">Pedley</author>
</authors>
</name>
</scientific>
<name-status class="legitimate">, legitimate</name-status>
<citation>
<ref data-id="42942"><ref-section><author>CHAH</author><year>(2006)</year>, <par-title><i>Australian Plant Census</i></par-title></ref-section>
</ref>
</citation>
</data>
</body>
</html>
The first <element> is nested within its own <scientific> tag, but then the 2nd <element> is not nested within its own <scientific> tag. Maybe I'm missing something here?
This is linked data, so follow the links. The https://id.biodiversity.org.au/name/apni/165295 link will get you name data. If you ask for it in XML, JSON or just HTML you'll get that as a result. See https://biodiversity.org.au/nsl/docs/main.html#name for example. The display HTML is there to provide a quick way of displaying quite complex results. The embedded name HTML is marked up to a) make it parsable and b) make the display of the name in HTML configurable using CSS. Note the data-id attributes are ONLY for linking up name parts in Javascript etc. in browser, not to be stored as a reference to the object. ALWAYS use the ID (https://id....) as the reference. Once again, this is linked data.
On linked data, above you are using this 'https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124.json' which is fine for a question, but it is not the reference to the tree element that should be quoted or passed around, the element link is https://id.biodiversity.org.au/tree/51352295/51342774
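A minimal sketch of collecting those identifier links for each child, rather than parsing the display HTML. The field names (`nameLink`, `instanceLink`, `elementLink`) are taken from the child record shown earlier in the thread. `child_links()` is a hypothetical helper, and the mock record avoids any network call.

```r
# Hypothetical helper: collect identifier links for each child record.
# Missing fields become NA so every child yields exactly one row.
child_links <- function(children) {
  pick <- function(field) {
    vapply(children, function(ch) {
      v <- ch[[field]]
      if (is.null(v)) NA_character_ else as.character(v)
    }, character(1))
  }
  data.frame(
    name = pick("nameLink"),
    instance = pick("instanceLink"),
    element = pick("elementLink"),
    stringsAsFactors = FALSE
  )
}

# Mock child record copied from the output earlier in the thread.
children <- list(list(
  elementLink = "https://id.biodiversity.org.au/tree/51352295/51223378",
  nameLink = "https://id.biodiversity.org.au/name/apni/165295",
  instanceLink = "https://id.biodiversity.org.au/instance/apni/603763"
))
links <- child_links(children)
```

With a live response, the same helper would presumably be called as `child_links(x$treeElement$children)`, and each `name` link could then be followed to get parsed name data.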
Re the nesting of <scientific> or actually <name> elements: You're almost there, it is nested, i.e. there are two name parts in the name: Acacia is a scientific name in itself, and Acacia abbatiana is a name with two parts. Using xpath, name.element = abbatiana, name.scientific.name.element = Acacia - looks counter intuitive in xpath, but the name in question here is abbatiana, and it has a parent part Acacia. (hope that makes sense :-) )
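The nesting described here can be sketched with the xml2 package (assumed to be installed). The string is a shortened copy of the displayHtml shown earlier, and the XPath expressions are one possible reading of the structure, not an official parsing recipe.

```r
library(xml2)  # assumed to be installed

# Shortened copy of the displayHtml string from earlier in the thread.
display_html <- paste0(
  "<data><scientific><name data-id='165295'>",
  "<scientific><name data-id='56859'><element>Acacia</element></name></scientific>",
  " <element>abbatiana</element>",
  " <authors><author data-id='1524' title='Pedley, L.'>Pedley</author></authors>",
  "</name></scientific></data>"
)

doc <- read_xml(display_html)

# The outer <name> holds the epithet directly; the genus is the nested name part.
epithet <- xml_text(xml_find_first(doc, "/data/scientific/name/element"))
genus <- xml_text(xml_find_first(doc, "/data/scientific/name/scientific/name/element"))
author <- xml_text(xml_find_first(doc, "/data/scientific/name/authors/author"))

full_name <- paste(genus, epithet, author)
```

As noted above, the data-id attributes are only for in-browser linking, so the sketch reads element text and leaves them alone.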
Thanks @pmcneil for the explanations. I still don't quite grok all the different identifiers. Is there any documentation on the identifiers?
All the existing documentation is at https://biodiversity.org.au/nsl/docs/main.html
There's probably not much to know about identifiers. Just remember, everything that starts with https://id.biodiversity.org.au is an identifier, identifiers are a "black box" that identifies something or other, and you can find what it identifies by going to that URL. (and if you add .json to the URL, or pass application/json as the contentType on the request, you'll find out what is behind the identifier in json format).
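As a rough sketch of the ".json" convention just described: `id_json_url()` is a hypothetical helper, not part of any package, and the fetch is left commented out because it needs network access.

```r
# Hypothetical helper: turn an identifier into its JSON URL by appending ".json".
id_json_url <- function(id) {
  stopifnot(
    is.character(id),
    all(startsWith(id, "https://id.biodiversity.org.au/"))
  )
  # Leave the URL alone if it already ends in ".json".
  ifelse(endsWith(id, ".json"), id, paste0(id, ".json"))
}

url <- id_json_url("https://id.biodiversity.org.au/name/apni/165295")

# Fetching needs network access, so it is left commented out here:
# x <- jsonlite::read_json(url)
# str(x, max.level = 1)
```

Inspecting the result with `str()` is a reasonable first step, since the fields behind each kind of identifier differ.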
@chrisbitmead The identifiers refer to specific objects/things e.g. name, reference, author, "instance" etc.
The instance is in many ways a taxon, though the taxon link is to where an accepted instance sits in the accepted classification (APC). While we do have documentation it may not be adequate. I believe Anne is going to respond to your queries too, but I hope the above makes some sense?
Thanks @chrisbitmead and @pmcneil for the clarifications
I've not been able to work on this for a few weeks, i'll get back to this soon
to do:
- ~~is there pagination? https://github.com/bio-org-au/nsl-documentation/issues/1~~ there is no pagination
- [x] get_apni() - main user facing fxn in taxize, not sure what to use; right now thinking to use the suggest API because it can do fuzzy search and is fast to parse
- [x] add support to classification
- [x] add support to children
- [x] vectorize functions where possible
@dfalster get_apni() working now and all apni utility fxns vectorized.
children and classification not working yet.
problem with children i'm hitting is I don't see a way to get to the id needed for the page that has children from a name id, e.g., above you shared the link https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124 that has taxonomic children for Acacia. however, the name id for Acacia is https://biodiversity.org.au/nsl/services/rest/name/apni/56859/api/apni-format and I don't see how to get that 51311124 id programmatically to be able to get children. Any ideas?
@sckott Does this help?...
https://biodiversity.org.au/nsl/services/rest/name/apni/56859/api/apc.json
@chrisbitmead ah thanks, that will probably do it
Thanks for your continued work here @sckott !
@dfalster when you get a chance to try this out: children and classification are now done as well.
remotes::install_github("ropensci/taxize@australian")
library(taxize)
# see man files for each fxn
?`apni-search`
?apni_classification
?apni_children
?apni_family
?apni_id
?children
?classification
Hi @sckott. Cool! seems to be mostly working. I can confirm that get_apni, apni_classification, apni_children, apni_search all work well.
The only issue I encountered is that apni_family does not return sensible results. E.g. The following is a search for Eucalyptus regnans, id = 101747, which should return "Myrtaceae":
> apni_family(id = 101747)
[[1]]
[[1]]$name
[1] "regnans"
[[1]]$link
[1] "https://id.biodiversity.org.au/name/apni/101747"
[[1]]$instances
# A tibble: 30 x 7
type link pages name protologue citation auth_year
<chr> <chr> <chr> <chr> <lgl> <chr> <chr>
1 secondar… https://id.biodi… 181 <scientific><name data-id='54484'><eleme… FALSE Bailey, F.M. (1913), Comprehensive Catalo… F.M.Bailey, 1913
2 secondar… https://id.biodi… 54
The other BIG point to consider is that the Australian taxonomic system has two components: The Australian Plant Names Index (APNI) & the Australian Plant Census (APC). So far we have linked against the first, but ideally we would be able to query both. I'm not an expert on the distinction, but my understanding is that the APC contains all the information about currently accepted species, including whether a name is a synonym or not.
Thanks for having a look!
Okay, i'll
- [ ] have a look at family
- [ ] make a toggle for apc or apni
@dfalster hmm, do we need the family function? I don't think you asked for it as far as I can remember. I think I added it just cause the route is there, but you can easily get family with classification(101747, db = "apni"). Okay if I remove the apni_family function?
@dfalster For APC vs. APNI, it doesn't seem like a simple thing we can allow users to switch between. Let's look at the API routes used in the functions we have so far:
- apni_search: uses nsl/services/api/name/taxon-search/. docs say it's for APC only right now.
- apni_suggest: uses nsl/services/suggest/acceptableName/. docs have
  - apni-search - search APNI on full name as per the apni name search service,
  - apc-search - search APC on full name as per the search service

  whereas I'm using acceptableName (last part of route above). So looks like I could allow users to go between apni and apc for this fxn
- apni_acceptable_names: uses nsl/services/api/name/acceptable-name/ - docs don't mention whether this is for APC or APNI or both or what
- apni_classification: uses nsl/services/rest/name/apni/{id}/api/branch/ - docs say this route gets the APC branch, but there doesn't appear to be an APNI version of this
- apni_children: uses nsl/services/rest/name/apni/{id}/api/apc/ - grabs a url within that response, e.g., https://biodiversity.org.au/nsl/services/rest/taxon/apni/51311124, and uses that to get children - does that have an APC equivalent? I don't know
- apni_id: uses nsl/services/rest/name/apni/{id}/ - I don't see an APC equivalent for this one, do you?
@dfalster ☝🏽 any thoughts?
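To make the id-based routes in the list above easier to compare, here is a small sketch that fills in the `{id}` placeholder. `nsl_routes` and `nsl_url()` are hypothetical names, and appending ".json" to every route is an assumption based on the apc.json link shared earlier in the thread.

```r
# Route templates as listed above; {id} is a name id such as 56859.
nsl_base <- "https://biodiversity.org.au/nsl/services"
nsl_routes <- c(
  classification = "rest/name/apni/{id}/api/branch",
  children = "rest/name/apni/{id}/api/apc",
  id = "rest/name/apni/{id}"
)

# Hypothetical helper: fill in {id} and return a full URL.
nsl_url <- function(route, id, ext = ".json") {
  stopifnot(route %in% names(nsl_routes))
  path <- sub("{id}", as.character(id), nsl_routes[[route]], fixed = TRUE)
  paste0(nsl_base, "/", path, ext)
}

apc_url <- nsl_url("children", 56859)
```

Keeping the templates in one named vector would make it simpler to add an APC counterpart for a route later, if one turns out to exist.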