
create function to check that the taxa listed by tnrs_match_names are in the TOL

Open fmichonneau opened this issue 10 years ago • 14 comments

fmichonneau avatar Jun 20 '15 16:06 fmichonneau

Oh, seems I stumbled on this independently (OpenTreeOfLife/opentree#777)

dwinter avatar Oct 01 '15 22:10 dwinter

I had originally filed this for fossils, but the case you described in the issue is probably even more widespread and more useful to document. Maybe we ought to add this to the "how to use rotl?" FAQ vignette for now.

fmichonneau avatar Oct 01 '15 23:10 fmichonneau

Sounds like a good idea -- and happily I can use the work I'm doing now, including the workaround, as the example.

dwinter avatar Oct 01 '15 23:10 dwinter

Do you have a test case for the fossil taxa, @fmichonneau? I think the new TNRS "flags" column might deal with this.

dwinter avatar Apr 19 '16 16:04 dwinter

Unfortunately, it doesn't seem like it. Taking the example you initially reported to OTL:

tol_induced_subtree(unlist(ott_id(tnrs_match_names(c("Anas", "Gallus", "Anolis", "Geospiza")))))
Error: HTTP failure: 400
The following OTT ids were not found: [765185, 5295932]. 

but there is no indication in the taxonomy that these nodes might be missing from the tree (nothing in flags indicates it):

> taxonomy_taxon_info(765185)
$`765185`
$`765185`$is_suppressed
[1] FALSE

$`765185`$tax_sources
$`765185`$tax_sources[[1]]
[1] "ncbi:8835"

$`765185`$tax_sources[[2]]
[1] "worms:148788"

$`765185`$tax_sources[[3]]
[1] "gbif:2498056"

$`765185`$tax_sources[[4]]
[1] "irmng:1105530"


$`765185`$unique_name
[1] "Anas"

$`765185`$synonyms
$`765185`$synonyms[[1]]
[1] "Anus"

$`765185`$synonyms[[2]]
[1] "Anassus"

$`765185`$synonyms[[3]]
[1] "Spatula"

$`765185`$synonyms[[4]]
[1] "Aras"


$`765185`$name
[1] "Anas"

$`765185`$flags
list()

$`765185`$ott_id
[1] 765185

$`765185`$rank
[1] "genus"


attr(,"class")
[1] "taxon_info"

and the tol endpoint does not return very useful information either:

> tol_node_info(765185)
Error: HTTP failure: 400
Could not find any synthetic tree nodes corresponding to the OTT id provided (765185).

fmichonneau avatar Apr 19 '16 17:04 fmichonneau

If those taxa are not in the tree, it is because they are not monophyletic in the tree. OT used to return something like "invalid_ids" or "valid_but_not_in_tree" (not those names exactly, but you get the point), but not anymore (because the tree server no longer contains the entire taxonomy, and so cannot distinguish invalid-ids from valid-but-not-monophyletic ids).

josephwb avatar Apr 19 '16 17:04 josephwb

Would it be worth hacking something on our side then? When the OTT ids are not in the tree, we could check whether they are in the taxonomy, and use that to give a more informative error message.
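A rough sketch of what that check could look like, not something that exists in rotl. `describe_missing_ott_ids()` and its `taxon_lookup` argument are hypothetical names; the default assumes `rotl::taxonomy_taxon_info()` raises an error for an id the taxonomy does not know. Any error, including a network failure, is treated as "not in the taxonomy", so the messages are only a hint.

```r
# Hypothetical helper: for OTT ids that the tree endpoint rejected, report
# whether each id is at least known to the taxonomy.
# `taxon_lookup` is assumed to error for unknown ids (assumption, see above).
describe_missing_ott_ids <- function(ott_ids,
                                     taxon_lookup = rotl::taxonomy_taxon_info) {
  vapply(ott_ids, function(id) {
    label <- format(id, scientific = FALSE, trim = TRUE)
    # TRUE if the taxonomy lookup succeeds, FALSE on any error.
    known <- tryCatch({
      taxon_lookup(id)
      TRUE
    }, error = function(e) FALSE)
    if (known) {
      paste0(label, ": valid in the taxonomy, but absent from the synthetic tree")
    } else {
      paste0(label, ": not found in the taxonomy either (possibly an invalid id)")
    }
  }, character(1))
}
```

With the ids from the error above, this would be called as `describe_missing_ott_ids(c(765185, 5295932))`, at the cost of one taxonomy request per id.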

fmichonneau avatar Apr 19 '16 17:04 fmichonneau

Sounds good for now. I imagine OT will fix this, but no time soon.

josephwb avatar Apr 19 '16 17:04 josephwb

Hmm... looking at this a little more, I think it would be too hackish for us to do. Let's leave it as it is, and point to the relevant section of the vignette if needed.

fmichonneau avatar Apr 19 '16 18:04 fmichonneau

Hi, thanks a lot for developing this very nice R package ;)

I came across this error message after passing a list of 189 plant families to tnrs_match_names(). (Reproducible example below; sorry if it is too long.) tnrs_match_names() gave no warnings.

families <- c("Asteraceae", "Poaceae", "Rosaceae", "Fabaceae", 
    "Salicaceae", "Lamiaceae", "Betulaceae", "Apiaceae", 
    "Brassicaceae", "Fagaceae", "Cyperaceae", "Pinaceae", 
    "Ranunculaceae", "Ericaceae", "Caprifoliaceae", "Plantaginaceae", 
    "Caryophyllaceae", "Polygonaceae", "Boraginaceae", "Rubiaceae", 
    "Sapindaceae", "Malvaceae", "Scrophulariaceae", "Cactaceae", 
    "Amaranthaceae", "Oleaceae", "Euphorbiaceae", "Ulmaceae", 
    "Cupressaceae", "Juncaceae", "Campanulaceae", "Urticaceae", 
    "Geraniaceae", "Solanaceae", "Grossulariaceae", "Adoxaceae", 
    "Onagraceae", "Hypericaceae", "Orobanchaceae", "Rhamnaceae", 
    "Primulaceae", "Crassulaceae", "Cornaceae", "Cistaceae", 
    "Vitaceae", "Asparagaceae", "Violaceae", "Iridaceae", 
    "Papaveraceae", "Equisetaceae", "Gentianaceae", "Typhaceae", 
    "Amaryllidaceae", "Bromeliaceae", "Anacardiaceae", "Dennstaedtiaceae", 
    "Dryopteridaceae", "Lythraceae", "Elaeagnaceae", "Apocynaceae", 
    "Convolvulaceae", "Berberidaceae", "Celastraceae", "Orchidaceae", 
    "Resedaceae", "Cucurbitaceae", "Araliaceae", "Balsaminaceae", 
    "Cannabaceae", "Rutaceae", "Araceae", "Araucariaceae", 
    "Santalaceae", "Linaceae", "Platanaceae", "Saxifragaceae", 
    "Juglandaceae", "Liliaceae", "Haloragaceae", "Tamaricaceae", 
    "Athyriaceae", "Moraceae", "Taxaceae", "Arecaceae", "Aspleniaceae", 
    "Lauraceae", "Melanthiaceae", "Plumbaginaceae", "Tropaeolaceae", 
    "Alismataceae", "Buxaceae", "Hydrocharitaceae", "Zamiaceae", 
    "Menyanthaceae", "Aquifoliaceae", "Hydrangeaceae", "Myricaceae", 
    "Polypodiaceae", "Polytrichaceae", "Juncaginaceae", "Nymphaeaceae", 
    "Polemoniaceae", "Potamogetonaceae", "Sphagnaceae", "Tectariaceae", 
    "Verbenaceae", "Aizoaceae", "Cystopteridaceae", "Theaceae", 
    "Asphodelaceae", "Ephedraceae", "Myrtaceae", "Onocleaceae", 
    "Pteridaceae", "Thymelaeaceae", "Brachytheciaceae", "Capparaceae", 
    "Ceratophyllaceae", "Cleomaceae", "Cycadaceae", "Oxalidaceae", 
    "Acanthaceae", "Amblystegiaceae", "Hylocomiaceae", "Loranthaceae", 
    "Mniaceae", "Zygophyllaceae", "Bignoniaceae", "Blechnaceae", 
    "Butomaceae", "Dicranaceae", "Magnoliaceae", "Paeoniaceae", 
    "Piperaceae", "Polygalaceae", "Portulacaceae", "Strelitziaceae", 
    "Acoraceae", "Basellaceae", "Bryaceae", "Burseraceae", 
    "Commelinaceae", "Droseraceae", "Ebenaceae", "Lentibulariaceae", 
    "Musaceae", "Nephrolepidaceae", "Passifloraceae", "Plagiotheciaceae", 
    "Pontederiaceae", "Pottiaceae", "Ricciaceae", "Salviniaceae", 
    "Staphyleaceae", "Thelypteridaceae", "Zingiberaceae", 
    "Altingiaceae", "Anemiaceae", "Annonaceae", "Aristolochiaceae", 
    "Begoniaceae", "Cannaceae", "Climaciaceae", "Colchicaceae", 
    "Ditrichaceae", "Elatinaceae", "Gleicheniaceae", "Goodeniaceae", 
    "Grimmiaceae", "Hamamelidaceae", "Hedwigiaceae", "Heliconiaceae", 
    "Hypnaceae", "Loasaceae", "Malpighiaceae", "Marchantiaceae", 
    "Martyniaceae", "Nyctaginaceae", "Pedaliaceae", "Phrymaceae", 
    "Phytolaccaceae", "Pittosporaceae", "Proteaceae", "Ruppiaceae", 
    "Sapotaceae", "Schisandraceae", "Sciadopityaceae", "Styracaceae", 
    "Thuidiaceae")
resolved_names <- tnrs_match_names(families, context_name = "Land plants")
head(resolved_names)
#>   search_string unique_name approximate_match ott_id is_synonym flags
#> 1    asteraceae  Asteraceae             FALSE  46248      FALSE      
#> 2       poaceae     Poaceae             FALSE 508090      FALSE      
#> 3      rosaceae    Rosaceae             FALSE 208036      FALSE      
#> 4      fabaceae    Fabaceae             FALSE 560323      FALSE      
#> 5    salicaceae  Salicaceae             FALSE 530183      FALSE      
#> 6     lamiaceae   Lamiaceae             FALSE 544714      FALSE      
#>   number_matches
#> 1              1
#> 2              1
#> 3              1
#> 4              1
#> 5              1
#> 6              1
tr <- tol_induced_subtree(ott_ids = ott_id(resolved_names))
#> Error: HTTP failure: 400
#> The following OTT ids were not found: [147029, 473827, 23373, 17704, 601168, 873718, 614459, 367508, 461417, 79118, 99242, 405426, 427298, 195706, 195710, 548799, 5302233, 734781, 947452, 853767, 195711, 737324, 981715, 734790, 216633, 460575, 13254]. BadIdsExceptionopentree.plugins.BadIdsExceptionlist("opentree.plugins.tree_of_life_v3.doInducedSubtree(tree_of_life_v3.java:516)", "opentree.plugins.tree_of_life_v3.induced_subtree(tree_of_life_v3.java:400)", "java.lang.reflect.Method.invoke(Method.java:498)", "org.neo4j.server.plugins.PluginMethod.invoke(PluginMethod.java:57)", "org.neo4j.server.plugins.PluginManager.invoke(PluginManager.java:168)", "org.neo4j.server.rest.web.ExtensionService.invokeGraphDatabaseExtension(ExtensionService.java:300)", "org.neo4j.server.rest.web.ExtensionService.invokeGraphDatabaseExtension(ExtensionService.java:122)", 
#>     "java.lang.reflect.Method.invoke(Method.java:498)", "org.neo4j.server.rest.security.SecurityFilter.doFilter(SecurityFilter.java:112)")

paternogbc avatar Jul 28 '16 19:07 paternogbc

Hi @paternogbc.

This is an Open Tree issue, not a rotl issue. As mentioned above and on the linked page, if any ott_id is not matched, an error is returned. An ott_id is not matched either because 1) it is invalid or 2) it is not monophyletic in the synthetic tree (i.e. the tree does not pass through the taxon, so an induced tree cannot be returned).

Would you prefer that such taxa be skipped in the query, so that the returned tree contains as many of the query taxa as possible?

josephwb avatar Jul 28 '16 20:07 josephwb

Hi @josephwb,

thanks for your reply. Yes, I would say that skipping 'invalid' taxa and printing a more specific/detailed warning about which taxa were dropped, and why, would be very useful. Perhaps including a short note in the documentation explaining the issue might also help.

paternogbc avatar Jul 28 '16 21:07 paternogbc

@josephwb correct me if I'm wrong, but I think last time I looked into it, it was not possible to check a priori whether an ott_id was present in the synthetic tree, and so it's not possible to warn the user until it fails.

On the rotl side, I guess we could wrap the call with try(), and if it fails retrieve the missing ott_ids from the error message, remove them from the query, and ask for the tree without them. A little clunky but maybe less surprising to users.
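The try-and-retry idea could be sketched roughly as follows. This is only an illustration: `parse_missing_ott_ids()` and `induced_subtree_skip_missing()` are hypothetical names, the parsing assumes the error message keeps the "were not found: [...]" wording shown earlier in this thread, and `ott_ids` is assumed to be a plain numeric vector (e.g. from `unlist(ott_id(...))`).

```r
# Extract the OTT ids listed in an Open Tree "not found" error message.
# Assumes the message contains "not found: [id, id, ...]" as in the errors
# quoted in this thread; returns integer(0) if that pattern is absent.
parse_missing_ott_ids <- function(msg) {
  m <- regmatches(msg, regexpr("not found: \\[[0-9, ]+\\]", msg))
  if (length(m) == 0) return(integer(0))
  as.integer(regmatches(m, gregexpr("[0-9]+", m))[[1]])
}

# Hypothetical wrapper: try the query, and if the server rejects some ids,
# warn about them and retry once without them.
# `fetch` defaults to rotl::tol_induced_subtree; it is an argument only so
# the logic can be exercised without network access.
induced_subtree_skip_missing <- function(ott_ids,
                                         fetch = rotl::tol_induced_subtree) {
  tryCatch(
    fetch(ott_ids = ott_ids),
    error = function(e) {
      missing_ids <- parse_missing_ott_ids(conditionMessage(e))
      # Unrelated failure (no ids in the message): re-raise it unchanged.
      if (length(missing_ids) == 0) stop(e)
      warning("Dropping OTT ids not found in the synthetic tree: ",
              paste(missing_ids, collapse = ", "), call. = FALSE)
      fetch(ott_ids = setdiff(ott_ids, missing_ids))
    }
  )
}
```

The retry happens only once, so an error from the second request is passed on to the user as usual.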

fmichonneau avatar Jul 28 '16 22:07 fmichonneau

@fmichonneau Individual ott_ids can be queried using node_info. This could be very slow and tedious, but doable.
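As a sketch of that per-id check (slow, since it makes one request per id): `ott_ids_in_tree()` is a hypothetical name, and the default `lookup` assumes `rotl::tol_node_info()` errors for ids with no node in the synthetic tree, as in the example earlier in this thread. Any other error, such as a network failure, is also counted as "not in tree" here.

```r
# Hypothetical helper: query each OTT id individually and record whether the
# lookup succeeded. Returns a logical vector named by OTT id.
ott_ids_in_tree <- function(ott_ids, lookup = rotl::tol_node_info) {
  found <- vapply(ott_ids, function(id) {
    # TRUE if the node lookup succeeds, FALSE on any error.
    tryCatch({
      lookup(id)
      TRUE
    }, error = function(e) FALSE)
  }, logical(1))
  names(found) <- format(ott_ids, scientific = FALSE, trim = TRUE)
  found
}
```

The ids to keep would then be `ott_ids[ott_ids_in_tree(ott_ids)]`, which can be passed on to the induced-subtree query.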

However, it looks like this may be fixed soon. Probably better to have it fixed at Open Tree than hack something together here.

josephwb avatar Jul 29 '16 01:07 josephwb