taxize
Missing children of Bacteria
Session Info
Session info ------------------------------------------------------------------
setting value
version R version 3.4.3 (2017-11-30)
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
tz America/Chicago
date 2018-02-03
Packages ----------------------------------------------------------------------
package * version date source
ape 5.0 2017-10-30 cran (@5.0)
assertthat 0.2.0 2017-04-11 CRAN (R 3.4.1)
base * 3.4.3 2017-11-30 local
bindr 0.1 2016-11-13 CRAN (R 3.4.1)
bindrcpp * 0.2 2017-06-17 CRAN (R 3.4.1)
bit 1.1-12 2014-04-09 CRAN (R 3.4.1)
bit64 0.9-7 2017-05-08 CRAN (R 3.4.1)
blob 1.1.0 2017-06-17 CRAN (R 3.4.1)
bold 0.5.0 2017-07-21 CRAN (R 3.4.2)
cli 1.0.0 2017-11-05 CRAN (R 3.4.3)
codetools 0.2-15 2016-10-05 CRAN (R 3.4.1)
colorout * 1.1-2 2017-09-23 Github (jalvesaq/colorout@020a14d)
commonmark 1.4 2017-09-01 CRAN (R 3.4.1)
compiler 3.4.3 2017-11-30 local
crayon 1.3.4 2017-09-16 CRAN (R 3.4.1)
crul 0.5.0 2018-01-22 cran (@0.5.0)
curl 3.1 2017-12-12 cran (@3.1)
data.table 1.10.4-3 2017-10-27 cran (@1.10.4-)
datasets * 3.4.3 2017-11-30 local
DBI 0.7 2017-06-18 CRAN (R 3.4.1)
dbplyr 1.2.0 2018-01-03 cran (@1.2.0)
devtools * 1.13.4 2017-11-09 CRAN (R 3.4.2)
digest 0.6.13 2017-12-14 CRAN (R 3.4.3)
dplyr * 0.7.4 2017-09-28 cran (@0.7.4)
foreach 1.4.4 2017-12-12 CRAN (R 3.4.3)
glue 1.2.0 2017-10-29 cran (@1.2.0)
graphics * 3.4.3 2017-11-30 local
grDevices * 3.4.3 2017-11-30 local
grid 3.4.3 2017-11-30 local
hms 0.4.0 2017-11-23 CRAN (R 3.4.2)
hoardr 0.2.0 2017-05-10 CRAN (R 3.4.2)
httr 1.3.1 2017-08-20 CRAN (R 3.4.1)
iterators 1.0.9 2017-12-12 CRAN (R 3.4.3)
jsonlite 1.5 2017-06-01 CRAN (R 3.4.1)
lattice 0.20-35 2017-03-25 CRAN (R 3.4.3)
magrittr * 1.5 2014-11-22 CRAN (R 3.4.1)
memoise 1.1.0 2017-04-21 CRAN (R 3.4.1)
methods * 3.4.3 2017-11-30 local
nlme 3.1-131 2017-02-06 CRAN (R 3.4.3)
parallel 3.4.3 2017-11-30 local
pillar 1.1.0 2018-01-14 cran (@1.1.0)
pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.1)
plyr 1.8.4 2016-06-08 CRAN (R 3.4.1)
pryr * 0.1.3 2017-10-30 cran (@0.1.3)
purrr 0.2.4 2017-10-18 CRAN (R 3.4.2)
R6 2.2.2 2017-06-17 CRAN (R 3.4.1)
rappdirs 0.3.1 2016-03-28 CRAN (R 3.4.2)
Rcpp 0.12.15 2018-01-20 cran (@0.12.15)
readr 1.1.1 2017-05-16 CRAN (R 3.4.1)
reshape 0.8.7 2017-08-06 CRAN (R 3.4.2)
reshape2 1.4.3 2017-12-11 cran (@1.4.3)
rlang 0.1.6 2017-12-21 cran (@0.1.6)
RMySQL 0.10.13 2017-08-14 CRAN (R 3.4.2)
roxygen2 6.0.1 2017-02-06 CRAN (R 3.4.2)
RPostgreSQL 0.6-2 2017-06-24 CRAN (R 3.4.2)
RSQLite 2.0 2017-06-19 CRAN (R 3.4.2)
stats * 3.4.3 2017-11-30 local
stringi 1.1.6 2017-11-17 CRAN (R 3.4.2)
stringr 1.2.0 2017-02-18 CRAN (R 3.4.1)
taxize * 0.9.1.9321 2018-02-03 Github (ropensci/taxize@319e03d)
taxizedb * 0.1.6 <NA> local
testthat * 2.0.0 2017-12-13 CRAN (R 3.4.3)
tibble 1.4.2 2018-01-22 cran (@1.4.2)
tidyr 0.7.2 2017-10-16 cran (@0.7.2)
tools 3.4.3 2017-11-30 local
triebeard 0.3.0 2016-08-04 CRAN (R 3.4.2)
urltools 1.7.0 2018-01-20 cran (@1.7.0)
utils * 3.4.3 2017-11-30 local
withr 2.1.1 2017-12-19 CRAN (R 3.4.3)
xml2 1.2.0 2018-01-24 cran (@1.2.0)
zoo 1.8-1 2018-01-08 CRAN (R 3.4.3)
The dev version of taxize produces the following:
> taxize::children(2, db='ncbi', ambiguous=FALSE)[[1]]
childtaxa_id childtaxa_name childtaxa_rank
1 508458 Synergistetes phylum
2 203691 Spirochaetes phylum
3 200940 Thermodesulfobacteria phylum
4 200938 Chrysiogenetes phylum
5 200930 Deferribacteres phylum
6 200918 Thermotogae phylum
7 200783 Aquificae phylum
8 74152 Elusimicrobia phylum
9 68297 Dictyoglomi phylum
10 67814 Caldiserica phylum
11 57723 Acidobacteria phylum
12 40117 Nitrospirae phylum
13 32066 Fusobacteria phylum
14 1224 Proteobacteria phylum
This is missing several taxa that are retrieved by taxizedb:
> taxizedb::children(2, db='ncbi', ambiguous=FALSE)[[1]]
childtaxa_id childtaxa_name childtaxa_rank
1 1936987 Balneolaeota phylum
2 1930617 Calditrichaeota phylum
3 1853220 Rhodothermaeota phylum
4 1802340 Nitrospinae/Tectomicrobia group no rank
5 1783272 Terrabacteria group no rank
6 1783270 FCB group no rank
7 1783257 PVC group no rank
8 508458 Synergistetes phylum
9 203691 Spirochaetes phylum
10 200940 Thermodesulfobacteria phylum
11 200938 Chrysiogenetes phylum
12 200930 Deferribacteres phylum
13 200918 Thermotogae phylum
14 200783 Aquificae phylum
15 74152 Elusimicrobia phylum
16 68297 Dictyoglomi phylum
17 67814 Caldiserica phylum
18 57723 Acidobacteria phylum
19 40117 Nitrospirae phylum
20 32066 Fusobacteria phylum
21 1224 Proteobacteria phylum
The taxizedb result also matches the taxa listed on NCBI Taxonomy.
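The gap between the two result sets above can be listed directly. This is a minimal sketch with the child IDs copied by hand from the two outputs; the variable names are illustrative and no API call is made.

```r
# Child taxon IDs returned by taxize::children(2, db = "ncbi") in the output above
taxize_ids <- c(508458, 203691, 200940, 200938, 200930, 200918, 200783,
                74152, 68297, 67814, 57723, 40117, 32066, 1224)

# Child taxon IDs returned by taxizedb::children(2, db = "ncbi") in the output above
taxizedb_ids <- c(1936987, 1930617, 1853220, 1802340, 1783272, 1783270,
                  1783257, taxize_ids)

# IDs present in the local database dump but absent from the API result
missing_from_taxize <- setdiff(taxizedb_ids, taxize_ids)

# IDs present in the API result but absent from the dump (none expected here)
missing_from_taxizedb <- setdiff(taxize_ids, taxizedb_ids)

print(missing_from_taxize)
```

In this sample the missing IDs are the three newer phyla and the four unranked "group" nodes, and nothing is returned by taxize that taxizedb lacks.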
i get the same thing, will look
@arendsee
so this is the HTTP request that gets made:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=taxonomy&db=taxonomy&id=2&term=Bacteria%5BNext%20Level%5D&RetMax=1000&RetStart=0&api_key=useyourkey
wonder if there's anything in that request that strikes you as off, something we could change that would bring it in line with what taxizedb gives
also @zachary-foster or @dwinter maybe you have a sense for why results are different from ENTREZ API vs. a dump of their database?
one thing that I wonder about is that the version of the database that ENTREZ is using could differ from what any user has on their disk if using taxizedb - one thing to note in docs somewhere at least
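To make the individual parameters of the request above easier to inspect and vary, it can be rebuilt from a named list. This is only a sketch: `build_eutils_url` is a hypothetical helper written for this illustration, not taxize's internal code, and `useyourkey` is the same placeholder used in the URL above.

```r
# Hypothetical helper (not part of taxize): assemble an E-utilities URL
# from an endpoint name and a named list of query parameters.
build_eutils_url <- function(endpoint, params) {
  base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
  # Percent-encode each value so "[", "]" and spaces are safe in a URL
  encoded <- vapply(
    params,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  query <- paste(names(encoded), encoded, sep = "=", collapse = "&")
  paste0(base, endpoint, "?", query)
}

url <- build_eutils_url(
  "elink.fcgi",
  list(
    dbfrom   = "taxonomy",
    db       = "taxonomy",
    id       = 2,
    term     = "Bacteria[Next Level]",
    RetMax   = 1000,
    RetStart = 0,
    api_key  = "useyourkey"  # placeholder, as in the request above
  )
)
print(url)
```

Changing a single entry in the list (for example the `term`) then gives a comparable request to try against the API.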
Hmm... not something I know much about, but I don't think it's an issue of versions. The browser is 'live', the FTP dumps are updated hourly and the eUtils database is updated daily.
I would guess there is some trick in exactly what to send to elink. Using esearch instead of elink, there is the special term NXLV for immediate descendants. This gets most of the ones missing from taxize:
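library(rentrez)
# search for immediate descendants of "Bacteria", keeping the hits on the NCBI history server
one_down <- entrez_search(db="taxonomy", term="Bacteria[NXLV]", use_history=TRUE)
# fetch summary records for all hits via the stored web history
summs <- entrez_summary(db="taxonomy", web_history=one_down$web_history)
# pull out name, rank and taxon ID, one row per taxon
t(extract_from_esummary(summs, c("scientificname", "rank", "taxid")))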
scientificname rank taxid
1936987 "Balneolaeota" "phylum" 1936987
1930617 "Calditrichaeota" "phylum" 1930617
1853220 "Rhodothermaeota" "phylum" 1853220
1802340 "Nitrospinae/Tectomicrobia group" "" 1802340
1783272 "Terrabacteria group" "" 1783272
1783270 "FCB group" "" 1783270
1783257 "PVC group" "" 1783257
629425 "Bacteria ferula" "species" 629425
629405 "Bacteria bahiensis" "species" 629405
629404 "Bacteria baculus" "species" 629404
629403 "Bacteria apolinari" "species" 629403
629401 "Bacteria ambigua" "species" 629401
629398 "Bacteria acuminatocercata" "species" 629398
629397 "Bacteria aborigena" "species" 629397
629396 "Bacteria abnormis" "species" 629396
508458 "Synergistetes" "phylum" 508458
203691 "Spirochaetes" "phylum" 203691
200940 "Thermodesulfobacteria" "phylum" 200940
200938 "Chrysiogenetes" "phylum" 200938
200930 "Deferribacteres" "phylum" 200930
200918 "Thermotogae" "phylum" 200918
200783 "Aquificae" "phylum" 200783
74152 "Elusimicrobia" "phylum" 74152
68297 "Dictyoglomi" "phylum" 68297
67814 "Caldiserica" "phylum" 67814
57723 "Acidobacteria" "phylum" 57723
48479 "environmental samples" "" 48479
40117 "Nitrospirae" "phylum" 40117
32066 "Fusobacteria" "phylum" 32066
2323 "unclassified Bacteria" "" 2323
1224 "Proteobacteria" "phylum" 1224
Not sure how helpful this is for the specific question, but it at least shows these taxa are accessible via eUtils.... :confused:
@sckott Hmm, nothing about the request seems off to me. Some of the missing phyla are fairly new, see https://www.ncbi.nlm.nih.gov/pubmed/27287844. I wonder if there is something screwy on the Entrez side? Stale cached values for children ("Next Level"), perhaps?
I am not sure either. Perhaps the term=Bacteria[Next Level] is filtering out some things that are associated with taxon ID 2, but not with "Bacteria" for some reason. Ideally, the term argument would not be needed, since we just want the child IDs for ID 2, regardless of the "term", but we were never able to get ENTREZ to do that.
By the way, the title of this issue sounds like an interesting science fiction novel.
thanks @dwinter @arendsee @zachary-foster
@dwinter your approach might work, though i'm not sure how we'd programmatically filter the results to get only the direct children. i guess we can consult our internal data.frame of ranks and their orders and only pick the direct descendant rank from the one queried? thoughts folks?
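A rough sketch of the rank-based filtering idea, for discussion only. Assumptions: `rank_order` is a hypothetical, abbreviated ordering and not taxize's internal rank table; `filter_direct_children` and its `keep_unranked` argument are made up for this illustration; the parent rank is supplied by the caller; and "direct descendant rank" is taken to mean the highest rank below the parent that actually occurs among the candidates. The sample rows are a subset of the esummary output above.

```r
# Hypothetical, abbreviated rank ordering from highest to lowest.
# taxize's own internal rank table is NOT reproduced here.
rank_order <- c("superkingdom", "kingdom", "phylum", "class",
                "order", "family", "genus", "species")

# Keep candidates at the nearest rank below the parent. Rows with an empty
# or unknown rank cannot be placed by rank alone, so they are either all
# kept or all dropped, depending on keep_unranked.
filter_direct_children <- function(candidates, parent_rank, keep_unranked = TRUE) {
  parent_pos <- match(parent_rank, rank_order)
  if (is.na(parent_pos)) stop("unknown parent rank: ", parent_rank)

  child_pos <- match(candidates$rank, rank_order)
  ranked <- !is.na(child_pos)
  below <- ranked & child_pos > parent_pos

  # Highest rank below the parent that is present among the candidates
  nearest <- if (any(below)) min(child_pos[below]) else NA_integer_
  keep <- below & child_pos == nearest

  if (keep_unranked) keep <- keep | !ranked
  candidates[keep, , drop = FALSE]
}

# A few rows taken from the esummary output above
candidates <- data.frame(
  scientificname = c("Balneolaeota", "PVC group", "Bacteria ferula", "Proteobacteria"),
  rank           = c("phylum", "", "species", "phylum"),
  taxid          = c(1936987, 1783257, 629425, 1224),
  stringsAsFactors = FALSE
)

# Assumes the queried taxon (ID 2) is passed in with rank "superkingdom"
kept <- filter_direct_children(candidates, parent_rank = "superkingdom")
print(kept$scientificname)
```

On this sample the species-level "Bacteria ferula" row is dropped while the phyla stay. The unranked rows are the weak point: keeping them retains the "group" nodes but would also retain entries such as "environmental samples", and dropping them loses the groups, so rank alone may not be enough.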