curatedMetagenomicData

Difference in species names between Metaphlan and Pathway tables

Open microsud opened this issue 1 year ago • 1 comments

Dear Developers, When comparing the species-pathway tables and the Metaphlan species-abundance tables, I get different species names. Apart from differences like CAG_XXX versus CAG:XXX, the Lactobacillus naming differs. Below is the complete reprex code. I am not sure whether the issue lies in the way I downloaded the two data types or is inherent to the way the data were generated for the package. Any help is highly appreciated. Thanks in advance for your time.

suppressPackageStartupMessages({
  # CRAN
  library(tidyverse)
  library(data.table)
  library(ggplot2)
  library(stringr)
  # BioC
  library(curatedMetagenomicData)

  library(reprex)
})
# Get Data
date()
#> [1] "Sat Jul 30 10:13:37 2022"

tse.bug <- sampleMetadata |>
  filter(study_name %in% c("LifeLinesDeep_2016")) |>
  filter(body_site == "stool") |>
  returnSamples("relative_abundance", rownames = "short")
#> snapshotDate(): 2022-04-26
#> 
#> $`2021-03-31.LifeLinesDeep_2016.relative_abundance`
#> dropping rows without rowTree matches:
#>   k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Atopobiaceae|g__Olsenella|s__Olsenella_profusa
#>   k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Collinsella|s__Collinsella_stercoris
#>   k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Enorma|s__[Collinsella]_massiliensis
#>   k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Bacillales_unclassified|g__Gemella|s__Gemella_bergeri
#>   k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Carnobacteriaceae|g__Granulicatella|s__Granulicatella_elegans
#>   k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus|s__Ruminococcus_champanellensis
#>   k__Bacteria|p__Firmicutes|c__Erysipelotrichia|o__Erysipelotrichales|f__Erysipelotrichaceae|g__Bulleidia|s__Bulleidia_extructa
#>   k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|f__Sutterellaceae|g__Sutterella|s__Sutterella_parvirubra
#>   k__Bacteria|p__Synergistetes|c__Synergistia|o__Synergistales|f__Synergistaceae|g__Cloacibacillus|s__Cloacibacillus_evryensis
tse.bug
#> class: TreeSummarizedExperiment 
#> dim: 637 1135 
#> metadata(1): agglomerated_by_rank
#> assays(1): relative_abundance
#> rownames(637): Bifidobacterium angulatum Bifidobacterium longum ...
#>   Alistipes sp. CAG:831 Lactobacillus iners
#> rowData names(7): superkingdom phylum ... genus species
#> colnames(1135): EGAR00001420100_9002000001328080LL
#>   EGAR00001420101_2009000001576810 ... EGAR00001420925_9002000001461142
#>   EGAR00001420926_9002000001461438
#> colData names(135): study_name subject_id ... ALT eGFR
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: a LinkDataFrame (637 rows)
#> rowTree: 1 phylo tree(s) (10430 leaves)
#> colLinks: NULL
#> colTree: NULL

tse.pwy <- curatedMetagenomicData("LifeLinesDeep_20.+.pathway_abundance", dryrun = FALSE) |>
  mergeData()
#> snapshotDate(): 2022-04-26
tse.pwy
#> class: SummarizedExperiment 
#> dim: 23085 1135 
#> metadata(0):
#> assays(1): pathway_abundance
#> rownames(23085): UNMAPPED UNINTEGRATED ... PWY-7332: superpathway of
#>   UDP-N-acetylglucosamine-derived O-antigen building blocks
#>   biosynthesis|g__Eubacterium.s__Eubacterium_callanderi PWY-7413:
#>   dTDP-6-deoxy-&alpha;-D-allose
#>   biosynthesis|g__Eubacterium.s__Eubacterium_callanderi
#> rowData names(0):
#> colnames(1135): EGAR00001420100_9002000001328080LL
#>   EGAR00001420101_2009000001576810 ... EGAR00001421233_9002000001589917
#>   EGAR00001421234_9005000001577950
#> colData names(21): study_name subject_id ... BMI population

# Bugs
bug.abund <- assay(tse.bug) |>
  as.matrix() |>
  as.data.frame()
#bug.abund[1:4,1:5]
bug.abund$microbe <- rownames(bug.abund)
#bug.abund$microbe[1:3]


# PWY
tse.pwy
#> class: SummarizedExperiment 
#> dim: 23085 1135 
#> metadata(0):
#> assays(1): pathway_abundance
#> rownames(23085): UNMAPPED UNINTEGRATED ... PWY-7332: superpathway of
#>   UDP-N-acetylglucosamine-derived O-antigen building blocks
#>   biosynthesis|g__Eubacterium.s__Eubacterium_callanderi PWY-7413:
#>   dTDP-6-deoxy-&alpha;-D-allose
#>   biosynthesis|g__Eubacterium.s__Eubacterium_callanderi
#> rowData names(0):
#> colnames(1135): EGAR00001420100_9002000001328080LL
#>   EGAR00001420101_2009000001576810 ... EGAR00001421233_9002000001589917
#>   EGAR00001421234_9005000001577950
#> colData names(21): study_name subject_id ... BMI population
pwy.abund <- assay(tse.pwy) |>
  as.matrix() |>
  as.data.frame()
pwy.abund$feature_microbe <- rownames(pwy.abund)
#pwy.abund$feature_microbe[1:5]
pwy.abund <- pwy.abund |> as.data.table()
#pwy.abund[1:3, 1:5]
pwy.abund <- pwy.abund[, c("feature", "microbe") := tstrsplit(feature_microbe, "\\|", fixed=FALSE)]
#pwy.abund$microbe[1:6]
# unique(pwy.abund$microbe)

# Drop the special rows; the row names printed above are UNMAPPED and UNINTEGRATED
pwy.abund <- pwy.abund[feature != "UNMAPPED"]
pwy.abund <- pwy.abund[feature != "UNINTEGRATED"]
#unique(pwy.abund$microbe)
pwy.abund$microbe <- gsub(".*\\.s__","",pwy.abund$microbe)
unique(pwy.abund$microbe)[1:10]
#>  [1] NA                              "unclassified"                 
#>  [3] "Bifidobacterium_angulatum"     "Blautia_wexlerae"             
#>  [5] "Blautia_obeum"                 "Anaerostipes_hadrus"          
#>  [7] "Bifidobacterium_longum"        "Bifidobacterium_longum_CAG_69"
#>  [9] "Ruminococcus_torques"          "Eubacterium_rectale"
unique(bug.abund$microbe)[1:10]
#>  [1] "Bifidobacterium angulatum"       "Bifidobacterium longum"         
#>  [3] "Collinsella aerofaciens"         "Ruminococcus bromii"            
#>  [5] "Bifidobacterium bifidum"         "Dorea longicatena"              
#>  [7] "Eubacterium sp. CAG:180"         "Anaerostipes hadrus"            
#>  [9] "Fusicatenibacter saccharivorans" "Faecalibacterium prausnitzii"

#setdiff(gsub("_"," ",pwy.abund$microbe),bug.abund$microbe)
length(setdiff(gsub("_"," ",pwy.abund$microbe),bug.abund$microbe))
#> [1] 214
#unique(pwy.abund$microbe)

length(setdiff(pwy.abund$microbe, gsub(" ","_",bug.abund$microbe)))
#> [1] 214

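# Some inconsistency in the naming of taxa in the bugs table: replace " ", "sp." and "CAG:"
bug.abund$microbe <- gsub(" ","_",bug.abund$microbe)
# NOTE: "sp." is treated as a regex here, so "." matches any character and names such as
# "crispatus" or "dispar" are also shortened; gsub("sp.", "sp", x, fixed = TRUE) would avoid this
bug.abund$microbe <- gsub("sp.","sp",bug.abund$microbe)
bug.abund$microbe <- gsub("CAG:","CAG_",bug.abund$microbe)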
length(setdiff(pwy.abund$microbe, bug.abund$microbe))
#> [1] 107
# DT::datatable(as.data.frame(pwy.abund$microbe))
# DT::datatable(as.data.frame(bug.abund$microbe))

# Inconsistency in Lactobacillus naming between new and old taxonomies
# Pathway abundance profiles
pwy.bugs.df <- as.data.frame(unique(pwy.abund$microbe))
colnames(pwy.bugs.df)[1] <- "PWY_BUG_ID"
pwy.bugs.df |>
  dplyr::filter(str_detect(PWY_BUG_ID, "actobaci"))
#>                          PWY_BUG_ID
#> 1         Lactobacillus_delbrueckii
#> 2           Lactobacillus_fermentum
#> 3             Lactobacillus_mucosae
#> 4     Lactobacillus_ruminis_CAG_367
#> 5             Lactobacillus_ruminis
#> 6           Lactobacillus_plantarum
#> 7               Lactobacillus_sakei
#> 8           Lactobacillus_johnsonii
#> 9             Lactobacillus_reuteri
#> 10        Lactobacillus_paragasseri
#> 11               Lactobacillus_oris
#> 12         Lactobacillus_kalixensis
#> 13           Lactobacillus_animalis
#> 14           Lactobacillus_curvatus
#> 15       Lactobacillus_parabuchneri
#> 16        Lactobacillus_acidophilus
#> 17          Lactobacillus_rhamnosus
#> 18         Lactobacillus_ultunensis
#> 19            Lactobacillus_gasseri
#> 20         Lactobacillus_salivarius
#> 21         Lactobacillus_fuchuensis
#> 22         Lactobacillus_amylovorus
#> 23 Lactobacillus_amylovorus_CAG_719
#> 24          Lactobacillus_gastricus
#> 25          Lactobacillus_vaginalis
#> 26          Lactobacillus_crispatus
#> 27              Lactobacillus_antri

# Metaphlan profiles
bug.abund.df <- as.data.frame(unique(bug.abund$microbe))
colnames(bug.abund.df)[1] <- "MPA_BUG_ID"

bug.abund.df |>
  dplyr::filter(str_detect(MPA_BUG_ID, "actobaci"))
#>                              MPA_BUG_ID
#> 1             Lactobacillus_delbrueckii
#> 2         Limosilactobacillus_fermentum
#> 3             Lactobacillus_paragasseri
#> 4                 Lactobacillus_gasseri
#> 5         Limosilactobacillus_vaginalis
#> 6              Limosilactobacillus_oris
#> 7                 Lactobacillus_rogosae
#> 8           Limosilactobacillus_mucosae
#> 9             Lactobacillus_acidophilus
#> 10              Latilactobacillus_sakei
#> 11              Lactobacillus_johnsonii
#> 12               Lactobacillus_crisptus
#> 13           Latilactobacillus_curvatus
#> 14            Ligilactobacillus_ruminis
#> 15         Ligilactobacillus_saerimneri
#> 16             Lactobacillus_kalixensis
#> 17         Ligilactobacillus_salivarius
#> 18        Limosilactobacillus_gastricus
#> 19          Limosilactobacillus_reuteri
#> 20           Ligilactobacillus_animalis
#> 21      Lentilactobacillus_parabuchneri
#> 22        Ligilactobacillus_acidipiscis
#> 23         Latilactobacillus_fuchuensis
#> 24             Lactobacillus_ultunensis
#> 25        Companilactobacillus_nodensis
#> 26 Fructilactobacillus_sanfranciscensis
#> 27             Lactobacillus_amylovorus
#> 28            Lentilactobacillus_kefiri
#> 29     Fructilactobacillus_fructivorans
#> 30   Companilactobacillus_versmoldensis
#> 31            Limosilactobacillus_antri
#> 32          Lentilactobacillus_buchneri
#> 33      Companilactobacillus_farciminis
#> 34        Lentilactobacillus_otakiensis
#> 35             Levilactobacillus_brevis
#> 36                  Lactobacillus_iners

# Plus Veillonella_dispar spelling different in abundance and pwy tables
# In abundance table it is Veillonella_dispr
# Some examples
# Limosilactobacillus_gastricus in Bugs table
# Lactobacillus_gastricus in PWY table
# Lactobacillus_amylovorus_CAG_719 in PWY
# Lactobacillus_amylovorus in Bugs *could be CAG diff?

# Lactobacillus_fuchuensis in PWY
# Latilactobacillus_fuchuensis in Bugs

# Lactobacillus_rhamnosus in PWY
# Lacticaseibacillus_rhamnosus in Bugs

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       Europe/Berlin
#>  date     2022-07-30
#>  pandoc   2.17.1.1 @ C:/Program Files/RStudio/bin/quarto/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package                  * version  date (UTC) lib source
#>  AnnotationDbi              1.58.0   2022-04-26 [1] Bioconductor
#>  AnnotationHub              3.4.0    2022-04-26 [1] Bioconductor
#>  ape                        5.6-2    2022-03-02 [1] CRAN (R 4.2.0)
#>  assertthat                 0.2.1    2019-03-21 [1] CRAN (R 4.2.0)
#>  backports                  1.4.1    2021-12-13 [1] CRAN (R 4.2.0)
#>  beachmat                   2.12.0   2022-04-26 [1] Bioconductor
#>  beeswarm                   0.4.0    2021-06-01 [1] CRAN (R 4.2.0)
#>  Biobase                  * 2.56.0   2022-04-26 [1] Bioconductor
#>  BiocFileCache              2.4.0    2022-04-26 [1] Bioconductor
#>  BiocGenerics             * 0.42.0   2022-04-26 [1] Bioconductor
#>  BiocManager                1.30.18  2022-05-18 [1] CRAN (R 4.2.0)
#>  BiocNeighbors              1.14.0   2022-04-26 [1] Bioconductor
#>  BiocParallel               1.30.3   2022-06-07 [1] Bioconductor
#>  BiocSingular               1.12.0   2022-04-26 [1] Bioconductor
#>  BiocVersion                3.15.2   2022-03-29 [1] Bioconductor
#>  Biostrings               * 2.64.0   2022-04-26 [1] Bioconductor
#>  bit                        4.0.4    2020-08-04 [1] CRAN (R 4.2.0)
#>  bit64                      4.0.5    2020-08-30 [1] CRAN (R 4.2.0)
#>  bitops                     1.0-7    2021-04-24 [1] CRAN (R 4.2.0)
#>  blob                       1.2.3    2022-04-10 [1] CRAN (R 4.2.0)
#>  brio                       1.1.3    2021-11-30 [1] CRAN (R 4.2.0)
#>  broom                      0.8.0    2022-04-13 [1] CRAN (R 4.2.0)
#>  cachem                     1.0.6    2021-08-19 [1] CRAN (R 4.2.0)
#>  callr                      3.7.0    2021-04-20 [1] CRAN (R 4.2.0)
#>  cellranger                 1.1.0    2016-07-27 [1] CRAN (R 4.2.0)
#>  cli                        3.3.0    2022-04-25 [1] CRAN (R 4.2.0)
#>  cluster                    2.1.3    2022-03-28 [2] CRAN (R 4.2.1)
#>  codetools                  0.2-18   2020-11-04 [2] CRAN (R 4.2.1)
#>  colorspace                 2.0-3    2022-02-21 [1] CRAN (R 4.2.0)
#>  crayon                     1.5.1    2022-03-26 [1] CRAN (R 4.2.0)
#>  curatedMetagenomicData   * 3.4.2    2022-05-19 [1] Bioconductor
#>  curl                       4.3.2    2021-06-23 [1] CRAN (R 4.2.0)
#>  data.table               * 1.14.2   2021-09-27 [1] CRAN (R 4.2.0)
#>  DBI                        1.1.3    2022-06-18 [1] CRAN (R 4.2.0)
#>  dbplyr                     2.2.1    2022-06-27 [1] CRAN (R 4.2.1)
#>  DECIPHER                   2.24.0   2022-04-26 [1] Bioconductor
#>  decontam                   1.16.0   2022-04-26 [1] Bioconductor
#>  DelayedArray               0.22.0   2022-04-26 [1] Bioconductor
#>  DelayedMatrixStats         1.18.0   2022-04-26 [1] Bioconductor
#>  desc                       1.4.1    2022-03-06 [1] CRAN (R 4.2.0)
#>  devtools                   2.4.3    2021-11-30 [1] CRAN (R 4.2.0)
#>  digest                     0.6.29   2021-12-01 [1] CRAN (R 4.2.0)
#>  DirichletMultinomial       1.38.0   2022-04-26 [1] Bioconductor
#>  dplyr                    * 1.0.9    2022-04-28 [1] CRAN (R 4.2.0)
#>  ellipsis                   0.3.2    2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate                   0.15     2022-02-18 [1] CRAN (R 4.2.0)
#>  ExperimentHub              2.4.0    2022-04-26 [1] Bioconductor
#>  fansi                      1.0.3    2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap                    1.1.0    2021-01-25 [1] CRAN (R 4.2.0)
#>  filelock                   1.0.2    2018-10-05 [1] CRAN (R 4.2.0)
#>  forcats                  * 0.5.1    2021-01-27 [1] CRAN (R 4.2.0)
#>  fs                         1.5.2    2021-12-08 [1] CRAN (R 4.2.0)
#>  generics                   0.1.2    2022-01-31 [1] CRAN (R 4.2.0)
#>  GenomeInfoDb             * 1.32.2   2022-05-15 [1] Bioconductor
#>  GenomeInfoDbData           1.2.8    2022-06-11 [1] Bioconductor
#>  GenomicRanges            * 1.48.0   2022-04-26 [1] Bioconductor
#>  ggbeeswarm                 0.6.0    2017-08-07 [1] CRAN (R 4.2.0)
#>  ggplot2                  * 3.3.6    2022-05-03 [1] CRAN (R 4.2.0)
#>  ggrepel                    0.9.1    2021-01-15 [1] CRAN (R 4.2.0)
#>  glue                       1.6.2    2022-02-24 [1] CRAN (R 4.2.0)
#>  gridExtra                  2.3      2017-09-09 [1] CRAN (R 4.2.0)
#>  gtable                     0.3.0    2019-03-25 [1] CRAN (R 4.2.0)
#>  haven                      2.5.0    2022-04-15 [1] CRAN (R 4.2.0)
#>  highr                      0.9      2021-04-16 [1] CRAN (R 4.2.0)
#>  hms                        1.1.1    2021-09-26 [1] CRAN (R 4.2.0)
#>  htmltools                  0.5.2    2021-08-25 [1] CRAN (R 4.2.0)
#>  httpuv                     1.6.5    2022-01-05 [1] CRAN (R 4.2.0)
#>  httr                       1.4.3    2022-05-04 [1] CRAN (R 4.2.0)
#>  interactiveDisplayBase     1.34.0   2022-04-26 [1] Bioconductor
#>  IRanges                  * 2.30.0   2022-04-26 [1] Bioconductor
#>  irlba                      2.3.5    2021-12-06 [1] CRAN (R 4.2.0)
#>  jsonlite                   1.8.0    2022-02-22 [1] CRAN (R 4.2.0)
#>  KEGGREST                   1.36.2   2022-06-09 [1] Bioconductor
#>  knitr                      1.39     2022-04-26 [1] CRAN (R 4.2.0)
#>  later                      1.3.0    2021-08-18 [1] CRAN (R 4.2.0)
#>  lattice                    0.20-45  2021-09-22 [2] CRAN (R 4.2.1)
#>  lazyeval                   0.2.2    2019-03-15 [1] CRAN (R 4.2.0)
#>  lifecycle                  1.0.1    2021-09-24 [1] CRAN (R 4.2.0)
#>  lubridate                  1.8.0    2021-10-07 [1] CRAN (R 4.2.0)
#>  magrittr                   2.0.3    2022-03-30 [1] CRAN (R 4.2.0)
#>  MASS                       7.3-57   2022-04-22 [2] CRAN (R 4.2.1)
#>  Matrix                     1.4-1    2022-03-23 [2] CRAN (R 4.2.1)
#>  MatrixGenerics           * 1.8.1    2022-06-26 [1] Bioconductor
#>  matrixStats              * 0.62.0   2022-04-19 [1] CRAN (R 4.2.0)
#>  memoise                    2.0.1    2021-11-26 [1] CRAN (R 4.2.0)
#>  mgcv                       1.8-40   2022-03-29 [2] CRAN (R 4.2.1)
#>  mia                        1.4.0    2022-04-26 [1] Bioconductor
#>  mime                       0.12     2021-09-28 [1] CRAN (R 4.2.0)
#>  modelr                     0.1.8    2020-05-19 [1] CRAN (R 4.2.0)
#>  MultiAssayExperiment       1.22.0   2022-04-26 [1] Bioconductor
#>  munsell                    0.5.0    2018-06-12 [1] CRAN (R 4.2.0)
#>  nlme                       3.1-157  2022-03-25 [2] CRAN (R 4.2.1)
#>  permute                    0.9-7    2022-01-27 [1] CRAN (R 4.2.0)
#>  pillar                     1.7.0    2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgbuild                   1.3.1    2021-12-20 [1] CRAN (R 4.2.0)
#>  pkgconfig                  2.0.3    2019-09-22 [1] CRAN (R 4.2.0)
#>  pkgload                    1.2.4    2021-11-30 [1] CRAN (R 4.2.0)
#>  plyr                       1.8.7    2022-03-24 [1] CRAN (R 4.2.0)
#>  png                        0.1-7    2013-12-03 [1] CRAN (R 4.2.0)
#>  prettyunits                1.1.1    2020-01-24 [1] CRAN (R 4.2.0)
#>  processx                   3.5.3    2022-03-25 [1] CRAN (R 4.2.0)
#>  promises                   1.2.0.1  2021-02-11 [1] CRAN (R 4.2.0)
#>  ps                         1.7.0    2022-04-23 [1] CRAN (R 4.2.0)
#>  purrr                    * 0.3.4    2020-04-17 [1] CRAN (R 4.2.0)
#>  R.cache                    0.15.0   2021-04-30 [1] CRAN (R 4.2.0)
#>  R.methodsS3                1.8.2    2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo                       1.25.0   2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils                    2.12.0   2022-06-28 [1] CRAN (R 4.2.1)
#>  R6                         2.5.1    2021-08-19 [1] CRAN (R 4.2.0)
#>  rappdirs                   0.3.3    2021-01-31 [1] CRAN (R 4.2.0)
#>  Rcpp                       1.0.8.3  2022-03-17 [1] CRAN (R 4.2.0)
#>  RCurl                      1.98-1.6 2022-02-08 [1] CRAN (R 4.2.0)
#>  readr                    * 2.1.2    2022-01-30 [1] CRAN (R 4.2.0)
#>  readxl                     1.4.0    2022-03-28 [1] CRAN (R 4.2.0)
#>  remotes                    2.4.2    2021-11-30 [1] CRAN (R 4.2.0)
#>  reprex                   * 2.0.1    2021-08-05 [1] CRAN (R 4.2.0)
#>  reshape2                   1.4.4    2020-04-09 [1] CRAN (R 4.2.0)
#>  rlang                      1.0.2    2022-03-04 [1] CRAN (R 4.2.0)
#>  rmarkdown                  2.14     2022-04-25 [1] CRAN (R 4.2.0)
#>  rprojroot                  2.0.3    2022-04-02 [1] CRAN (R 4.2.0)
#>  RSQLite                    2.2.14   2022-05-07 [1] CRAN (R 4.2.0)
#>  rstudioapi                 0.13     2020-11-12 [1] CRAN (R 4.2.0)
#>  rsvd                       1.0.5    2021-04-16 [1] CRAN (R 4.2.0)
#>  rvest                      1.0.2    2021-10-16 [1] CRAN (R 4.2.0)
#>  S4Vectors                * 0.34.0   2022-04-26 [1] Bioconductor
#>  ScaledMatrix               1.4.0    2022-04-26 [1] Bioconductor
#>  scales                     1.2.0    2022-04-13 [1] CRAN (R 4.2.0)
#>  scater                     1.24.0   2022-04-26 [1] Bioconductor
#>  scuttle                    1.6.2    2022-05-15 [1] Bioconductor
#>  sessioninfo                1.2.2    2021-12-06 [1] CRAN (R 4.2.0)
#>  shiny                      1.7.1    2021-10-02 [1] CRAN (R 4.2.0)
#>  SingleCellExperiment     * 1.18.0   2022-04-26 [1] Bioconductor
#>  sparseMatrixStats          1.8.0    2022-04-26 [1] Bioconductor
#>  stringi                    1.7.6    2021-11-29 [1] CRAN (R 4.2.0)
#>  stringr                  * 1.4.0    2019-02-10 [1] CRAN (R 4.2.0)
#>  styler                     1.7.0    2022-03-13 [1] CRAN (R 4.2.0)
#>  SummarizedExperiment     * 1.26.1   2022-04-29 [1] Bioconductor
#>  testthat                   3.1.4    2022-04-26 [1] CRAN (R 4.2.0)
#>  tibble                   * 3.1.7    2022-05-03 [1] CRAN (R 4.2.0)
#>  tidyr                    * 1.2.0    2022-02-01 [1] CRAN (R 4.2.0)
#>  tidyselect                 1.1.2    2022-02-21 [1] CRAN (R 4.2.0)
#>  tidytree                   0.3.9    2022-03-04 [1] CRAN (R 4.2.0)
#>  tidyverse                * 1.3.1    2021-04-15 [1] CRAN (R 4.2.0)
#>  treeio                     1.20.0   2022-04-26 [1] Bioconductor
#>  TreeSummarizedExperiment * 2.4.0    2022-04-26 [1] Bioconductor
#>  tzdb                       0.3.0    2022-03-28 [1] CRAN (R 4.2.0)
#>  usethis                    2.1.6    2022-05-25 [1] CRAN (R 4.2.0)
#>  utf8                       1.2.2    2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs                      0.4.1    2022-04-13 [1] CRAN (R 4.2.0)
#>  vegan                      2.6-2    2022-04-17 [1] CRAN (R 4.2.0)
#>  vipor                      0.4.5    2017-03-22 [1] CRAN (R 4.2.0)
#>  viridis                    0.6.2    2021-10-13 [1] CRAN (R 4.2.0)
#>  viridisLite                0.4.0    2021-04-13 [1] CRAN (R 4.2.0)
#>  withr                      2.5.0    2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun                       0.31     2022-05-10 [1] CRAN (R 4.2.0)
#>  xml2                       1.3.3    2021-11-30 [1] CRAN (R 4.2.0)
#>  xtable                     1.8-4    2019-04-21 [1] CRAN (R 4.2.0)
#>  XVector                  * 0.36.0   2022-04-26 [1] Bioconductor
#>  yaml                       2.3.5    2022-02-21 [1] CRAN (R 4.2.0)
#>  yulab.utils                0.0.4    2021-10-09 [1] CRAN (R 4.2.0)
#>  zlibbioc                   1.42.0   2022-04-26 [1] Bioconductor
#> 
#>  [1] C:/Users/shettys/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
Created on 2022-07-30 by the [reprex package](https://reprex.tidyverse.org/) (v2.0.1)
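
A side note on the `Veillonella_dispr` and `Lactobacillus_crisptus` spellings seen in the output above: they appear to come from the `gsub("sp.","sp", ...)` step of the reprex rather than from the package data, because an unescaped `.` in a regular expression matches any character. The following is a minimal sketch on three names taken from the output above. `strip_sp_dot` is a hypothetical helper name introduced only for illustration, and `fixed = TRUE` is one possible way to match the literal string `sp.`.

```r
# Names as they look after spaces have been replaced by underscores
taxa <- c(
  "Lactobacillus_crispatus",
  "Veillonella_dispar",
  "Eubacterium_sp._CAG:180"
)

# Pattern as used in the reprex: "." is a wildcard, so "spa" is rewritten too
regex_result <- gsub("sp.", "sp", taxa)

# Hypothetical helper: treat "sp." as a literal string instead of a regex
strip_sp_dot <- function(x) gsub("sp.", "sp", x, fixed = TRUE)
fixed_result <- strip_sp_dot(taxa)

# Side-by-side comparison of the two approaches
print(data.frame(taxa, regex_result, fixed_result))
```

With the literal match, only the `sp.` abbreviation is changed and the species epithets stay intact, so the count of unmatched names in the reprex would likely differ from the one reported.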

microsud, Jul 30 '22 08:07

Hi @microsud, thank you for providing a detailed example – it will be some time before I can study the reported issue. I just wanted to let you know that I've seen it and will get back to you with an update as soon as I can.

schifferl, Aug 04 '22 13:08

Hi @microsud, thank you for your patience regarding this reply. To answer your question quickly: as stated in the NEWS.md file, both "short" and "NCBI" row names are validated against NCBI Taxonomy. This sometimes changes their names and causes the discrepancy between pathway_abundance and relative_abundance row names that you have noted because "short" row names only apply to the relative_abundance data type. To understand the exact details, have a look at the rowData.R file. Otherwise, use the code below to pull out the "long" relative_abundance row names and compare them to the pathway_abundance row names. There are no differences when comparing this way, just some species without pathway matches (which is normal and expected). Hope this helps clarify the issue for you and thanks for using curatedMetagenomicData.

pathway_abundance <-
    curatedMetagenomicData::curatedMetagenomicData("LifeLinesDeep_2016.pathway_abundance", dryrun = FALSE)[[1]]

relative_abundance <-
    curatedMetagenomicData::curatedMetagenomicData("LifeLinesDeep_2016.relative_abundance", dryrun = FALSE)[[1]]

pathway_abundance <-
    base::rownames(pathway_abundance) |>
    stringr::str_extract("g__.+") |>
    stringr::str_replace("\\.s__", "|s__")

pathway_abundance <-
    pathway_abundance[stats::complete.cases(pathway_abundance)]

relative_abundance <-
    base::rownames(relative_abundance) |>
    stringr::str_extract("g__.+")

closest_match <-
    stringdist::amatch(relative_abundance, pathway_abundance, maxDist = Inf)

closest_match <-
    pathway_abundance[closest_match]

tibble::tibble(relative_abundance, closest_match) |>
    dplyr::mutate(distance = stringdist::stringdist(relative_abundance, closest_match)) |>
    dplyr::arrange(distance, relative_abundance) |>
    View()
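
For readers who want to see the string handling without downloading any data, here is a minimal base R sketch of the same idea: keep the part of each row name from `g__` onwards, turn the `.s__` separator of the pathway names into `|s__`, and compare the two sets. The row names are toy strings shaped like those printed in this issue (pathway descriptions shortened, the first lineage abbreviated), and `genus_onwards` is a hypothetical helper, not a function of curatedMetagenomicData. It uses exact matching rather than the approximate matching of the code above.

```r
# Toy pathway_abundance row names (illustrative only, descriptions shortened)
pwy_rows <- c(
  "UNMAPPED",
  "PWY-7332: example pathway",
  "PWY-7332: example pathway|g__Eubacterium.s__Eubacterium_callanderi",
  "PWY-7332: example pathway|unclassified"
)

# Toy "long" relative_abundance row names (first lineage abbreviated,
# second copied from the "dropping rows" message in the issue)
mpa_rows <- c(
  "k__Bacteria|p__Firmicutes|g__Eubacterium|s__Eubacterium_callanderi",
  "k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus|s__Ruminococcus_champanellensis"
)

# Hypothetical helper: keep everything from "g__" onwards, NA if absent
genus_onwards <- function(x) {
  hit <- regexpr("g__.+", x)
  out <- rep(NA_character_, length(x))
  out[hit > 0] <- regmatches(x, hit)
  out
}

# Pathway names join genus and species with ".", long names with "|"
pwy_taxa <- sub(".s__", "|s__", genus_onwards(pwy_rows), fixed = TRUE)
pwy_taxa <- unique(pwy_taxa[!is.na(pwy_taxa)])
mpa_taxa <- genus_onwards(mpa_rows)

# Species in the toy taxonomic profile without any row in the toy pathway table
missing_in_pwy <- setdiff(mpa_taxa, pwy_taxa)
print(pwy_taxa)
print(missing_in_pwy)
```

In this toy input the second species has no pathway row only because none was included; it says nothing about the real LifeLinesDeep_2016 tables.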

schifferl, Sep 17 '22 16:09

Dear @schifferl thanks for the clarification!

microsud, Sep 18 '22 16:09