Why does `tax_table,data.frame-method` not use the existing row and column names?

Open mikemc opened this issue 7 years ago • 9 comments

Given that it seems natural for users to read their taxonomy info into a data frame with the rank names as column names and the taxa names as row names, I'm wondering what the reason is for the tax_table data.frame method dropping the existing names and replacing them with the default "sp#" and "ta#" names.

mikemc avatar Apr 18 '19 12:04 mikemc

My motivation for asking is that I'm making tibble ("tbl_df") methods for otu_table, tax_table, and sample_data that can handle the taxa and sample names being a data frame column rather than rownames. I'm currently putting these in a simple add-on package but would create a pull request if there is interest in supporting this in phyloseq, so want to check if there is some reason that data frames as input are discouraged.
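
As a rough illustration of the idea (not the add-on package's actual code), moving an ID column into row names can be done in base R. The helper name `tax_df_to_matrix` and the column name `taxon` are hypothetical, chosen only for this sketch:

```r
# Hypothetical helper (not part of phyloseq): turn a data frame whose taxa
# names live in a column into a character matrix with taxa names as row names.
tax_df_to_matrix <- function(df, id_col = "taxon") {
  stopifnot(id_col %in% names(df))
  ids <- as.character(df[[id_col]])
  stopifnot(!anyDuplicated(ids))            # taxa names must be unique
  rank_cols <- setdiff(names(df), id_col)   # remaining columns are the ranks
  m <- as.matrix(df[, rank_cols, drop = FALSE])
  rownames(m) <- ids
  m
}

# Toy input with made-up taxa IDs
tax_df <- data.frame(
  taxon   = c("OTU_1", "OTU_2"),
  Kingdom = c("Archaea", "Archaea"),
  Phylum  = c("Crenarchaeota", "Euryarchaeota"),
  stringsAsFactors = FALSE
)
tax_mat <- tax_df_to_matrix(tax_df)
dimnames(tax_mat)
```

The resulting matrix has the taxa IDs as row names and the rank names as column names, which is the shape a matrix-based constructor would expect.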

mikemc avatar Apr 18 '19 15:04 mikemc

Mike, I support this idea and definitely want to go further down the tidyverse route. The question is more whether we try to do many things at once or piecewise. Susan

spholmes avatar Apr 18 '19 17:04 spholmes

The quick answer to your issue title/description is that the following works and keeps the index names:

```r
physeq %>% tax_table() %>% as("matrix") %>% data.frame() %>% head
```

taxonomyTable class is based on "matrix", so something fishy is going on if you skip the step of handing it off first as a matrix, then coercing to a data.frame. I noticed a performance hit to skipping the matrix step as well, though I also admit that I can't reproduce your issue. I get the index names in both cases... Typically the dummy index names only occur in the direction of coercing an object to a phyloseq component, and this should only happen when the immediate argument to the constructor is missing index names (e.g. dimnames(x) returns NULL).

The broader question of pull requests, etc.: I'm in favor of pull requests for things that are natural extensions that require backward-compatibility maneuvers. I'm open to a new refactored package that avoids some of my worst early mistakes and plays nice with tidyverse, data.table, and Bioconductor core. Happy to chat more about that in a different channel.

joey711 avatar Apr 18 '19 19:04 joey711

@joey711 Sorry, I should have posted a code example to be clearer about what I was asking. I'm asking about going in the other direction: creating a taxonomyTable from a data.frame, which uses the S4 method tax_table() defined for data.frame input.

```r
library(phyloseq)
library(dplyr)
data(GlobalPatterns)
# Get a data.frame with tax information to use as an example;
# it has row names and column names for taxa and rank names
df <- GlobalPatterns %>% tax_table() %>% as("matrix") %>% data.frame() %>% head
# Now create a taxonomyTable. The row names and column names get discarded:
tax_table(df)
#> Taxonomy Table:     [6 taxa by 7 taxonomic ranks]:
#>     ta1       ta2             ta3            ta4
#> sp1 "Archaea" "Crenarchaeota" "Thermoprotei" NA
#> sp2 "Archaea" "Crenarchaeota" "Thermoprotei" NA
#> sp3 "Archaea" "Crenarchaeota" "Thermoprotei" "Sulfolobales"
#> sp4 "Archaea" "Crenarchaeota" "Sd-NA"        NA
#> sp5 "Archaea" "Crenarchaeota" "Sd-NA"        NA
#> sp6 "Archaea" "Crenarchaeota" "Sd-NA"        NA
#>     ta5             ta6          ta7
#> sp1 NA              NA           NA
#> sp2 NA              NA           NA
#> sp3 "Sulfolobaceae" "Sulfolobus" "Sulfolobusacidocaldarius"
#> sp4 NA              NA           NA
#> sp5 NA              NA           NA
#> sp6 NA              NA           NA
```

along with the warning

```r
#> Warning message:
#> |In .local(object) : Coercing from data.frame class to character matrix
#> |prior to building taxonomyTable.
#> |This could introduce artifacts.
#> |Check your taxonomyTable, or coerce to matrix manually.
```

The warning and the discarding of the row and column names suggest that creating taxonomyTables from data.frames is not supported or encouraged, and I'm wondering whether there was a reason for that.

mikemc avatar Apr 21 '19 18:04 mikemc

@mikemc The truth is I don't recall the original motivation for picking "matrix" over data.frame in this case. It does have the advantage that all columns are character, which avoids some fiddly errors that come up with data.frames. This might be the reason. It does make sense to include (or fix?) the coercion method dispatched at the step "Coercing from data.frame class to character matrix" that you've MRE-ed above. My guess, without peeking at the old code, is that I didn't include an explicit coercion method for data.frame --> taxonomyTable, and the default behavior of the matrix conversion is to drop the index names, which is weird. Whatever the actual behavior, it should be an easy fix.
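
For what it's worth, the base R behavior can be checked without phyloseq. The sketch below uses a made-up two-taxon data frame and shows that direct coercion to a matrix keeps the names, while flattening to a character vector and rebuilding with `matrix()` loses them. Whether the phyloseq method does the latter is an assumption here, not something confirmed from its source:

```r
# Toy taxonomy data frame (made-up taxa IDs; not taken from GlobalPatterns)
df <- data.frame(
  Kingdom = c("Archaea", "Archaea"),
  Phylum  = c("Crenarchaeota", "Euryarchaeota"),
  row.names = c("OTU_1", "OTU_2"),
  stringsAsFactors = FALSE
)

# Direct coercion keeps both row and column names
m_direct <- as.matrix(df)
dimnames(m_direct)

# Flattening to a character vector and rebuilding drops them:
# as.character() strips attributes, and matrix() adds no dimnames by default
m_rebuilt <- matrix(as.character(m_direct), nrow = nrow(df), ncol = ncol(df))
dimnames(m_rebuilt)
#> NULL
```

The values themselves survive the round trip in the same positions; only the dimnames are lost, which would be consistent with the placeholder "sp#" and "ta#" names being filled in afterwards.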

joey711 avatar Apr 22 '19 23:04 joey711

@joey711 So is there a solution? I have the same problem. I have my otu and tax tables stored as data.frames as well and get the same warning messages. Please note that I am an absolute beginner in R.

Thanks for the help!

mr2raccoon avatar Apr 05 '21 10:04 mr2raccoon

@mr2raccoon You should be able to simply convert your data frames to matrices first, with as(df, "matrix"), and then supply them to the phyloseq functions.
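
A minimal sketch of that workaround, using made-up toy tables and `as.matrix()` (the base R counterpart of `as(df, "matrix")`). The phyloseq calls are guarded so the conversion part runs even where the package is not installed:

```r
# Toy inputs (made-up IDs and counts, for illustration only)
tax_df <- data.frame(
  Kingdom = c("Archaea", "Archaea"),
  Phylum  = c("Crenarchaeota", "Euryarchaeota"),
  row.names = c("OTU_1", "OTU_2"),
  stringsAsFactors = FALSE
)
otu_df <- data.frame(
  SampleA = c(10, 0),
  SampleB = c(3, 7),
  row.names = c("OTU_1", "OTU_2")
)

# Convert to matrices first; row and column names carry over
tax_mat <- as.matrix(tax_df)   # character matrix
otu_mat <- as.matrix(otu_df)   # numeric matrix

# The phyloseq calls run only if the package is available
if (requireNamespace("phyloseq", quietly = TRUE)) {
  tt  <- phyloseq::tax_table(tax_mat)
  otu <- phyloseq::otu_table(otu_mat, taxa_are_rows = TRUE)
  print(rownames(tt))  # taxa names taken from the matrix
  print(colnames(tt))  # rank names taken from the matrix
}
```

Because the matrices already carry dimnames, the matrix methods have no need to fall back to placeholder names, and the data.frame coercion warning is not triggered.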

mikemc avatar Apr 05 '21 14:04 mikemc

Thanks for the reply! I actually still managed that yesterday by myself! Steep learning curve. Frustrating but rewarding.

mr2raccoon avatar Apr 06 '21 07:04 mr2raccoon

> Thanks for the reply! I actually still managed that yesterday by myself! Steep learning curve. Frustrating but rewarding.

Can you please share how you solved the issue?

SelfShubham avatar Dec 02 '23 13:12 SelfShubham