metacoder icon indicating copy to clipboard operation
metacoder copied to clipboard

parse_phyloseq and n_obs as node_size question

Open pedres opened this issue 4 years ago • 8 comments

Hi, I am just started to work with metacoder, but I have a simple question. When I import a phyloseq object with parse phyloseq I can see the all components correctly imported. However, in the otu_table there is a taxon_id column., and the number of taxon_ids (865) is lower than the number of ASVs (2106) phyloseq_caliginosaNOPOOL.zip

  # A tibble: 6,217 x 49
    taxon_id otu_id DL151 DL152 DL152g 
    <chr>    <chr>  <int> <int>  <int> 
  1 asd      ASV17~     0     0      0    
  2 asd      ASV352     0     0      0
  3 asd      ASV39~     0     0      0

When I look to the tax_table of physeq object these three ASVs do not have the same taxonomy ASV17 Bacteria Acidobacteria Subgroup_6 unclassified unclassified unclassified ASV39 Bacteria Bacteroidetes Bacteroidia Flavobacteriales Flavobacteriaceae Flavobacterium ASV352 Bacteria WPS-2 unclassified unclassified unclassified unclassified

What taxon_id represent? I think that there is not any problem since heat_trees plot represent monophyletic bacterial clades.

Another question is the node_size option. I am following the example tutorial to compare any number of treatments or groups

para_metacoder<-parse_phyloseq(datos_filt) ## attached as zip para_metacoder$data$tax_abund <- calc_taxon_abund(para_metacoder, "otu_table") para_metacoder$data$type_abund <- calc_group_mean(para_metacoder, "tax_abund",cols = para_metacoder$data$sample_data$sample_id, groups =para_metacoder$data$sample_data$type)

para_metacoder$data$tax_occ <- calc_n_samples(para_metacoder, "tax_abund", cols = para_metacoder$data$sample_data$sample_id, groups =para_metacoder$data$sample_data$type)

heat_tree(para_metacoder, node_label = taxon_names, node_size = n_obs, node_color = soil, node_size_axis_label = "OTU count", node_color_axis_label = "Samples with reads", layout = "davidson-harel", # The primary layout algorithm initial_layout = "reingold-tilford")

this first plot works perfectly

para_metacoder$data$diff_table <- compare_groups(para_metacoder, data = "tax_abund", cols = para_metacoder$data$sample_data$sample_id, groups = para_metacoder$data$sample_data$type)

para_metacoder <- mutate_obs(para_metacoder, "diff_table", wilcox_p_value = p.adjust(wilcox_p_value, method = "fdr"))

para_metacoder$data$diff_table$log2_median_ratio[para_metacoder$data$diff_table$wilcox_p_value > 0.05] <- 0

para_metacoder %>% taxa::filter_taxa(taxon_ranks == "Class", supertaxa = TRUE, reassign_obs = FALSE) %>% ### pongo mutate_obs("cleaned_names", gsub(taxon_names, pattern = "\[|\]", replacement = "")) %>% taxa::filter_taxa(grepl(cleaned_names, pattern = "^[a-zA-Z]+$"), reassign_obs = FALSE) %>% heat_tree_matrix(data = "diff_table", node_label = cleaned_names, node_size = n_obs, # number of OTUs node_color = log2_median_ratio, # difference between groups node_color_trans = "linear", node_color_interval = c(-8, 8), # symmetric interval edge_color_interval = c(-8, 8), # symmetric interval node_color_range = diverging_palette(), # diverging colors node_size_axis_label = "OTU count", node_color_axis_label = "Log 2 ratio of median counts", layout = "da", initial_layout = "re", key_size = 0.67, seed = 2) This plots the main heat_tree and the pairwise comparison heat_trees but all nodes have the same size (see attached plot). I cannot find were is the mistake I am doing. comparing_heat_trees

Thank you very much in advance,

Manuel

pedres avatar Apr 27 '20 10:04 pedres

When I import a phyloseq object with parse phyloseq I can see the all components correctly imported. However, in the otu_table there is a taxon_id column., and the number of taxon_ids (865) is lower than the number of ASVs (2106)

This is normal, because there can be multiple ASVs assigned to the same taxon. Each unique taxon is stored once in the object, unlike phyloseq's taxonomy table, which has is one row per ASV and the same value may appear in multiple rows.

When I look to the tax_table of physeq object these three ASVs do not have the same taxonomy

I could not replicate that problem with the data you sent me

> library(metacoder)
> pyobj <- readRDS('~/Downloads/phyloseq_caliginosaNOPOOL.rds')
> tmobj <- parse_phyloseq(pyobj)
> pyobj@tax_table[1:4, ]
Taxonomy Table:     [4 taxa by 6 taxonomic ranks]:
        Kingdom    Phylum         Class          Order          Family         Genus         
ASV3885 "Bacteria" "unclassified" "unclassified" "unclassified" "unclassified" "unclassified"
ASV6082 "Bacteria" "unclassified" "unclassified" "unclassified" "unclassified" "unclassified"
ASV2659 "Bacteria" "WPS-2"        "unclassified" "unclassified" "unclassified" "unclassified"
ASV1700 "Bacteria" "WPS-2"        "unclassified" "unclassified" "unclassified" "unclassified"
> tmobj$data$otu_table[1:4, 1:2]
# A tibble: 4 x 2
  taxon_id otu_id 
  <chr>    <chr>  
1 bba      ASV3885
2 bba      ASV6082
3 bbb      ASV2659
4 bbb      ASV1700
> classifications(tmobj)[tmobj$data$otu_table$taxon_id[1:4]]
                                                                        bba                                                                         bba 
"Bacteria;unclassified;unclassified;unclassified;unclassified;unclassified" "Bacteria;unclassified;unclassified;unclassified;unclassified;unclassified" 
                                                                        bbb                                                                         bbb 
       "Bacteria;WPS-2;unclassified;unclassified;unclassified;unclassified"        "Bacteria;WPS-2;unclassified;unclassified;unclassified;unclassified" 

If you send me code to replicate the problem, I can look into it more.

What taxon_id represent?

It is a unique identifier made for each unique classification. It is used instead of taxon name because the same name can be used for different taxa. For example, an "unclassified" bacteria is different than a "unclassified" fungi.

This plots the main heat_tree and the pairwise comparison heat_trees but all nodes have the same size (see attached plot). I cannot find were is the mistake I am doing.

That is odd. The legend should reflect the size range but is does not. Anyway, I think you issue is that you filtered out all of you ASVs from your "otu_table" with the reassign_obs = FALSE and n_obs is counting the number of rows in that table (because it is the first table) for each taxon, but it all 0. reassign_obs = FALSE is needed when filtering taxon data for plotting, like that in diff_table, but is not what you want for things like OTUs/ASVs, where you want them reassinged to existing taxa if some ranks are filtered out. Unfortunatly you need both type of data for that plot so you will need the more complicated reassign_obs = c(diff_table = FALSE). Try that and let me know if that fixes the issue.

zachary-foster avatar Apr 28 '20 10:04 zachary-foster

Hi,

Thank you very much for your quick answer. It works perfectly. The former problem regarding taxon_id and otu_id classification was because I assume that ASV17~ was ASV17 and not ASV1700 (for example). I was also trying to adapt the diff_table procedure to include DESEq2 output, in a similar way as in issue #240 ERRROR in heat_tree_matrix - missing value where TRUE/FALSE needed. I am using the same data as in the former question plus the deseq dataset attacched But despite I have a similar table I get two different errors when plotting. . I was trying to find errors due to mitchmatch in taxon_id between the taxmap and the diff_table made from deseq2 output.

deseq2_object.zip

Code: res_deseq2<-readRDS("deseq2_object.rds")

maike pairwise comparisons

soil_vs_gut<-DESeq2::results(res_deseq2,cooksCutoff = FALSE, contrast=c("type","gut","soil")) ### soil_vs_gut$otu_id<-rownames(soil_vs_gut) soil_vs_gut$term<-rep("soil_vs_gut",length(rownames(soil_vs_gut))) soil_vs_gut<-data.frame(soil_vs_gut)

gut_vs_cast<-DESeq2::results(res_deseq2,cooksCutoff = FALSE, contrast=c("type","cast","gut")) ### gut_vs_cast$otu_id<-rownames(soil_vs_gut) gut_vs_cast$term<-rep("gut_vs_cast",length(rownames(gut_vs_cast))) gut_vs_cast<-data.frame(gut_vs_cast)

soil_vs_cast<-DESeq2::results(res_deseq2,cooksCutoff = FALSE, contrast=c("type","cast","soil")) ### soil_vs_cast$otu_id<-rownames(soil_vs_cast) soil_vs_cast$term<-rep("soil_vs_cast",length(rownames(soil_vs_cast))) soil_vs_cast<-data.frame(soil_vs_cast)

make taxmap object

pyobj <- readRDS('~/Downloads/phyloseq_caliginosaNOPOOL.rds') para_deseq2<-parse_phyloseq(pyobj)

get taxon_id and otu_id data

taxons<-data.frame(para_deseq2$data$otu_table) taxons<-taxons[,1:2]

arrange data to paste data.frames

taxons<-plyr::arrange(taxons, otu_id) soil_vs_gut<-plyr::arrange(soil_vs_gut, otu_id) soil_vs_cast<-plyr::arrange(soil_vs_cast, otu_id) gut_vs_cast<-plyr::arrange(gut_vs_cast, otu_id)

soil_vs_gut<-data.frame(soil_vs_gut,taxons) soil_vs_cast<-data.frame(soil_vs_cast,taxons) gut_vs_cast<-data.frame(gut_vs_cast,taxons)

res_deseq2<-data.frame(rbind(soil_vs_gut,soil_vs_cast,gut_vs_cast)) res_deseq2$treatment_1 <- gsub("vs.+$", "", res_deseq2$term) ## split pairwise factor "term" into its two components res_deseq2$treatment_2 <- gsub("^.+vs", "", res_deseq2$term) res_deseq2$padjustBIEN<-p.adjust(res_deseq2$pvalue,method="BH") res_deseq2$log2FoldChange[res_deseq2$padjustBIEN > 0.05] <- 0

get intestering columns

res_deseq2<-res_deseq2[,c(9,11,12,2)]

make diff_table

diff_deseq2<-tibble::as_tibble(res_deseq2)

check taxon_ids are the same across all data in taxmap object

para_deseq2$data$tax_abund <- calc_taxon_abund(para_deseq2, "otu_table", cols = para_deseq2$data$sample_data$sample_id) para_deseq2$data$type_abund <- calc_group_mean(para_deseq2, "tax_abund",cols = para_deseq2$data$sample_data$sample_id, groups =para_deseq2$data$sample_data$type) para_deseq2$data$diff_table<-diff_deseq2 para_deseq2$data$tax_abund<-dplyr::arrange(para_deseq2$data$tax_abund,taxon_id) head(para_deseq2$data$tax_abund) para_deseq2$data$type_abund<-dplyr::arrange(para_deseq2$data$type_abund,taxon_id) head(para_deseq2$data$type_abund) para_deseq2$data$diff_table<-dplyr::arrange(para_deseq2$data$diff_table,taxon_id) head(para_deseq2$data$diff_table) para_deseq2$data$otu_table<-dplyr::arrange(para_deseq2$data$otu_table,taxon_id) head(para_deseq2$data$otu_table)

check taxon_id in the original taxmap object is the same in diff_table

identical(sort(unique(para_deseq2$data$diff_table$taxon_id)),sort(unique(para_deseq2$data$otu_table$taxon_id)))

TRUE

heat_tree_matrix(para_deseq2, data = "diff_table", node_label = taxon_names, node_size = n_obs, # number of OTUs node_color = log2FoldChange, # difference between groups node_color_trans = "linear", node_color_interval = c(-8, 8), # symmetric interval edge_color_interval = c(-8, 8), # symmetric interval node_color_range = diverging_palette(), # diverging colors node_size_axis_label = "OTU count", node_color_axis_label = "Log 2 ratio of median counts", layout = "da", initial_layout = "re", key_size = 0.67, seed = 2)

Error: All pairs being compared should have one value per taxon. The following do not: soil vs. gut (2106), soil vs. cast (2106), gut vs. cast (2106)

But when type table(paste(para_deseq2$data$diff_table$treatment_1, para_deseq2$data$diff_table$treatment_2))

gut cast soil cast soil gut 2106 2106 2106

para_deseq2 %>% taxa::filter_taxa(taxon_ranks == "Class", supertaxa = TRUE, reassign_obs = c(diff_table = FALSE)) %>% heat_tree_matrix(data = "diff_table", node_label = taxon_names, node_size = n_obs, # number of OTUs node_color = log2FoldChange, # difference between groups node_color_trans = "linear", node_color_interval = c(-8, 8), # symmetric interval edge_color_interval = c(-8, 8), # symmetric interval node_color_range = diverging_palette(), # diverging colors node_size_axis_label = "OTU count", node_color_axis_label = "Log 2 ratio of median counts", layout = "da", initial_layout = "re", key_size = 0.67, seed = 2)

Error in utils::combn(seq_along(treatments), 2) : n < m In addition: Warning messages: 1: There is no "taxon_id" column in the data set "3", so there are no taxon IDs. 2: The data set "4" is named, but not named by taxon ids.

pedres avatar Apr 28 '20 12:04 pedres

Hi,

I finally was able to solve it following the question and solution posted in thread #240

pedres avatar Apr 30 '20 08:04 pedres

Good, let me know if you have more questions.

zachary-foster avatar Apr 30 '20 21:04 zachary-foster

Hi Zachary,

I wonder if the approach I used is the same as running deseq from a physeq object because running deseq from physeq object you get a results table with 2106 ASVs x 3 comparisons (a factor with three levels) but running deseq from metacoder I get a table with 2,595 rows (865 taxa x 3 comparisons). I have tried to adapt the code below to use the taxmap otu_table in the deseq analysis and it works but when I tried to plot it I get the error Error: All pairs being compared should have one value per taxon. The following do not: soil vs. gut (2106), soil vs. cast (2106), gut vs. cast (2106)

Curiosly, when I add a "Species" level in the tax_table, which are the ASVs identities it works, but I would expect to find 2106 taxon_ids, which in fact I have, but I have 2971 taxa. This works if I use a tax_abund table, but again I do not know why I have 2971 taxa instead the original 2106 ASVs.

datos2<-pyobj Species<-rownames(tax_table(datos2)) tax_table(datos2)<-cbind(tax_table(datos2),Species) para_deseq21<-parse_phyloseq(datos2) para_deseq21$data$tax_abund2 <- calc_taxon_abund(para_deseq21, "otu_table",cols = para_deseq21$data$sample_data$sample_id) sd2 <- para_deseq21$data$sample_data deseq_tax_data2 <- DESeq2::DESeqDataSetFromMatrix(countData = para_deseq21$data$tax_abund2 [,-1], colData = sd, design = ~ type) gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x)) } geoMeans_reactoresDES_X2<-apply(DESeq2::counts(deseq_tax_data2), 1, gm_mean) deseq_tax_data2 = DESeq2::estimateSizeFactors(deseq_tax_data2,type = "poscount", geoMeans = geoMeans_reactoresDES_X2) deseq_tax_data2<-DESeq2::DESeq(deseq_tax_data2, test="Wald", fitType="parametric",parallel=F)

tax_GvS12 <- cbind(taxon_id = para_deseq21$data$tax_abund2$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data2, cooksCutoff = FALSE, contrast=c("type","gut","soil"))) tax_CvS12 <- cbind(taxon_id = para_deseq21$data$tax_abund2$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data2, cooksCutoff = FALSE, contrast=c("type","cast","soil"))) tax_GvC12 <- cbind(taxon_id = para_deseq21$data$tax_abund2$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data2, cooksCutoff = FALSE, contrast=c("type","cast","gut")))

tax_GvS12 <- tidyr::as_tibble(tax_GvS12) tax_GvS12$treatment_1 <- "gut" tax_GvS12$treatment_2 <- "soil"

tax_CvS12 <- tidyr::as_tibble(tax_CvS12) tax_CvS12$treatment_1 <- "cast" tax_CvS12$treatment_2 <- "soil"

tax_GvC12 <- tidyr::as_tibble(tax_GvC12) tax_GvC12$treatment_1 <- "cast" tax_GvC12$treatment_2 <- "gut"

create new list for para_deseq2$data

testM12 <- do.call ("rbind",list(tax_GvS12,tax_CvS12,tax_GvC12)) testM12$treatment_1 <- as.factor (testM12$treatment_1) # tried also as letting them as.character, did not work testM12$treatment_2 <- as.factor (testM12$treatment_2) testM12$padjustBIEN<-p.adjust(testM12$pvalue,method="BH") testM12$log2FoldChange[testM12$padjustBIEN > 0.05] <- 0 #assign to obj$data para_soilgut$data$testM <- testM12

adapted code

pyobj <- readRDS('~/Downloads/phyloseq_caliginosaNOPOOL.rds') para_deseq2<-parse_phyloseq(pyobj ) para_deseq2$data$tax_abund <- calc_taxon_abund(para_deseq2, "otu_table", cols = para_deseq2$data$sample_data$sample_id) pyobj <- readRDS('~/Downloads/phyloseq_caliginosaNOPOOL.rds') para_deseq2<-parse_phyloseq(pyobj) sd <- para_deseq2$data$sample_data all(sd$sample_id == colnames(para_deseq2$data$tax_abund)[-1])

Convert data to DESeq2 format and do differential abundance analysis

deseq_tax_data <- DESeq2::DESeqDataSetFromMatrix(countData = para_deseq2$data$tax_abund [-1], colData = sd, design = ~ type)

gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x)) } geoMeans_reactoresDES_X<-apply(DESeq2::counts(deseq_tax_data), 1, gm_mean) deseq_tax_data = DESeq2::estimateSizeFactors(deseq_tax_data,type = "poscount", geoMeans = geoMeans_reactoresDES_X) deseq_tax_data<-DESeq2::DESeq(deseq_tax_data, test="Wald", fitType="parametric",parallel=F)

tax_GvS <- cbind(taxon_id = para_deseq2$data$tax_abund$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data, cooksCutoff = FALSE, contrast=c("type","gut","soil"))) tax_CvS <- cbind(taxon_id = para_deseq2$data$tax_abund$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data, cooksCutoff = FALSE, contrast=c("type","cast","soil"))) tax_GvC <- cbind(taxon_id = para_deseq2$data$tax_abund$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data, cooksCutoff = FALSE, contrast=c("type","cast","gut")))

tax_GvS <- tidyr::as_tibble(tax_GvS) tax_GvS$treatment_1 <- "gut" tax_GvS$treatment_2 <- "soil"

tax_CvS <- tidyr::as_tibble(tax_CvS) tax_CvS$treatment_1 <- "cast" tax_CvS$treatment_2 <- "soil"

tax_GvC <- tidyr::as_tibble(tax_GvC) tax_GvC$treatment_1 <- "cast" tax_GvC$treatment_2 <- "gut"

create new list for para_deseq2$data

testM <- do.call ("rbind",list(tax_GvS,tax_CvS,tax_GvC)) testM$treatment_1 <- as.factor (testM$treatment_1) # tried also as letting them as.character, did not work testM$treatment_2 <- as.factor (testM$treatment_2) testM$padjustBIEN<-p.adjust(testM$pvalue,method="BH") testM$log2FoldChange[testM$padjustBIEN > 0.05] <- 0 #assign to obj$data para_deseq2$data$testM <- testM

adapted code

sd <- para_deseq2$data$sample_data deseq_tax_data1 <- DESeq2::DESeqDataSetFromMatrix(countData = para_deseq2$data$otu_table [,-(1:2)], colData = sd, design = ~ type)

gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x)) } geoMeans_reactoresDES_X1<-apply(DESeq2::counts(deseq_tax_data1), 1, gm_mean) deseq_tax_data1 = DESeq2::estimateSizeFactors(deseq_tax_data1,type = "poscount", geoMeans = geoMeans_reactoresDES_X1) deseq_tax_data1<-DESeq2::DESeq(deseq_tax_data1, test="Wald", fitType="parametric",parallel=F)

tax_GvS1 <- cbind(taxon_id = para_deseq21$data$otu_table$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data1, cooksCutoff = FALSE, contrast=c("type","gut","soil"))) tax_CvS1 <- cbind(taxon_id = para_deseq21$data$otu_table$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data1, cooksCutoff = FALSE, contrast=c("type","cast","soil"))) tax_GvC1 <- cbind(taxon_id = para_deseq21$data$otu_table$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data1, cooksCutoff = FALSE, contrast=c("type","cast","gut")))

tax_GvS1 <- tidyr::as_tibble(tax_GvS1) tax_GvS1$treatment_1 <- "gut" tax_GvS1$treatment_2 <- "soil"

tax_CvS1 <- tidyr::as_tibble(tax_CvS1) tax_CvS1$treatment_1 <- "cast" tax_CvS1$treatment_2 <- "soil"

tax_GvC1 <- tidyr::as_tibble(tax_GvC1) tax_GvC1$treatment_1 <- "cast" tax_GvC1$treatment_2 <- "gut"

create new list for para_deseq2$data

testM1 <- do.call ("rbind",list(tax_GvS1,tax_CvS1,tax_GvC1)) testM1$treatment_1 <- as.factor (testM1$treatment_1) # tried also as letting them as.character, did not work testM1$treatment_2 <- as.factor (testM1$treatment_2) testM1$padjustBIEN<-p.adjust(testM1$pvalue,method="BH")

pedres avatar May 01 '20 10:05 pedres

I think the issue is that you are trying to plot per-ASV results on a heat tree, which only supports per-taxon data. Which is your goal? To look at deferentially abundant taxa or deferentially abundant ASVs? If ASVs, how do you want to encode that information per-taxon? Mean difference? Median? Number of deferentially abundant ASVs per taxon, etc?

A way I like to approach this is to add the ASV as a level in the taxonomy when you do the analysis, so that below species is a "taxonomic rank" called ASV. So before you parse your data to a taxmap, add another taxonomic rank to the input data and put the names of the ASVs in there and you will do both the per-taxon and per-ASV analysis at the same time.

You can try this code I made to run deseq2 from a taxmap object. I have not gotten around to testing it enough to include it in the package yet, but it seems to work as far as I can tell. Might give you some ideas anyway.

#' Differential abundance with DESeq2
#'
#' EXPERIMENTAL: This function is still being tested and developed; use with caution. Uses the
#' \code{\link{DESeq2}} package to conduct differential abundance analysis of count data. Counts can
#' be of OTUs/ASVs or taxa. The plotting function \code{\link{heat_tree_matrix}} is useful for
#' visualizing these results. See details section below for considerations on preparing data for
#' this analysis.
#'
#' Data should be raw read counts, not rarefied, converted to proportions, or modified with any
#' other technique designed to correct for sample size since \code{\link{DESeq2}} is designed to be
#' used with count data and takes into accoutn unequal sample size when determining differential
#' abundance. Warnings will be given if the data is not integers or all sample sizes are equal.
#'
#' @param obj A \code{\link[taxa]{taxmap}} object
#' @param data The name of a table in \code{obj} that contains data for each sample in columns.
#' @param cols The names/indexes of columns in \code{data} to use. By default, all numeric columns
#'   are used. Takes one of the following inputs: \describe{ \item{TRUE/FALSE:}{All/No columns will
#'   used.} \item{Character vector:}{The names of columns to use} \item{Numeric vector:}{The indexes
#'   of columns to use} \item{Vector of TRUE/FALSE of length equal to the number of columns:}{Use
#'   the columns corresponding to \code{TRUE} values.} }
#' @param groups A vector defining how samples are grouped into "treatments". Must be the same order
#'   and length as \code{cols}.
#' @param other_cols If \code{TRUE}, preserve all columns not in \code{cols} in the output. If
#'   \code{FALSE}, dont keep other columns. If a column names or indexes are supplied, only preserve
#'   those columns.
#' @param lfc_shrinkage What technique to use to adjust the log fold change results for low counts.
#'   Useful for ranking and visualizing log fold changes. Must be one of the following:
#' \describe{
#'   \item{'none'}{No log fold change adjustments.}
#'   \item{'normal'}{The original DESeq2 shrinkage estimator}
#'   \item{'ashr'}{Adaptive shrinkage estimator from the \code{\link{ashr}} package, using a fitted mixture of normals prior.}
#' }
#' @param ... Passed to \code{\link[DESeq2]{results}} if the \code{lfc_shrinkage} option is "none"
#'   and to \code{\link[DESeq2]{lfcShrink}} otherwise.
#' 
#' @return A tibble with at least the taxon ID of the thing tested, the groups compared, and the
#'   DESeq2 results. The \code{log2FoldChange} values will be positive if \code{treatment_1} is more
#'   abundant and \code{treatment_2}.
#'
#' @family calculations
#'
#' @examples
#' \dontrun{
#' # Parse data for plotting
#' x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
#'                    class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
#'                    class_regex = "^(.+)__(.+)$")
#'
#' # Get per-taxon counts
#' x$data$tax_table <- calc_taxon_abund(x, data = "tax_data", cols = hmp_samples$sample_id)
#'
#' # Calculate difference between groups
#' x$data$diff_table <- calc_diff_abund_deseq2(x, data = "tax_table",
#'                                     cols = hmp_samples$sample_id,
#'                                     groups = hmp_samples$body_site)
#'                                     
#' # Plot results (might take a few minutes)
#' heat_tree_matrix(x,
#'                  data = "diff_table",
#'                  node_size = n_obs,
#'                  node_label = taxon_names,
#'                  node_color = ifelse(is.na(padj) | padj > 0.05, 0, log2FoldChange),
#'                  node_color_range = diverging_palette(),
#'                  node_color_trans = "linear",
#'                  node_color_interval = c(-3, 3),
#'                  edge_color_interval = c(-3, 3),
#'                  node_size_axis_label = "Number of OTUs",
#'                  node_color_axis_label = "Log2 fold change")
#'
#' }
#'
#' @export
calc_diff_abund_deseq2 <- function(obj, data, cols, groups, other_cols = FALSE, 
                                   lfc_shrinkage = c('none', 'normal', 'ashr'), ...) {
  
  # Check that DESeq2 is intalled
  if (! requireNamespace("DESeq2", quietly = TRUE)) {
    stop('The "DESeq2" package needs to be installed for this function to work.',
         call. = FALSE)
  }
  
  # Parse options
  lfc_shrinkage <- match.arg(lfc_shrinkage)

  # Get abundance by sample data
  abund_data <- metacoder:::get_taxmap_table(obj, data)
  
  # Parse columns to use
  cols <- metacoder:::get_numeric_cols(obj, data, cols)
  
  # Check that columns contain integers only
  col_is_int <- vapply(cols, FUN.VALUE = logical(1), function(c) {
    all(abund_data[[c]] %% 1 == 0)
  })
  non_int_cols <- cols[! col_is_int]
  if (length(non_int_cols) > 0) {
    stop(call. = FALSE,
         'All data given to DESeq2 should be untransformed counts. The following columns contain non-integers:\n',
         limited_print(type = 'silent', non_int_cols, prefix = '  '))
  }
  
  # Check for equal sample sizes
  if (length(unique(colSums(abund_data[cols]))) == 1) {
    warning(call. = FALSE,
            'All columns have equal counts, suggesting counts were normalized (e.g. rarefied) to correct for sample size variation.',
            ' DESeq2 is designed to be used with unnormalized data.',
            ' Use untransformed counts if available.')
  }
  
  # Check groups option
  groups <- metacoder:::check_option_groups(groups, cols)
  
  # Find other columns
  #   These might be added back to the output later
  other_cols <- metacoder:::get_taxmap_other_cols(obj, data, cols, other_cols)
  
  # Get every combination of groups to compare
  combinations <- t(utils::combn(unique(groups), 2))
  combinations <- lapply(seq_len(nrow(combinations)), function(i) combinations[i, ])
  
  # Format data for DESeq2
  metadata <- data.frame(group = groups)
  deseq_data <- DESeq2::DESeqDataSetFromMatrix(countData = abund_data[cols],
                                               colData = metadata,
                                               design = ~ group)
  
  # Run DESeq2
  deseq_result <- DESeq2::DESeq(deseq_data)
  
  # Make function to compare one pair of groups
  one_comparison <- function(treat_1, treat_2) {
    
    # Extract results for a single pair
    if (lfc_shrinkage == 'none') {
      pair_result <- DESeq2::results(deseq_result, contrast = c('group', treat_1, treat_2), ...)
    } else {
      pair_result <- DESeq2::lfcShrink(deseq_result, contrast = c('group', treat_1, treat_2),
                                       type = lfc_shrinkage, ...)
    }

    # Add treatments compared
    output <- cbind(data.frame(stringsAsFactors = FALSE,
                               treatment_1 = treat_1,
                               treatment_2 = treat_2),
                    as.data.frame(pair_result))
    
    # Add in other columns if specified
    output <- cbind(abund_data[, other_cols], output)
    
    return(output)
  }
  
  # Compare all pairs of treatments and combine
  output <- lapply(seq_len(length(combinations)), function(i) {
    one_comparison(combinations[[i]][1], combinations[[i]][2])
  })
  output <- do.call(rbind, output)
  
  # Convert to tibble and return
  dplyr::as.tbl(output)
}

zachary-foster avatar May 01 '20 19:05 zachary-foster

I want to look at deferentially abundant ASVs. In fact, yesterday I tried doing it addind ASVs as the last "taxonomic" rank, and it worked (see pdf attached). However, as I commented before I cannot understand why I got 2971 taxa from an physeq with 2106 ASVs. pyobj <- readRDS('/Downloads/phyloseq_caliginosaNOPOOL.rds') datos2<-pyobj ### 2106 ASVs Species<-rownames(tax_table(datos2)) tax_table(datos2)<-cbind(tax_table(datos2),Species) para_deseq21<-parse_phyloseq(datos2) ## 2971 taxa and 2106 ASVs para_deseq21$data$tax_table <- calc_taxon_abund(para_deseq21, data = "otu_table", cols = para_soilgut$data$sample_data$sample_id) para_deseq21$data$diff_table <- calc_diff_abund_deseq2(para_deseq21, data = "tax_table", cols = para_soilgut$data$sample_data$sample_id, groups = para_soilgut$data$sample_data$type)

para_deseq21$data$diff_table$padjustBIEN<-p.adjust(para_deseq21$data$diff_table$pvalue,method="BH") para_deseq21$data$diff_table$log2FoldChange[para_deseq21$data$diff_table$padjustBIEN > 0.05] <- 0

I prefer to plot these data highlighting only positive and negative ASVs and left to supplementary a classical MAplot or the output table. heat_tree_matrix(para_deseq21, data = "diff_table", node_size = n_obs, node_label = taxon_names, node_color = ifelse(log2FoldChange == 0, "grey", ifelse(log2FoldChange<0, "navajowhite3","olivedrab4")), #node_color_range = diverging_palette(), #node_color_trans = "linear", #node_color_interval = c(-8, 27), # min and max from summary above #edge_color_interval = c(-8, 27), node_size_axis_label = "Number of OTUs", node_color_axis_label = "Log2 fold change", edge_label = n_obs, seed = 1, layout = "davidson-harel", initial_layout = "reingold-tilford")

I would suggest to "re-adjust" p-values due to multiple comparisons, maybe internally to function (output $padjustOK<-p.adjust(output $pvalue,method="fdr")), as a warning in the output ( # Convert to tibble and return dplyr::as.tbl(output) print "You should consider to adjutst your p-values due to multiple comparisons"), or as an option in the function (calc_diff_abund_deseq2 <- function(obj, data, cols, groups, other_cols = FALSE, multp_comp_correct,...). This could take TRUE (default) or (FALSE) and then apply or not the padjust before tibbleling the output

I prefer to plot these data

deseq2_plot_metacoderOK.pdf

pedres avatar May 02 '20 09:05 pedres

. However, as I commented before I cannot understand why I got 2971 taxa from an physeq with 2106 ASVs.

I am not sure I have the same data set as you. I see 6452 ASVs:

> dim(datos2@otu_table)
[1]   48 6452

If you included ASV as a rank in the taxonomy, then it would be expected that you wuld get more taxa than ASVs, but each ASV is a taxon and you still have all the internal taxa (e.g. "Streptomyces"). A "taxon" in taxmap object is each unique classification, including taxa on the inside of the tree, even if no data is assigned to them, whereas in phyloseq, a row in the taxonomy table is a classificaiton with multiple taxa. Foe example

        Kingdom    Phylum         Class          Order          Family         Genus         
ASV3885 "Bacteria" "unclassified" "unclassified" "unclassified" "unclassified" "unclassified"

is one row in a phyloseq object but is 6 taxa in a taxmap object:

  • "Bacteria"
  • "Bacteria" "unclassified"
  • "Bacteria" "unclassified" "unclassified"
  • "Bacteria" "unclassified" "unclassified" "unclassified"
  • "Bacteria" "unclassified" "unclassified" "unclassified" "unclassified"
  • "Bacteria" "unclassified" "unclassified" "unclassified" "unclassified" "unclassified"

I prefer to plot these data highlighting only positive and negative ASVs and left to supplementary a classical MAplot or the output table.

You can filter taxa based on the p-value before plotting to do this.

I would suggest to "re-adjust" p-values due to multiple comparisons, maybe internally to function (output $padjustOK<-p.adjust(output $pvalue,method="fdr")), as a warning in the output ( # Convert to tibble and return dplyr::as.tbl(output) print "You should consider to adjutst your p-values due to multiple comparisons"), or as an option in the function (calc_diff_abund_deseq2 <- function(obj, data, cols, groups, other_cols = FALSE, multp_comp_correct,...). This could take TRUE (default) or (FALSE) and then apply or not the padjust before tibbleling the output

Yea, thats a good idea

zachary-foster avatar May 04 '20 22:05 zachary-foster