metacoder
metacoder copied to clipboard
parse_phyloseq and n_obs as node_size question
Hi, I am just started to work with metacoder, but I have a simple question. When I import a phyloseq object with parse phyloseq I can see the all components correctly imported. However, in the otu_table there is a taxon_id column., and the number of taxon_ids (865) is lower than the number of ASVs (2106) phyloseq_caliginosaNOPOOL.zip
# A tibble: 6,217 x 49
taxon_id otu_id DL151 DL152 DL152g
<chr> <chr> <int> <int> <int>
1 asd ASV17~ 0 0 0
2 asd ASV352 0 0 0
3 asd ASV39~ 0 0 0
When I look to the tax_table of physeq object these three ASVs do not have the same taxonomy ASV17 Bacteria Acidobacteria Subgroup_6 unclassified unclassified unclassified ASV39 Bacteria Bacteroidetes Bacteroidia Flavobacteriales Flavobacteriaceae Flavobacterium ASV352 Bacteria WPS-2 unclassified unclassified unclassified unclassified
What taxon_id represent? I think that there is not any problem since heat_trees plot represent monophyletic bacterial clades.
Another question is the node_size option. I am following the example tutorial to compare any number of treatments or groups
para_metacoder<-parse_phyloseq(datos_filt) ## attached as zip para_metacoder$data$tax_abund <- calc_taxon_abund(para_metacoder, "otu_table") para_metacoder$data$type_abund <- calc_group_mean(para_metacoder, "tax_abund",cols = para_metacoder$data$sample_data$sample_id, groups =para_metacoder$data$sample_data$type)
para_metacoder$data$tax_occ <- calc_n_samples(para_metacoder, "tax_abund", cols = para_metacoder$data$sample_data$sample_id, groups =para_metacoder$data$sample_data$type)
heat_tree(para_metacoder, node_label = taxon_names, node_size = n_obs, node_color = soil, node_size_axis_label = "OTU count", node_color_axis_label = "Samples with reads", layout = "davidson-harel", # The primary layout algorithm initial_layout = "reingold-tilford")
this first plot works perfectly
para_metacoder$data$diff_table <- compare_groups(para_metacoder, data = "tax_abund", cols = para_metacoder$data$sample_data$sample_id, groups = para_metacoder$data$sample_data$type)
para_metacoder <- mutate_obs(para_metacoder, "diff_table", wilcox_p_value = p.adjust(wilcox_p_value, method = "fdr"))
para_metacoder$data$diff_table$log2_median_ratio[para_metacoder$data$diff_table$wilcox_p_value > 0.05] <- 0
para_metacoder %>%
taxa::filter_taxa(taxon_ranks == "Class", supertaxa = TRUE, reassign_obs = FALSE) %>% ### pongo mutate_obs("cleaned_names", gsub(taxon_names, pattern = "\[|\]", replacement = "")) %>%
taxa::filter_taxa(grepl(cleaned_names, pattern = "^[a-zA-Z]+$"), reassign_obs = FALSE) %>%
heat_tree_matrix(data = "diff_table",
node_label = cleaned_names,
node_size = n_obs, # number of OTUs
node_color = log2_median_ratio, # difference between groups
node_color_trans = "linear",
node_color_interval = c(-8, 8), # symmetric interval
edge_color_interval = c(-8, 8), # symmetric interval
node_color_range = diverging_palette(), # diverging colors
node_size_axis_label = "OTU count",
node_color_axis_label = "Log 2 ratio of median counts",
layout = "da", initial_layout = "re",
key_size = 0.67,
seed = 2)
This plots the main heat_tree and the pairwise comparison heat_trees but all nodes have the same size (see attached plot). I cannot find were is the mistake I am doing.
Thank you very much in advance,
Manuel
When I import a phyloseq object with parse phyloseq I can see the all components correctly imported. However, in the otu_table there is a taxon_id column., and the number of taxon_ids (865) is lower than the number of ASVs (2106)
This is normal, because there can be multiple ASVs assigned to the same taxon. Each unique taxon is stored once in the object, unlike phyloseq's taxonomy table, which has is one row per ASV and the same value may appear in multiple rows.
When I look to the tax_table of physeq object these three ASVs do not have the same taxonomy
I could not replicate that problem with the data you sent me
> library(metacoder)
> pyobj <- readRDS('~/Downloads/phyloseq_caliginosaNOPOOL.rds')
> tmobj <- parse_phyloseq(pyobj)
> pyobj@tax_table[1:4, ]
Taxonomy Table: [4 taxa by 6 taxonomic ranks]:
Kingdom Phylum Class Order Family Genus
ASV3885 "Bacteria" "unclassified" "unclassified" "unclassified" "unclassified" "unclassified"
ASV6082 "Bacteria" "unclassified" "unclassified" "unclassified" "unclassified" "unclassified"
ASV2659 "Bacteria" "WPS-2" "unclassified" "unclassified" "unclassified" "unclassified"
ASV1700 "Bacteria" "WPS-2" "unclassified" "unclassified" "unclassified" "unclassified"
> tmobj$data$otu_table[1:4, 1:2]
# A tibble: 4 x 2
taxon_id otu_id
<chr> <chr>
1 bba ASV3885
2 bba ASV6082
3 bbb ASV2659
4 bbb ASV1700
> classifications(tmobj)[tmobj$data$otu_table$taxon_id[1:4]]
bba bba
"Bacteria;unclassified;unclassified;unclassified;unclassified;unclassified" "Bacteria;unclassified;unclassified;unclassified;unclassified;unclassified"
bbb bbb
"Bacteria;WPS-2;unclassified;unclassified;unclassified;unclassified" "Bacteria;WPS-2;unclassified;unclassified;unclassified;unclassified"
If you send me code to replicate the problem, I can look into it more.
What taxon_id represent?
It is a unique identifier made for each unique classification. It is used instead of taxon name because the same name can be used for different taxa. For example, an "unclassified" bacteria is different than a "unclassified" fungi.
This plots the main heat_tree and the pairwise comparison heat_trees but all nodes have the same size (see attached plot). I cannot find were is the mistake I am doing.
That is odd. The legend should reflect the size range but is does not. Anyway, I think you issue is that you filtered out all of you ASVs from your "otu_table" with the reassign_obs = FALSE
and n_obs
is counting the number of rows in that table (because it is the first table) for each taxon, but it all 0. reassign_obs = FALSE
is needed when filtering taxon data for plotting, like that in diff_table
, but is not what you want for things like OTUs/ASVs, where you want them reassinged to existing taxa if some ranks are filtered out. Unfortunatly you need both type of data for that plot so you will need the more complicated reassign_obs = c(diff_table = FALSE)
. Try that and let me know if that fixes the issue.
Hi,
Thank you very much for your quick answer. It works perfectly. The former problem regarding taxon_id and otu_id classification was because I assume that ASV17~ was ASV17 and not ASV1700 (for example). I was also trying to adapt the diff_table procedure to include DESEq2 output, in a similar way as in issue #240 ERRROR in heat_tree_matrix - missing value where TRUE/FALSE needed. I am using the same data as in the former question plus the deseq dataset attacched But despite I have a similar table I get two different errors when plotting. . I was trying to find errors due to mitchmatch in taxon_id between the taxmap and the diff_table made from deseq2 output.
Code: res_deseq2<-readRDS("deseq2_object.rds")
maike pairwise comparisons
soil_vs_gut<-DESeq2::results(res_deseq2,cooksCutoff = FALSE, contrast=c("type","gut","soil")) ### soil_vs_gut$otu_id<-rownames(soil_vs_gut) soil_vs_gut$term<-rep("soil_vs_gut",length(rownames(soil_vs_gut))) soil_vs_gut<-data.frame(soil_vs_gut)
gut_vs_cast<-DESeq2::results(res_deseq2,cooksCutoff = FALSE, contrast=c("type","cast","gut")) ### gut_vs_cast$otu_id<-rownames(soil_vs_gut) gut_vs_cast$term<-rep("gut_vs_cast",length(rownames(gut_vs_cast))) gut_vs_cast<-data.frame(gut_vs_cast)
soil_vs_cast<-DESeq2::results(res_deseq2,cooksCutoff = FALSE, contrast=c("type","cast","soil")) ### soil_vs_cast$otu_id<-rownames(soil_vs_cast) soil_vs_cast$term<-rep("soil_vs_cast",length(rownames(soil_vs_cast))) soil_vs_cast<-data.frame(soil_vs_cast)
make taxmap object
pyobj <- readRDS('~/Downloads/phyloseq_caliginosaNOPOOL.rds') para_deseq2<-parse_phyloseq(pyobj)
get taxon_id and otu_id data
taxons<-data.frame(para_deseq2$data$otu_table) taxons<-taxons[,1:2]
arrange data to paste data.frames
taxons<-plyr::arrange(taxons, otu_id) soil_vs_gut<-plyr::arrange(soil_vs_gut, otu_id) soil_vs_cast<-plyr::arrange(soil_vs_cast, otu_id) gut_vs_cast<-plyr::arrange(gut_vs_cast, otu_id)
soil_vs_gut<-data.frame(soil_vs_gut,taxons) soil_vs_cast<-data.frame(soil_vs_cast,taxons) gut_vs_cast<-data.frame(gut_vs_cast,taxons)
res_deseq2<-data.frame(rbind(soil_vs_gut,soil_vs_cast,gut_vs_cast)) res_deseq2$treatment_1 <- gsub("vs.+$", "", res_deseq2$term) ## split pairwise factor "term" into its two components res_deseq2$treatment_2 <- gsub("^.+vs", "", res_deseq2$term) res_deseq2$padjustBIEN<-p.adjust(res_deseq2$pvalue,method="BH") res_deseq2$log2FoldChange[res_deseq2$padjustBIEN > 0.05] <- 0
get intestering columns
res_deseq2<-res_deseq2[,c(9,11,12,2)]
make diff_table
diff_deseq2<-tibble::as_tibble(res_deseq2)
check taxon_ids are the same across all data in taxmap object
para_deseq2$data$tax_abund <- calc_taxon_abund(para_deseq2, "otu_table", cols = para_deseq2$data$sample_data$sample_id) para_deseq2$data$type_abund <- calc_group_mean(para_deseq2, "tax_abund",cols = para_deseq2$data$sample_data$sample_id, groups =para_deseq2$data$sample_data$type) para_deseq2$data$diff_table<-diff_deseq2 para_deseq2$data$tax_abund<-dplyr::arrange(para_deseq2$data$tax_abund,taxon_id) head(para_deseq2$data$tax_abund) para_deseq2$data$type_abund<-dplyr::arrange(para_deseq2$data$type_abund,taxon_id) head(para_deseq2$data$type_abund) para_deseq2$data$diff_table<-dplyr::arrange(para_deseq2$data$diff_table,taxon_id) head(para_deseq2$data$diff_table) para_deseq2$data$otu_table<-dplyr::arrange(para_deseq2$data$otu_table,taxon_id) head(para_deseq2$data$otu_table)
check taxon_id in the original taxmap object is the same in diff_table
identical(sort(unique(para_deseq2$data$diff_table$taxon_id)),sort(unique(para_deseq2$data$otu_table$taxon_id)))
TRUE
heat_tree_matrix(para_deseq2, data = "diff_table", node_label = taxon_names, node_size = n_obs, # number of OTUs node_color = log2FoldChange, # difference between groups node_color_trans = "linear", node_color_interval = c(-8, 8), # symmetric interval edge_color_interval = c(-8, 8), # symmetric interval node_color_range = diverging_palette(), # diverging colors node_size_axis_label = "OTU count", node_color_axis_label = "Log 2 ratio of median counts", layout = "da", initial_layout = "re", key_size = 0.67, seed = 2)
Error: All pairs being compared should have one value per taxon. The following do not: soil vs. gut (2106), soil vs. cast (2106), gut vs. cast (2106)
But when type table(paste(para_deseq2$data$diff_table$treatment_1, para_deseq2$data$diff_table$treatment_2))
gut cast soil cast soil gut 2106 2106 2106
para_deseq2 %>% taxa::filter_taxa(taxon_ranks == "Class", supertaxa = TRUE, reassign_obs = c(diff_table = FALSE)) %>% heat_tree_matrix(data = "diff_table", node_label = taxon_names, node_size = n_obs, # number of OTUs node_color = log2FoldChange, # difference between groups node_color_trans = "linear", node_color_interval = c(-8, 8), # symmetric interval edge_color_interval = c(-8, 8), # symmetric interval node_color_range = diverging_palette(), # diverging colors node_size_axis_label = "OTU count", node_color_axis_label = "Log 2 ratio of median counts", layout = "da", initial_layout = "re", key_size = 0.67, seed = 2)
Error in utils::combn(seq_along(treatments), 2) : n < m In addition: Warning messages: 1: There is no "taxon_id" column in the data set "3", so there are no taxon IDs. 2: The data set "4" is named, but not named by taxon ids.
Hi,
I finally was able to solve it following the question and solution posted in thread #240
Good, let me know if you have more questions.
Hi Zachary,
I wonder if the approach I used is the same as running deseq from a physeq object because running deseq from physeq object you get a results table with 2106 ASVs x 3 comparisons (a factor with three levels) but running deseq from metacoder I get a table with 2,595 rows (865 taxa x 3 comparisons). I have tried to adapt the code below to use the taxmap otu_table in the deseq analysis and it works but when I tried to plot it I get the error Error: All pairs being compared should have one value per taxon. The following do not: soil vs. gut (2106), soil vs. cast (2106), gut vs. cast (2106)
Curiosly, when I add a "Species" level in the tax_table, which are the ASVs identities it works, but I would expect to find 2106 taxon_ids, which in fact I have, but I have 2971 taxa. This works if I use a tax_abund table, but again I do not know why I have 2971 taxa instead the original 2106 ASVs.
datos2<-pyobj Species<-rownames(tax_table(datos2)) tax_table(datos2)<-cbind(tax_table(datos2),Species) para_deseq21<-parse_phyloseq(datos2) para_deseq21$data$tax_abund2 <- calc_taxon_abund(para_deseq21, "otu_table",cols = para_deseq21$data$sample_data$sample_id) sd2 <- para_deseq21$data$sample_data deseq_tax_data2 <- DESeq2::DESeqDataSetFromMatrix(countData = para_deseq21$data$tax_abund2 [,-1], colData = sd, design = ~ type) gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x)) } geoMeans_reactoresDES_X2<-apply(DESeq2::counts(deseq_tax_data2), 1, gm_mean) deseq_tax_data2 = DESeq2::estimateSizeFactors(deseq_tax_data2,type = "poscount", geoMeans = geoMeans_reactoresDES_X2) deseq_tax_data2<-DESeq2::DESeq(deseq_tax_data2, test="Wald", fitType="parametric",parallel=F)
tax_GvS12 <- cbind(taxon_id = para_deseq21$data$tax_abund2$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data2, cooksCutoff = FALSE, contrast=c("type","gut","soil"))) tax_CvS12 <- cbind(taxon_id = para_deseq21$data$tax_abund2$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data2, cooksCutoff = FALSE, contrast=c("type","cast","soil"))) tax_GvC12 <- cbind(taxon_id = para_deseq21$data$tax_abund2$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data2, cooksCutoff = FALSE, contrast=c("type","cast","gut")))
tax_GvS12 <- tidyr::as_tibble(tax_GvS12) tax_GvS12$treatment_1 <- "gut" tax_GvS12$treatment_2 <- "soil"
tax_CvS12 <- tidyr::as_tibble(tax_CvS12) tax_CvS12$treatment_1 <- "cast" tax_CvS12$treatment_2 <- "soil"
tax_GvC12 <- tidyr::as_tibble(tax_GvC12) tax_GvC12$treatment_1 <- "cast" tax_GvC12$treatment_2 <- "gut"
create new list for para_deseq2$data
testM12 <- do.call ("rbind",list(tax_GvS12,tax_CvS12,tax_GvC12)) testM12$treatment_1 <- as.factor (testM12$treatment_1) # tried also as letting them as.character, did not work testM12$treatment_2 <- as.factor (testM12$treatment_2) testM12$padjustBIEN<-p.adjust(testM12$pvalue,method="BH") testM12$log2FoldChange[testM12$padjustBIEN > 0.05] <- 0 #assign to obj$data para_soilgut$data$testM <- testM12
adapted code
pyobj <- readRDS('~/Downloads/phyloseq_caliginosaNOPOOL.rds') para_deseq2<-parse_phyloseq(pyobj ) para_deseq2$data$tax_abund <- calc_taxon_abund(para_deseq2, "otu_table", cols = para_deseq2$data$sample_data$sample_id) pyobj <- readRDS('~/Downloads/phyloseq_caliginosaNOPOOL.rds') para_deseq2<-parse_phyloseq(pyobj) sd <- para_deseq2$data$sample_data all(sd$sample_id == colnames(para_deseq2$data$tax_abund)[-1])
Convert data to DESeq2 format and do differential abundance analysis
deseq_tax_data <- DESeq2::DESeqDataSetFromMatrix(countData = para_deseq2$data$tax_abund [-1], colData = sd, design = ~ type)
gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x)) } geoMeans_reactoresDES_X<-apply(DESeq2::counts(deseq_tax_data), 1, gm_mean) deseq_tax_data = DESeq2::estimateSizeFactors(deseq_tax_data,type = "poscount", geoMeans = geoMeans_reactoresDES_X) deseq_tax_data<-DESeq2::DESeq(deseq_tax_data, test="Wald", fitType="parametric",parallel=F)
tax_GvS <- cbind(taxon_id = para_deseq2$data$tax_abund$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data, cooksCutoff = FALSE, contrast=c("type","gut","soil"))) tax_CvS <- cbind(taxon_id = para_deseq2$data$tax_abund$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data, cooksCutoff = FALSE, contrast=c("type","cast","soil"))) tax_GvC <- cbind(taxon_id = para_deseq2$data$tax_abund$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data, cooksCutoff = FALSE, contrast=c("type","cast","gut")))
tax_GvS <- tidyr::as_tibble(tax_GvS) tax_GvS$treatment_1 <- "gut" tax_GvS$treatment_2 <- "soil"
tax_CvS <- tidyr::as_tibble(tax_CvS) tax_CvS$treatment_1 <- "cast" tax_CvS$treatment_2 <- "soil"
tax_GvC <- tidyr::as_tibble(tax_GvC) tax_GvC$treatment_1 <- "cast" tax_GvC$treatment_2 <- "gut"
create new list for para_deseq2$data
testM <- do.call ("rbind",list(tax_GvS,tax_CvS,tax_GvC)) testM$treatment_1 <- as.factor (testM$treatment_1) # tried also as letting them as.character, did not work testM$treatment_2 <- as.factor (testM$treatment_2) testM$padjustBIEN<-p.adjust(testM$pvalue,method="BH") testM$log2FoldChange[testM$padjustBIEN > 0.05] <- 0 #assign to obj$data para_deseq2$data$testM <- testM
adapted code
sd <- para_deseq2$data$sample_data deseq_tax_data1 <- DESeq2::DESeqDataSetFromMatrix(countData = para_deseq2$data$otu_table [,-(1:2)], colData = sd, design = ~ type)
gm_mean = function(x, na.rm=TRUE){ exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x)) } geoMeans_reactoresDES_X1<-apply(DESeq2::counts(deseq_tax_data1), 1, gm_mean) deseq_tax_data1 = DESeq2::estimateSizeFactors(deseq_tax_data1,type = "poscount", geoMeans = geoMeans_reactoresDES_X1) deseq_tax_data1<-DESeq2::DESeq(deseq_tax_data1, test="Wald", fitType="parametric",parallel=F)
tax_GvS1 <- cbind(taxon_id = para_deseq21$data$otu_table$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data1, cooksCutoff = FALSE, contrast=c("type","gut","soil"))) tax_CvS1 <- cbind(taxon_id = para_deseq21$data$otu_table$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data1, cooksCutoff = FALSE, contrast=c("type","cast","soil"))) tax_GvC1 <- cbind(taxon_id = para_deseq21$data$otu_table$taxon_id, # taxon id needed for plotting DESeq2::results(deseq_tax_data1, cooksCutoff = FALSE, contrast=c("type","cast","gut")))
tax_GvS1 <- tidyr::as_tibble(tax_GvS1) tax_GvS1$treatment_1 <- "gut" tax_GvS1$treatment_2 <- "soil"
tax_CvS1 <- tidyr::as_tibble(tax_CvS1) tax_CvS1$treatment_1 <- "cast" tax_CvS1$treatment_2 <- "soil"
tax_GvC1 <- tidyr::as_tibble(tax_GvC1) tax_GvC1$treatment_1 <- "cast" tax_GvC1$treatment_2 <- "gut"
create new list for para_deseq2$data
testM1 <- do.call ("rbind",list(tax_GvS1,tax_CvS1,tax_GvC1)) testM1$treatment_1 <- as.factor (testM1$treatment_1) # tried also as letting them as.character, did not work testM1$treatment_2 <- as.factor (testM1$treatment_2) testM1$padjustBIEN<-p.adjust(testM1$pvalue,method="BH")
I think the issue is that you are trying to plot per-ASV results on a heat tree, which only supports per-taxon data. Which is your goal? To look at deferentially abundant taxa or deferentially abundant ASVs? If ASVs, how do you want to encode that information per-taxon? Mean difference? Median? Number of deferentially abundant ASVs per taxon, etc?
A way I like to approach this is to add the ASV as a level in the taxonomy when you do the analysis, so that below species is a "taxonomic rank" called ASV. So before you parse your data to a taxmap
, add another taxonomic rank to the input data and put the names of the ASVs in there and you will do both the per-taxon and per-ASV analysis at the same time.
You can try this code I made to run deseq2 from a taxmap object. I have not gotten around to testing it enough to include it in the package yet, but it seems to work as far as I can tell. Might give you some ideas anyway.
#' Differential abundance with DESeq2
#'
#' EXPERIMENTAL: This function is still being tested and developed; use with caution. Uses the
#' \code{\link{DESeq2}} package to conduct differential abundance analysis of count data. Counts can
#' be of OTUs/ASVs or taxa. The plotting function \code{\link{heat_tree_matrix}} is useful for
#' visualizing these results. See details section below for considerations on preparing data for
#' this analysis.
#'
#' Data should be raw read counts, not rarefied, converted to proportions, or modified with any
#' other technique designed to correct for sample size since \code{\link{DESeq2}} is designed to be
#' used with count data and takes into accoutn unequal sample size when determining differential
#' abundance. Warnings will be given if the data is not integers or all sample sizes are equal.
#'
#' @param obj A \code{\link[taxa]{taxmap}} object
#' @param data The name of a table in \code{obj} that contains data for each sample in columns.
#' @param cols The names/indexes of columns in \code{data} to use. By default, all numeric columns
#' are used. Takes one of the following inputs: \describe{ \item{TRUE/FALSE:}{All/No columns will
#' used.} \item{Character vector:}{The names of columns to use} \item{Numeric vector:}{The indexes
#' of columns to use} \item{Vector of TRUE/FALSE of length equal to the number of columns:}{Use
#' the columns corresponding to \code{TRUE} values.} }
#' @param groups A vector defining how samples are grouped into "treatments". Must be the same order
#' and length as \code{cols}.
#' @param other_cols If \code{TRUE}, preserve all columns not in \code{cols} in the output. If
#' \code{FALSE}, dont keep other columns. If a column names or indexes are supplied, only preserve
#' those columns.
#' @param lfc_shrinkage What technique to use to adjust the log fold change results for low counts.
#' Useful for ranking and visualizing log fold changes. Must be one of the following:
#' \describe{
#' \item{'none'}{No log fold change adjustments.}
#' \item{'normal'}{The original DESeq2 shrinkage estimator}
#' \item{'ashr'}{Adaptive shrinkage estimator from the \code{\link{ashr}} package, using a fitted mixture of normals prior.}
#' }
#' @param ... Passed to \code{\link[DESeq2]{results}} if the \code{lfc_shrinkage} option is "none"
#' and to \code{\link[DESeq2]{lfcShrink}} otherwise.
#'
#' @return A tibble with at least the taxon ID of the thing tested, the groups compared, and the
#' DESeq2 results. The \code{log2FoldChange} values will be positive if \code{treatment_1} is more
#' abundant and \code{treatment_2}.
#'
#' @family calculations
#'
#' @examples
#' \dontrun{
#' # Parse data for plotting
#' x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
#' class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
#' class_regex = "^(.+)__(.+)$")
#'
#' # Get per-taxon counts
#' x$data$tax_table <- calc_taxon_abund(x, data = "tax_data", cols = hmp_samples$sample_id)
#'
#' # Calculate difference between groups
#' x$data$diff_table <- calc_diff_abund_deseq2(x, data = "tax_table",
#' cols = hmp_samples$sample_id,
#' groups = hmp_samples$body_site)
#'
#' # Plot results (might take a few minutes)
#' heat_tree_matrix(x,
#' data = "diff_table",
#' node_size = n_obs,
#' node_label = taxon_names,
#' node_color = ifelse(is.na(padj) | padj > 0.05, 0, log2FoldChange),
#' node_color_range = diverging_palette(),
#' node_color_trans = "linear",
#' node_color_interval = c(-3, 3),
#' edge_color_interval = c(-3, 3),
#' node_size_axis_label = "Number of OTUs",
#' node_color_axis_label = "Log2 fold change")
#'
#' }
#'
#' @export
calc_diff_abund_deseq2 <- function(obj, data, cols, groups, other_cols = FALSE,
lfc_shrinkage = c('none', 'normal', 'ashr'), ...) {
# Check that DESeq2 is intalled
if (! requireNamespace("DESeq2", quietly = TRUE)) {
stop('The "DESeq2" package needs to be installed for this function to work.',
call. = FALSE)
}
# Parse options
lfc_shrinkage <- match.arg(lfc_shrinkage)
# Get abundance by sample data
abund_data <- metacoder:::get_taxmap_table(obj, data)
# Parse columns to use
cols <- metacoder:::get_numeric_cols(obj, data, cols)
# Check that columns contain integers only
col_is_int <- vapply(cols, FUN.VALUE = logical(1), function(c) {
all(abund_data[[c]] %% 1 == 0)
})
non_int_cols <- cols[! col_is_int]
if (length(non_int_cols) > 0) {
stop(call. = FALSE,
'All data given to DESeq2 should be untransformed counts. The following columns contain non-integers:\n',
limited_print(type = 'silent', non_int_cols, prefix = ' '))
}
# Check for equal sample sizes
if (length(unique(colSums(abund_data[cols]))) == 1) {
warning(call. = FALSE,
'All columns have equal counts, suggesting counts were normalized (e.g. rarefied) to correct for sample size variation.',
' DESeq2 is designed to be used with unnormalized data.',
' Use untransformed counts if available.')
}
# Check groups option
groups <- metacoder:::check_option_groups(groups, cols)
# Find other columns
# These might be added back to the output later
other_cols <- metacoder:::get_taxmap_other_cols(obj, data, cols, other_cols)
# Get every combination of groups to compare
combinations <- t(utils::combn(unique(groups), 2))
combinations <- lapply(seq_len(nrow(combinations)), function(i) combinations[i, ])
# Format data for DESeq2
metadata <- data.frame(group = groups)
deseq_data <- DESeq2::DESeqDataSetFromMatrix(countData = abund_data[cols],
colData = metadata,
design = ~ group)
# Run DESeq2
deseq_result <- DESeq2::DESeq(deseq_data)
# Make function to compare one pair of groups
one_comparison <- function(treat_1, treat_2) {
# Extract results for a single pair
if (lfc_shrinkage == 'none') {
pair_result <- DESeq2::results(deseq_result, contrast = c('group', treat_1, treat_2), ...)
} else {
pair_result <- DESeq2::lfcShrink(deseq_result, contrast = c('group', treat_1, treat_2),
type = lfc_shrinkage, ...)
}
# Add treatments compared
output <- cbind(data.frame(stringsAsFactors = FALSE,
treatment_1 = treat_1,
treatment_2 = treat_2),
as.data.frame(pair_result))
# Add in other columns if specified
output <- cbind(abund_data[, other_cols], output)
return(output)
}
# Compare all pairs of treatments and combine
output <- lapply(seq_len(length(combinations)), function(i) {
one_comparison(combinations[[i]][1], combinations[[i]][2])
})
output <- do.call(rbind, output)
# Convert to tibble and return
dplyr::as.tbl(output)
}
I want to look at deferentially abundant ASVs. In fact, yesterday I tried doing it addind ASVs as the last "taxonomic" rank, and it worked (see pdf attached). However, as I commented before I cannot understand why I got 2971 taxa from an physeq with 2106 ASVs. pyobj <- readRDS('/Downloads/phyloseq_caliginosaNOPOOL.rds') datos2<-pyobj ### 2106 ASVs Species<-rownames(tax_table(datos2)) tax_table(datos2)<-cbind(tax_table(datos2),Species) para_deseq21<-parse_phyloseq(datos2) ## 2971 taxa and 2106 ASVs para_deseq21$data$tax_table <- calc_taxon_abund(para_deseq21, data = "otu_table", cols = para_soilgut$data$sample_data$sample_id) para_deseq21$data$diff_table <- calc_diff_abund_deseq2(para_deseq21, data = "tax_table", cols = para_soilgut$data$sample_data$sample_id, groups = para_soilgut$data$sample_data$type)
para_deseq21$data$diff_table$padjustBIEN<-p.adjust(para_deseq21$data$diff_table$pvalue,method="BH") para_deseq21$data$diff_table$log2FoldChange[para_deseq21$data$diff_table$padjustBIEN > 0.05] <- 0
I prefer to plot these data highlighting only positive and negative ASVs and left to supplementary a classical MAplot or the output table. heat_tree_matrix(para_deseq21, data = "diff_table", node_size = n_obs, node_label = taxon_names, node_color = ifelse(log2FoldChange == 0, "grey", ifelse(log2FoldChange<0, "navajowhite3","olivedrab4")), #node_color_range = diverging_palette(), #node_color_trans = "linear", #node_color_interval = c(-8, 27), # min and max from summary above #edge_color_interval = c(-8, 27), node_size_axis_label = "Number of OTUs", node_color_axis_label = "Log2 fold change", edge_label = n_obs, seed = 1, layout = "davidson-harel", initial_layout = "reingold-tilford")
I would suggest to "re-adjust" p-values due to multiple comparisons, maybe internally to function (output $padjustOK<-p.adjust(output $pvalue,method="fdr")), as a warning in the output ( # Convert to tibble and return dplyr::as.tbl(output) print "You should consider to adjutst your p-values due to multiple comparisons"), or as an option in the function (calc_diff_abund_deseq2 <- function(obj, data, cols, groups, other_cols = FALSE, multp_comp_correct,...). This could take TRUE (default) or (FALSE) and then apply or not the padjust before tibbleling the output
I prefer to plot these data
. However, as I commented before I cannot understand why I got 2971 taxa from an physeq with 2106 ASVs.
I am not sure I have the same data set as you. I see 6452 ASVs:
> dim(datos2@otu_table)
[1] 48 6452
If you included ASV as a rank in the taxonomy, then it would be expected that you wuld get more taxa than ASVs, but each ASV is a taxon and you still have all the internal taxa (e.g. "Streptomyces"). A "taxon" in taxmap
object is each unique classification, including taxa on the inside of the tree, even if no data is assigned to them, whereas in phyloseq, a row in the taxonomy table is a classificaiton with multiple taxa. Foe example
Kingdom Phylum Class Order Family Genus
ASV3885 "Bacteria" "unclassified" "unclassified" "unclassified" "unclassified" "unclassified"
is one row in a phyloseq object but is 6 taxa in a taxmap
object:
- "Bacteria"
- "Bacteria" "unclassified"
- "Bacteria" "unclassified" "unclassified"
- "Bacteria" "unclassified" "unclassified" "unclassified"
- "Bacteria" "unclassified" "unclassified" "unclassified" "unclassified"
- "Bacteria" "unclassified" "unclassified" "unclassified" "unclassified" "unclassified"
I prefer to plot these data highlighting only positive and negative ASVs and left to supplementary a classical MAplot or the output table.
You can filter taxa based on the p-value before plotting to do this.
I would suggest to "re-adjust" p-values due to multiple comparisons, maybe internally to function (output $padjustOK<-p.adjust(output $pvalue,method="fdr")), as a warning in the output ( # Convert to tibble and return dplyr::as.tbl(output) print "You should consider to adjutst your p-values due to multiple comparisons"), or as an option in the function (calc_diff_abund_deseq2 <- function(obj, data, cols, groups, other_cols = FALSE, multp_comp_correct,...). This could take TRUE (default) or (FALSE) and then apply or not the padjust before tibbleling the output
Yea, thats a good idea