metacoder
metacoder copied to clipboard
filtering on Log2FoldChange (question)
Hi again,
first of all, this package is amazing!
I'm dealing with heat_tree_matrix on data obtained with calc_diff_abund_deseq2; what should I do on the diff_table in order to
keep in consideration only those taxa with Log2FoldChange > 2.5 (upregulated) and < -2.5 (downregulated)?
Many thanks in advance! Cheers Stefano
Hi Stefano,
Thanks!
Since heat tree matrix supports multiple comparisons you have to consider how to handle cases where a taxon has higher than a 2.5 fold change in one comparison but not another. The example below includes taxa if they are > 2.5 in any comparison.
Try this:
library(metacoder)
# Parse data for plotting
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
class_regex = "^(.+)__(.+)$")
# Get per-taxon counts
x$data$tax_table <- calc_taxon_abund(x, data = "tax_data", cols = hmp_samples$sample_id)
# Calculate difference between groups
x$data$diff_table <- calc_diff_abund_deseq2(x, data = "tax_table",
cols = hmp_samples$sample_id,
groups = hmp_samples$body_site)
# Remove taxa with only small differences
per_taxon_fold_changes <- obs(x, data = 'diff_table', value = 'log2FoldChange')
per_taxon_max_change <- unlist(lapply(per_taxon_fold_changes, function(tax_changes) max(abs(tax_changes))))
x <- filter_taxa(x, per_taxon_max_change > 2.5, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
I saw you posted this to the google groups and in that question you asked about filtering out NAs as well. You can use the same technique, but again you have to consider if a NA in one comparison should cause the taxon to be removed from all comparisons.
Thanks for the explanation Zachary! In the case of only two comparisons, I can identify my differentially abundant taxa as follows, i.e. padj > 0.05 and Log2FoldChange > 2.5 (upregulated) and < -2.5 (downregulated):
signALL<-obj.physeq.TvS$data$tax_T_v_S %>% filter((padj > 0.05 & log2FoldChange < -2.5) | (padj > 0.05 & log2FoldChange > 2.5))
I was wondering how to implement this filter in a filter_taxa function, so to pass to heat_tree only those taxa really significant (and just not plotting them out of the total)
To take into account the p-values, you could modify the code like so:
library(metacoder)
# Parse data for plotting
x = parse_tax_data(hmp_otus, class_cols = "lineage", class_sep = ";",
class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"),
class_regex = "^(.+)__(.+)$")
# Get per-taxon counts
x$data$tax_table <- calc_taxon_abund(x, data = "tax_data", cols = hmp_samples$sample_id)
# Calculate difference between groups
x$data$diff_table <- calc_diff_abund_deseq2(x, data = "tax_table",
cols = hmp_samples$sample_id,
groups = hmp_samples$body_site)
# Remove taxa with only small differences
per_taxon_fold_changes <- obs(x, data = 'diff_table', value = 'log2FoldChange')
per_taxon_p_values <- obs(x, data = 'diff_table', value = 'pvalue')
per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) {
if (all(per_taxon_p_values[[i]] > 0.05)) {
return(0)
} else {
return(max(abs(per_taxon_fold_changes[[i]][per_taxon_p_values[[i]] <= 0.05])))
}
}))
x <- filter_taxa(x, per_taxon_max_change > 2.5, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
# Plot results (might take a few minutes)
heat_tree_matrix(x,
data = "diff_table",
node_size = n_obs,
node_label = taxon_names,
node_color = ifelse(is.na(padj) | padj > 0.05, 0, log2FoldChange),
node_color_range = diverging_palette(),
node_color_trans = "linear",
node_color_interval = c(-3, 3),
edge_color_interval = c(-3, 3),
node_size_axis_label = "Number of OTUs",
node_color_axis_label = "Log2 fold change")
This would preserve only taxa that have a change (positive or negative) of greater than 2.5 associated with a p value less than .05 in at least one comparison. Does this do what you want?
Hello!
I found this discussion very helpful and I am trying to mimic these steps to create a similar heat tree. I think the problem that I am running into is that my taxa have zero values for at least one one of my samples (but I could be wrong - very amateur at R), resulting in the following message when I run the "calc_diff_abund_deseq2" command:
converting counts to integer mode estimating size factors Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc, : every gene contains at least one zero, cannot compute log geometric means
I read on another discussion board that this can be avoided by adding pseudo-counts, or adding +1, to each value in the otu table. Is there a quick way that you can think to do this? I imagine this will work out fine, as this will not change which taxa are significantly different.
Thanks in advance! Brett
I have not had that problem personally that I remember, but I imagine you have 3 options: 1) filter the input data in some way to remove low-abundance samples. Do you have a negative control with all 0s for example? 2) Add some small amount like 1 to all counts like you said. This is pretty easy to do. It depends a bit on your data structure, but might be as simple as my_table[my_cols] <- my_table[my_cols] + 1
, where my_cols
contains the subset of columns with counts (could be a column in your sample metadata table, if you have one). 3) Use a different method for differential taxon abundance.
Thanks for the quick response!
I had initially followed the Metacoder workshop tutorial (only using my own data) so the names of my files and the structure are similar. I already removed all of the taxa that had no values (e.g. 0 for all samples) and then tried this:
obj$data$otu_counts <- obj$data$otu_counts + 1 Error in FUN(left, right) : non-numeric argument to binary operator
Sorry - really no clue what I'm doing here... but I really appreciate the help!
No problem!
The way that you would do it assuming you have data structured like the workshop would be something like:
obj$data$otu_counts[sample_data$SampleID] <- obj$data$otu_counts[sample_data$SampleID] + 1
Basically, if you add 1 to a table (or matrix), R guesses that you want to add 1 to all the values. However, that might not make sense if some of the columns are not numeric. So what you need to do is just replace the data for the numeric columns with the incremented version, and leave the non-numeric columns alone. Here sample_data$SampleID
is a column in the sample_data
table that corresponds to the column names (sample names) in the abundance matrix.
Referring to the columns for index would also work, but is less readable (e.g. obj$data$otu_counts[1:12345] <- obj$data$otu_counts[1:12345] + 1
)
Amazing - it worked! In the home stretch now :)
Unfortunately when I try to generate the heat tree, now this happens:
> ##Removes taxa with small/not significant differences##
> per_taxon_fold_changes <- obs(obj, data = 'diff_table', value = 'log2FoldChange')
> per_taxon_p_values <- obs(obj, data = 'diff_table', value = 'pvalue')
> per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) {
+ if (all(per_taxon_p_values[[i]] > 0.05)) {
+ return(0)
+ } else {
+ return(max(abs(per_taxon_fold_changes[[i]][per_taxon_p_values[[i]] <= 0.05])))
+ }
+ }))
> obj <- filter_taxa(obj, per_taxon_max_change > 1.2, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
> ##Builds heat tree##
> heat_tree_matrix(obj,
+ data = "diff_table",
+ node_size = n_obs,
+ node_label = taxon_names,
+ node_color = ifelse(is.na(padj) | padj > 0.05, 0, log2FoldChange),
+ node_color_range = diverging_palette(),
+ node_color_trans = "linear",
+ node_color_interval = c(-3, 3),
+ edge_color_interval = c(-3, 3),)
Error: All pairs being compared should have one value per taxon (142). The following do not:
Vehicle vs. Chemo (815)
Did I use the wrong part of the data as input...? Maybe I am entering multiple values on accident...
Thanks again!
That error happens when the data being plotted is not one value per taxon for at least one comparison. Its saying that there are 142 taxa to be plotted, but "diff_table" has 815 rows for one comparison. At some point between calculating taxon differences and plotting, you filtered out taxa without filtering out filtering the differences in "diff_table" in the same way. I cant tell how that might have happened. I suggest check how the number of taxa and number of rows per comparison in "diff_table" change as the code progresses to find which step is causing the problem.
Okay, I took out any filtering steps that I added (like the steps from the tutorial that filter out non-Bacteria and OTUs that had all zero counts) and edited my commands to resemble the above script (posted on Jan 7th)
as closely as possible. This did remove the need for adding the pseduo-counts and the error regarding too many rows for comparison, but now I have a new problem. Everything runs without error until the "heat_tree_matrix()", which returns the error "Error in apply(layout_matrix, MARGIN = 2, function(x) all(is.na(x))) : dim(X) must have a positive length"
I'm copying my code here (and happy to share the files if need-be) - any ideas on what's wrong now? I ultimately just want to show the taxonomic differences (in both positive and negative directions above a determined fold-difference) between my control and treatment group on (perhaps) a single heat tree. I'm just trying to do some hypothesis generation, so I'm also okay with using the un-corrected pvalues (for now). Am I going about this all wrong?
Thanks a TON!!!
##loads readr to use "read_tsv"## library(readr) ##imports OTU table## otu_data <- read_tsv("/Volumes/Secure KeyZ/Chemo_BA/16S/Chemo_For_HeatTrees/ChemoDose_Bioms/DC_Biom.txt") ##imports Taxonomy## tax_data <- read_tsv("/Volumes/Secure KeyZ/Chemo_BA/16S/Chemo_For_HeatTrees/ChemoDose_Taxonomy/taxonomy_edited.txt") ##loads diplyr to use "left_join"## library(dplyr) ##merges Taxonomy and OTU table## otu_data <- left_join(otu_data, tax_data, by = c("OTU_ID" = "OTU_ID")) ##loads necessary packages## library(taxa) library(metacoder) ##imports Metadata## sample_data <- read_tsv("/Volumes/Secure KeyZ/Chemo_BA/16S/DC_sample_data.txt") ##Parse data for plotting## x = parse_tax_data(otu_data, class_cols = "Taxonomy", class_sep = ";", class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"), class_regex = "^([a-z]{0,1})_{0,2}(.*)$")
Get per-taxon counts
x$data$tax_table <- calc_taxon_abund(x, data = "tax_data", cols = sample_data$Sample_ID)
Calculate difference between groups
x$data$diff_table <- calc_diff_abund_deseq2(x, data = "tax_table", cols = sample_data$Sample_ID, groups = sample_data$Treatment)
Remove taxa with only small differences
per_taxon_fold_changes <- obs(x, data = 'diff_table', value = 'log2FoldChange') per_taxon_p_values <- obs(x, data = 'diff_table', value = 'pvalue') per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) { if (all(per_taxon_p_values[[i]] > 0.05)) { return(0) } else { return(max(abs(per_taxon_fold_changes[[i]][per_taxon_p_values[[i]] <= 0.05]))) } })) x <- filter_taxa(x, per_taxon_max_change > 2.5, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
Plot results (might take a few minutes)
heat_tree_matrix(x, data = "diff_table", node_size = n_obs, node_label = taxon_names, node_color = ifelse(is.na(padj) | padj > 0.05, 0, log2FoldChange), node_color_range = diverging_palette(), node_color_trans = "linear", node_color_interval = c(-3, 3), edge_color_interval = c(-3, 3), node_size_axis_label = "Number of OTUs", node_color_axis_label = "Log2 fold change")
Its hard to tell without the data to test. Can you send your code in an .R file and enough data to reproduce the problem so I can see what is going on? Thanks!
Hi Zachary, sorry for extremely slow reply,
I saw you posted this to the google groups and in that question you asked about filtering out NAs as well. You can use the same technique, but again you have to consider if a NA in one comparison should cause the taxon to be removed from all comparisons.
This was resolved when I realized to work with a rarefied otu_table; reverting back to raw count solved everything. Cheers.
per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) {
if (all(per_taxon_p_values[[i]] > 0.05)) {
return(0)
} else {
return(max(abs(per_taxon_fold_changes[[i]][per_taxon_p_values[[i]] <= 0.05])))
}
}))
x <- filter_taxa(x, per_taxon_max_change > 2.5, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
This would preserve only taxa that have a change (positive or negative) of greater than 2.5 associated with a p value less than .05 in at least one comparison. Does this do what you want?
In theory, YES. In practice, I'm getting an error when running that step:
Error in if (all(per_taxon_p_values[[i]] > 0.05)) { :
missing value where TRUE/FALSE needed
Called from: FUN(X[[i]], ...)
what's going on?
S.-
...EDIT... PS. OOK, I have NAs in per_taxon_p_values ;how can i deal with it?
Many R functions have the na.rm
option to ignore NAs, including all
and max
, so this would probably work:
per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) {
if (all(per_taxon_p_values[[i]] > 0.05, na.rm = TRUE)) {
return(0)
} else {
return(max(abs(per_taxon_fold_changes[[i]][per_taxon_p_values[[i]] <= 0.05]), na.rm = TRUE))
}
}))
x <- filter_taxa(x, per_taxon_max_change > 2.5, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
You could also do something like:
per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) {
pvals <- per_taxon_p_values[[i]]
pvals <- pvals[! is.na(pvals)]
if (length(pvals) == 0 || all(pvals > 0.05)) {
return(0)
} else {
return(max(abs(per_taxon_fold_changes[[i]][pvals <= 0.05])))
}
}))
x <- filter_taxa(x, per_taxon_max_change > 2.5, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
Thanks Zachary for the follow-op. Both functions run without warnings or errors, but I get them downstream.
- first function version
> per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) {
+ if (all(per_taxon_p_values[[i]] > 0.05, na.rm = TRUE)) {
+ return(0)
+ } else {
+ return(max(abs(per_taxon_fold_changes[[i]][per_taxon_p_values[[i]] <= 0.05]), na.rm = TRUE))
+ }
+ }))
> x <- filter_taxa(x, per_taxon_max_change > 2.5,
+ supertaxa = TRUE,
+ reassign_obs = c(diff_table = FALSE))
**Warning message:
There is no "taxon_id" column in the data set "3", so there are no taxon IDs.**
> heat_tree_matrix(x,
+ data = "diff_table",
+ node_size = n_obs,
+ node_label = taxon_names,
+ node_color = ifelse(is.na(padj) | padj > 0.05, 0, log2FoldChange),
+ node_color_range = diverging_palette(),
+ node_color_trans = "linear",
+ node_color_interval = c(-3, 3),
+ edge_color_interval = c(-3, 3),
+ node_size_axis_label = "Number of OTUs",
+ node_color_axis_label = "Log2 fold change")
**Error in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
VECTOR_ELT() can only be applied to a 'list', not a 'NULL'**
- second function version
> per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) {
+ pvals <- per_taxon_p_values[[i]]
+ pvals <- pvals[! is.na(pvals)]
+ if (length(pvals) == 0 || all(pvals > 0.05)) {
+ return(0)
+ } else {
+ return(max(abs(per_taxon_fold_changes[[i]][pvals <= 0.05])))
+ }
+ }))
> x <- filter_taxa(x, per_taxon_max_change > 2.5, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
Error: TRUE/FALSE vector (length = 1352) must be the same length as the number of taxa (760)
thanks for your attention s.-
I just got the VECTOR_ELT() can only be applied to a 'list', not a 'NULL'**
error too yesterday. I reinstalled metacoder
from and it went away. Might be a issue with a dependency. For the 2nd error, did you by any chance try to filter the object x
that was already filtered by the first chunk of code?
Hi @zachary-foster , i am using metacoder to visualise groups that are differentially abundant and significant between nursery and all other forest compartments, I have 21 forest compartments in total. I would like to use the nursery as the baseline for comparison against all the other forest compartments such that the nursery is one group and the other group is all the forest compartments (21 in total). My end goal is to end up with a plot showing nursery against all other forest compartments comparisons...Would the code here work for these comparisons? I have been able to group them into these two categories by splitting the phyloseq object and generating a figure but i'd like to check if this would be the best way to do it or would you recommend doing individual forest compartments vs nursery. My code below for nursery vs all other forest compartments that i've grouped under two separate categories
Extract abundance matrix from the phyloseq object
OTU_nursery = as(otu_table(ps.nursery), "matrix")
Coerce to data.frame
OTU_nurserydf = as.data.frame(t(OTU_nursery))
OTU_nurserydf <- OTU_nurserydf %>% rownames_to_column(var="ASVs")
#Pull out taxonomy info from phyloseq for metacoder plot
Nursery_taxa_root_only_decontam_noNTC_rar=data.frame(phyloseq::tax_table(ps.nursery))
#write csv and edit names accordingly. For metacoder Had to format taxonomy table, each taxonomic level has to have the letter of the level in front of it, for example Kingdom column should have k__ at the front : k__Bacteria. Also need to add a column named Taxonomy with collated into from kingdom to genus separated by ;
write.csv(Nursery_taxa_root_only_decontam_noNTC_rar,file="/raid/projects/scratch/TRMP/RA_2/PM8_1/SG_Nursery_plant_out_trials/R_analysis/Phyloseq_16S_condensed/16S_nursery_Taxa_root_only_decontam_noNTC_rar_forMETACODER.csv")
#Read the edited and formatted taxonomy file Taxa_nursery=read.csv("/raid/projects/scratch/TRMP/RA_2/PM8_1/SG_Nursery_plant_out_trials/R_analysis/Phyloseq_16S_condensed/16S_nursery_Taxa_root_only_decontam_noNTC_rar_forMETACODER_edited.csv")
##Repeat above for for all forest comp except nursery
#All forest compartment without nursery ps.all.except.nursery <- subset_samples(ps_aftercontam_removal_noNTC_root_rar_p, Forest.comp.sort != "0_1") ps.all.except.nursery<- prune_taxa(taxa_sums(ps.all.except.nursery) > 0, ps.all.except.nursery) # prune OTUs that are not present in at least one sample
#Create OTU table
Extract abundance matrix from the phyloseq object
OTU_all_except_nursery = as(otu_table(ps.all.except.nursery), "matrix")
Coerce to data.frame
OTU_all_except_nurserydf = as.data.frame(t(OTU_all_except_nursery))
OTU_all_except_nurserydf <- OTU_all_except_nurserydf %>% rownames_to_column(var="ASVs")
write.csv(OTU_all_except_nurserydf,file="/raid/projects/scratch/TRMP/RA_2/PM8_1/SG_Nursery_plant_out_trials/R_analysis/Phyloseq_16S_condensed/16S_allexcept_nursery_ASVdf_root_only_decontam_noNTC_rar_forMETACODER.csv")
#Repeat for taxonomy table
#Pull out taxonomy info from phyloseq for metacoder plot
All_except_nursery_taxa_root_only_decontam_noNTC_rar=data.frame(phyloseq::tax_table(ps.all.except.nursery))
#write csv and edit names accordingly. For metacoder Had to format taxonomy table, each taxonomic level has to have the letter of the level in front of it, for example Kingdom column should have k__ at the front : k__Bacteria. Also need to add a column named Taxonomy with collated into from kingdom to genus separated by ;
write.csv(All_except_nursery_taxa_root_only_decontam_noNTC_rar,file="/raid/projects/scratch/TRMP/RA_2/PM8_1/SG_Nursery_plant_out_trials/R_analysis/Phyloseq_16S_condensed/16S_allexcept_nursery_Taxa_root_only_decontam_noNTC_rar_forMETACODER.csv")
#Read the edited and formatted taxonomy file Taxa_allexcept_nursery=read.csv("/raid/projects/scratch/TRMP/RA_2/PM8_1/SG_Nursery_plant_out_trials/R_analysis/Phyloseq_16S_condensed/16S_allexcept_nursery_Taxa_root_only_decontam_noNTC_rar_forMETACODER_edited.csv")
Add categories for comparison
Taxa_allexcept_nursery$Category="All Forest compartments" Taxa_nursery$Category="Nursery"
#Bind the two df Taxa_nursery_allFC=rbind(Taxa_nursery,Taxa_allexcept_nursery)
#Merge taxonomy and OTU table
Taxa_nursery_allFC$ASVs
<- as.character(Taxa_nursery_allFC$ASVs
) # Must be same type for join to work
OTU2_root_only_decontam_noNTC_rar$ASVs <- as.character(OTU2_root_only_decontam_noNTC_rar$ASVs) # Must be same type for join to work
OTU_root_df <- left_join(OTU2_root_only_decontam_noNTC_rar, Taxa_nursery_allFC,
by = c("ASVs")) # identifies cols with shared IDs
print(OTU_root_df)
tail(colnames(OTU_root_df), n = 10)
#Read sample data #We want to only keep root in sample data, remove samples that are not in the physeq originally NIR16S_T_9_LCHV7, SIR16S_MC2_C1_LCHV7. Root_sample_data=subset.data.frame(Root_sample_data,Type=="Root") Root_sample_data=subset.data.frame(Root_sample_data,!SampleID%in% c("NIR16S_T_9_LCHV7","SIR16S_MC2_C1_LCHV7"))
#add category for sample data -nursery and all FC Root_sample_data_ed=subset.data.frame(Root_sample_data,Forest.compartment.ID=="Tokoroa nursery") #nursery metadata Root_sample_data_ed2=subset.data.frame(Root_sample_data,Forest.compartment.ID!="Tokoroa nursery") #all other forest comp metadata
#Add category to respective df and recombine Root_sample_data_ed$Category="Nursery" Root_sample_data_ed2$Category="All Forest compartments"
#Bind the two df Root_sample_data_ed3=rbind(Root_sample_data_ed,Root_sample_data_ed2)
##Parse data for plotting## x = parse_tax_data(OTU_root_df, class_cols = "Taxonomy", class_sep = ";", class_key = c(tax_rank = "taxon_rank", tax_name = "taxon_name"), class_regex = "^([a-z]{0,1})_{0,2}(.*)$")
#Get per taxon counts x$data$tax_table <- calc_taxon_abund(x, data = "tax_data", cols = Root_sample_data_ed3$SampleID)
#Calculate difference between groups x$data$diff_table <- calc_diff_abund_deseq2(x, data = "tax_table", cols = Root_sample_data_ed3$SampleID, groups = Root_sample_data_ed3$Category)
#Remove taxa with only small differences per_taxon_fold_changes <- obs(x, data = 'diff_table', value = 'log2FoldChange') per_taxon_p_values <- obs(x, data = 'diff_table', value = 'pvalue') per_taxon_max_change <- unlist(lapply(seq_along(per_taxon_fold_changes), function(i) { if (all(per_taxon_p_values[[i]] > 0.05, na.rm = TRUE)) { return(0) } else { return(max(abs(per_taxon_fold_changes[[i]][per_taxon_p_values[[i]] <= 0.05]), na.rm = TRUE)) } })) x <- filter_taxa(x, per_taxon_max_change > 2.5, supertaxa = TRUE, reassign_obs = c(diff_table = FALSE))
#Plot results heat_tree_matrix(x, data = "diff_table", node_size = n_obs, node_label = taxon_names, node_color = ifelse(is.na(padj) | padj > 0.05, 0, log2FoldChange), node_color_range = diverging_palette(), node_color_trans = "linear", node_color_interval = c(-3, 3), edge_color_interval = c(-3, 3), node_size_axis_label = "Number of OTUs", node_color_axis_label = "Log2 fold change")
I also noticed that the phyloseq before the split contains 7573 ASV but after splitting and recombining them its showing up as 7910 in the plot attached, not sure why ? and is there a way to adjust the wording in the plots so its more visible? Can you help please? Thanks !