metacoder
metacoder copied to clipboard
Heattree with continuous data / any arbitrary color vector
Hi Grunwaldlab, Thanks for the amazing package. It's beautiful.
Quick question (has come up before here, but no appropriate solutions found as far as I can tell): I'm running an analysis on a phyloseq object which has associated coefficients and p-values from an mvabund analysis along a continuous gradient (could be DESEQ2 as well). Hence each taxon has a coefficient associated with it.
How to plot this in a heattree? In something analagous to compare_groups, but along a continuous gradient. Basically, it should be simpler, because the colors should just be able to be colored according to any arbitrary vector, right? However, I can't seem to get this to work.
I've added a dataset to the metacoder object, lined up taxon_id rows properly, but can't seem to call it from heat_tree.
I appreciate your help & will happily upload some code if that'll be helpful.
Hello,
Thanks!
because the colors should just be able to be colored according to any arbitrary vector, right?
Yea, you should be able to use any numeric vector associated with taxa in a 1 to 1 relationship.
upload some code if that'll be helpful.
Yea, it would be helpful if you could upload some code and a sample of the associated data so I can get a better idea what the goal and problem are. Thanks!
Hi Zachary, Thanks for writing back! This is the best case scenario: if the package has the capability, then it just means I'm doing something wrong! I've been mucking about with it for a few days so I must be missing something.
Briefly: this is a dataset of ITS ASVs along riparian zones. I'm testing the effect of salmon density (continuous variable) on neighbouring soils. I've run the ASV tables through mvabund (or other pipelines), and thus each ASV has an associated 'salmon_effect' coefficient and p-values. I've taken them out and added them to the tax_ data slot in the metacoder object (and tried a few other things). Basically: I'm not sure where the 'coefficient' data should go!
Here's a reprex of a similar situation that gives me identical errors, faking my coefficient data with rnorm function: (apologies if I've screwed this up, it's my first time posting a question to github and I'm still learning!)
Reprex:
library(phyloseq) library(metacoder) library(reprex)
obj <- parse_tax_data(hmp_otus, class_cols = "lineage", # the column that contains taxonomic information class_sep = ";", # The character used to separate taxa in the classification class_regex = "^(.+)__(.+)$", # Regex identifying where the data for each taxon is class_key = c(tax_rank = "info", # A key describing each regex capture group tax_name = "taxon_name"))
set.seed(1) # This makes the plot appear the same each time it is run heat_tree(obj, node_label = taxon_names, node_size = n_obs, node_color = n_obs, node_size_axis_label = "OTU count", node_color_axis_label = "Samples with reads", layout = "davidson-harel", # The primary layout algorithm initial_layout = "reingold-tilford") # The layout algorithm that initializes node locations
This works great!
########
Now Add a vector to tax_data: (Q: Where should this go?)
length(obj$data$tax_data$otu_id) effect_coef<-rnorm(1000,mean=0,sd=1)
obj$data$tax_data$effect<-effect_coef
set.seed(1) # This makes the plot appear the same each time it is run heat_tree(obj, node_label = taxon_names, node_size = n_obs, node_color = effect_coef, node_size_axis_label = "OTU count", node_color_axis_label = "Samples with reads", layout = "davidson-harel", # The primary layout algorithm initial_layout = "reingold-tilford") # The layout algorithm that initializes node locations
Gets error: "Error in check_element_length(c("node_size", "edge_size", "node_label_size", :
Length of argument'node_color' must be a factor of the length of 'taxon_id'
heat_tree(obj, node_label = taxon_names, node_size = n_obs, node_color = obj$data$tax_data$effect_coef, node_size_axis_label = "OTU count", node_color_axis_label = "Samples with reads", layout = "davidson-harel", # The primary layout algorithm initial_layout = "reingold-tilford") # The layout algorithm that initializes node locations
Gets another error
Thanks for the reprex! Sorry for the delay. The issue is that you are trying to plot values that are not per-taxon. Your ASVs are associated with taxa, but the data you want to plot is associated with the ASVs, so you need to decide on a way to combine the per-ASVs coefficient/p-values into per-taxon values, since each taxon will have one or more ASVs assigned to it. The "tax_data" dataset is per-ASV ("tax_data" is just a default name that is kindof deceptive in this case), so plotting columns for that table does not work.
I can think of two ways to do this, but there are probably others:
- Calculate a per-taxon value from the ASV data by calculating the mean coefficient per taxon for ASVs that have a significant p-value. You can add the coefficients and p-values to the per-ASV data table like you did in the reprex, then use the
obs(obj, data = 'tax_data')function to get the row indexes for the ASVs assigned to each taxon. The result ofobs(obj, data = 'tax_data')can be looped over to calculate a per-taxon mean coefficient with something like:
unlist(lapply(obs(obj, data = 'tax_data'), function(i) {
taxon_coeff <- obj$data$tax_data$my_coeff[i] # Use your col names
taxon_pval <- obj$data$tax_data$my_p_value[i] # Use your col names
if (all(taxon_pval > 0.05)) {
return(0)
} else {
return(mean(taxon_coeff[taxon_pval <= 0.05]))
}
}))
You can add the result of that to obj$data and use it in the plot.
- Rerun the mvabund tests with per-taxon counts (like those produced with
calc_taxon_abund) and plot coefficients that have significant p-values.
Let me know if you have questions
THANK YOU FOR THIS ZACH!
You are correct in that my ASVs were assigned multiple taxons at the 'species' level. I had initially run a model similar to your approach # 2, but didn't appreciate that multiple ASVs were assigned single species taxon_ids. This led me to read up more closely on how this is handled in your 'taxa' package and I need to say that package is beautiful too. I look forward to using it extensively in future!
If helpful, what would have made this clearer to me as a first-time user would be the error code:
Error in check_element_length(c("node_size", "edge_size", "node_label_size", : Length of argument'node_color' must be a factor of the length of 'taxon_id'
Be changed to something involving 'length of unique taxon_id' (although this is clearer once one understands how taxa handles the ids).
I've currently implemented your solution # 1 and things are working well. I'll play around with it and may modify it to weight coefficients by relative abundance or something like that - if I find this helpful I'll upload my code for others once I iron the bugs out.
With care & thanks again, Allen
No problem! I am glad it worked.
I had initially run a model similar to your approach # 2, but didn't appreciate that multiple ASVs were assigned single species taxon_ids.
Yea, that's a common point of confusion. Eventually I want to make things automatically convert from per-ASV to per-taxon data, but I need to think of a good way to do it. Anyway, its better to do it manually so you can pick the best conversion method for the study.
This led me to read up more closely on how this is handled in your 'taxa' package...
Thanks! We are working on a complete rewrite of taxa as well, so some things will change in the future
If helpful, what would have made this clearer to me as a first-time user would be the error code:
Yea, good idea, I will change the wording.