metaGEM
metaGEM copied to clipboard
MAG abundance across samples
Hi! It seems the 'abundance' rule maps each sample's reads against the bins generated from that same sample. Reads are not mapped against bins generated from other samples. Is this understanding correct?
If so - why not create one output directory of dereplicated bins from the whole analysis (i.e. flatten the output from reassembled_bins/{sampleID}/reassembled_bins), and map each sample's reads against that directory? Otherwise, how else could you evaluate the abundance of a single species among samples?
I am testing this with the outputs from a relatively unsuccessful analysis (still running the pipeline with better assembly parameters). I input 48 bins (which were mostly very low quality, minimum completeness only 15% to ensure every sample would pass through the metaGEM pipeline). These 48 were combined into only 2 genome clusters. I do expect the results to change when I input better genomes.
Here is how I just tried implementing this using anvi'o. This R code creates the anvi'o input file (with genome names and file paths):
b_directories <- list.files("./reassembled_bins/", pattern = "HARV", full.names = T)
out <- data.frame(name = NULL, path = NULL)
for (sample in b_directories){
samp.bins <- list.files(paste0(sample, "/reassembled_bins"), pattern=".fa", full.names = T)
bin_names <- paste0(basename(sample), ".", basename(samp.out))
samp.bin.info <- cbind(name = bin_names, path = samp.out)
out <- rbind(out, samp.bin.info)
}
write_tsv(out, "./external-genomes.tsv")
Then, from within the same metaGEM directory:
module load fastani
module load miniconda
module load anvio
conda activate anvio-7.0
anvi-dereplicate-genomes -f external-genomes.tsv -o derep_genomes/ —-skip-fasta-report --program fastANI --similarity-threshold 0.85 —-representative-method Qscore
The two bins selected as "representative" for each cluster could then be used for mapping. Do you think this is a sound approach?
Hi Zoey,
That is correct, the abundance
rule will only quantify the abundance of MAGs within the sample they are generated from. The assumption/rationale is that users will also assign taxonomy to each MAG, and so you can evaluate the abundance of species across samples. This is how we generated sup. fig. 5 in the paper, where we look at the abundance of community members across a time series:
However, I can see how this may lead to issues for novel/unknown species or simply even MAGs that do not get species level assignment. I think that your approach defintely makes sense, and would likely be the best course of action for such cases. I have created an issue (#68) for adding dRep
to the workflow for MAG dereplication.
Please let me know if you have further questions! Best, Francisco