
talk about how to troubleshoot or explore charcoal "predictions"

ctb opened this issue 4 years ago • 1 comment

It'd be super nice to have some ways for people to dig into charcoal's predictions of contamination. We can't provide this as part of the default charcoal workflow unless we want to provide access to the core GTDB genomes (which I don't want to do, at least not yet), but we can provide notebooks and workflows that rely on minimal setup to do the necessary work.

Of course, we have lots of ways to do this with sourmash, but it'd be nice to convey our top approaches to others in a way that is useful and executable!

I did this a little bit myself with sourmash-oddify, which runs nucmer and then produces summary reports.

@taylorreiter pointed out the mummer2circos software,

https://github.com/metagenlab/mummer2circos

and suggested that we could provide some approaches that use it to diagram alignments between contaminant contigs and reference genomes.

[attached image: SRR2857885_bin 40 nucmer2circos plot]

[attached image: SRR1793410_bin 1 nucmer2circos plot]

Specifically,

I'm thinking the MAG could be the reference, and then all of the genomes identified by charcoal could be aligned to it. It would show that, e.g., 90% of the MAG aligns to one reference, while the rest aligns to small portions of the contaminant genomes.
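As a rough sketch of that idea: after running `nucmer` and `show-coords -T` (MUMmer) with the MAG as query against the charcoal-identified genomes, something like the following could summarize what fraction of aligned sequence hits each reference. The column layout assumed here is the tab-delimited `show-coords -T` format (S1, E1, S2, E2, LEN1, LEN2, %IDY, reference tag, query tag), and the accession and contig names in the sample are hypothetical; overlapping alignments are double-counted in this simple version.

```python
from collections import defaultdict

def aligned_bases_per_reference(coords_text):
    """Sum aligned reference bases per reference sequence from
    tab-delimited `show-coords -T` output. Header lines (paths,
    NUCMER banner, column labels) are skipped; only rows whose
    first field is numeric are parsed."""
    totals = defaultdict(int)
    for line in coords_text.splitlines():
        fields = line.strip().split("\t")
        if not fields or not fields[0].isdigit():
            continue
        len1 = int(fields[4])   # aligned length on the reference
        ref = fields[7]         # reference sequence name
        totals[ref] += len1
    return dict(totals)

# Hypothetical coords output for a MAG aligned against two references:
sample = """\
/path/refs.fa /path/mag.fa
NUCMER

[S1]\t[E1]\t[S2]\t[E2]\t[LEN 1]\t[LEN 2]\t[% IDY]\t[TAGS]
1\t900000\t1\t900000\t900000\t900000\t99.2\tGCF_000001.1\tMAG_contig_1
5000\t105000\t1\t100001\t100001\t100001\t97.8\tGCF_000002.1\tMAG_contig_7
"""

totals = aligned_bases_per_reference(sample)
grand_total = sum(totals.values())
for ref, nbp in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{ref}\t{nbp}\t{nbp / grand_total:.1%}")
```

In the "90% to one reference" scenario described above, this table would show one dominant reference and a long tail of small contaminant alignments.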

ctb commented on Aug 28, 2020

One suggestion is to build a workflow or tutorial around anvi'o for further inspection.

Another suggestion is to check the distribution of potentially contaminated hashes across the genome. If they are concentrated in a single fragmented contig, that is indicative of contamination; if they are spread throughout the genome, that might be a sign that they are not contamination. This is already flagged indirectly by "amount of sequence removed" (since more sequence will be removed when hashes are spread throughout the genome), but we could be more explicit here. Some of the gtdb-contam-dna genomes might be good test cases.
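A minimal sketch of that per-contig check, using a toy FracMinHash as a stand-in for sourmash's scaled MinHash (sourmash hashes canonical k-mers with MurmurHash64; md5 is used here only to keep the example self-contained). The contig names and the `contamination_profile` helper are hypothetical, not part of charcoal.

```python
import hashlib

def kmer_hashes(seq, ksize=21, scaled=1000):
    """Toy FracMinHash: keep a k-mer's 64-bit hash only if it falls
    below 2**64 / scaled. Illustrative stand-in for sourmash sketches."""
    max_hash = 2**64 // scaled
    hashes = set()
    for i in range(len(seq) - ksize + 1):
        kmer = seq[i:i + ksize].upper()
        h = int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")
        if h < max_hash:
            hashes.add(h)
    return hashes

def contamination_profile(contigs, contam_hashes, **kw):
    """For each contig, count how many of its hashes are flagged as
    contaminant. Flagged hashes piling up in one contig suggest real
    contamination; an even spread across contigs is more ambiguous."""
    profile = {}
    for name, seq in contigs.items():
        mine = kmer_hashes(seq, **kw)
        profile[name] = (len(mine & contam_hashes), len(mine))
    return profile
```

Running this on a genome plus the set of hashes charcoal flagged would give a (flagged, total) pair per contig, which could then be plotted or summarized to distinguish the "one bad contig" case from the "spread everywhere" case.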

A third question/suggestion: should we articulate the principle that, at least with the parameters we've chosen, users should be looking for reasons to justify keeping the flagged contigs, rather than looking further for reasons to justify discarding them? I feel the conversation around MAGs has been flipped towards "they're probably good" rather than "this is the output of a computational hypothesis that needs to be further justified."

ctb commented on Sep 14, 2020