next steps for charcoal - some thoughts

as of the merge of #160, charcoal now performs the following steps -

find candidate genomes for contamination analysis, based on sourmash shared hashes with GTDB r95
download genomes and perform mashmap analyses to find alignments
report and present on alignments

What next?

Well.

Digging into the GenBank contaminated samples, we see the following categories --

hard to interpret

GCF_001672295.1_genomic.fna.gz - many small alignments

confused taxonomy

(probably not contamination)

GCF_002154655.1_genomic.fna.gz

GCF_001683825.1_genomic.fna.gz

bad taxonomy/real contamination

GCA_003222275.1_genomic.fna.gz

GCA_003222535.1_genomic.fna.gz

GCF_001184205.1_genomic.fna.gz

GCF_000492175.1_genomic.fna.gz

GCF_000763125.1_genomic.fna.gz

GCF_001078575.1_genomic.fna.gz

no actual contamination / mispredicted by sourmash

(maybe try with nucmer?)

GCF_900016235.2_genomic.fna.gz

GCF_001749745.1_genomic.fna.gz

GCA_001421185.1_genomic.fna.gz

what next?

I'm thinking about another round of processing, after the alignment stage, in order to identify and prioritize actual contamination. It's clear that sourmash is reporting some things for which there is no actual contamination (based on the last category above), and it's clear that detailed genome-to-genome alignment offers a lot of resolution.

Dec 01 '20 16:12 ctb

the simplest postprocessing thought I can come up with is this: perhaps we should move to prioritizing contigs for removal based on what fraction of the contig aligns to Bad Genomes (genomes with bad taxonomy)? If it's more than (say) 20%, or 50%, or 80%, we can flag it for removal.

We could also filter the alignment view (for both clean and dirty) in this way.

I suspect this will deal with the majority of the confused-taxonomy and mispredicted cases above.
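A minimal sketch of the fraction-based flagging idea, not charcoal's actual code. It assumes alignments to Bad Genomes are already available as `(contig_name, start, end)` tuples with half-open coordinates; the function names, the input layout, and the default `threshold` of 0.5 are all hypothetical. Overlapping alignments are merged first so the same bases are not counted twice.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def bad_fraction(contig_len, intervals):
    """Fraction of a contig covered by at least one bad-genome alignment."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / contig_len


def flag_contigs(contig_lengths, bad_alignments, threshold=0.5):
    """Return {contig: fraction} for contigs whose bad fraction exceeds threshold.

    contig_lengths: dict of contig name -> length (hypothetical input layout)
    bad_alignments: iterable of (contig name, start, end) tuples
    """
    by_contig = defaultdict(list)
    for name, start, end in bad_alignments:
        by_contig[name].append((start, end))

    flagged = {}
    for name, length in contig_lengths.items():
        frac = bad_fraction(length, by_contig.get(name, []))
        if frac > threshold:
            flagged[name] = frac
    return flagged


# Made-up example data, for illustration only.
lengths = {"contig_1": 1000, "contig_2": 2000}
alignments = [
    ("contig_1", 0, 400),
    ("contig_1", 300, 700),   # overlaps the previous alignment
    ("contig_2", 100, 300),
]
print(flag_contigs(lengths, alignments, threshold=0.5))  # {'contig_1': 0.7}
```

Rerunning with `threshold` set to 0.2 or 0.8 is a cheap way to see how sensitive the flagged set is to the cutoff.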

Dec 01 '20 16:12 ctb

a few thoughts --

first, this kind of alignment approach could be useful for flagging confused taxonomies that are probably not contamination (Bacteroides / Firmicutes, for example).

second, I'm thinking of three explicit stages:

stage 1 - find candidate contaminant genomes with sourmash
stage 2 - download and do genome by genome alignments
stage 3 - reports using (filtered) alignments; output dirty/clean contigs.
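A rough sketch of what the stage 2 to stage 3 handoff could look like, under some loud assumptions. It assumes each mapping record is a whitespace-separated line with ten columns (query name, length, start, end, strand, target name, length, start, end, identity), which is how MashMap's classic output is usually described; check this against the version you run. The function names, the `min_identity` cutoff, and the idea that any surviving hit to a bad target makes a contig dirty are illustrative only, not charcoal's behavior.

```python
def parse_mapping(line):
    """Parse one mapping record from the assumed ten-column layout."""
    fields = line.split()
    return {
        "contig": fields[0],
        "contig_len": int(fields[1]),
        "start": int(fields[2]),
        "end": int(fields[3]),
        "target": fields[5],
        "identity": float(fields[9]),
    }


def filter_mappings(lines, min_identity=95.0):
    """Keep only mappings at or above a (hypothetical) identity cutoff."""
    records = (parse_mapping(line) for line in lines if line.strip())
    return [rec for rec in records if rec["identity"] >= min_identity]


def split_contigs(all_contigs, mappings, bad_targets):
    """Partition contigs into (dirty, clean) using the filtered mappings."""
    dirty = {m["contig"] for m in mappings if m["target"] in bad_targets}
    clean = set(all_contigs) - dirty
    return sorted(dirty), sorted(clean)


# Made-up records, for illustration only.
example_lines = [
    "contig_1 1000 0 799 + bad_genome 5000 100 899 98.5",
    "contig_2 2000 0 499 + bad_genome 5000 1000 1499 85.0",
    "contig_3 1500 0 1499 + good_genome 4000 0 1499 99.0",
]
kept = filter_mappings(example_lines, min_identity=95.0)
dirty, clean = split_contigs(
    ["contig_1", "contig_2", "contig_3"], kept, bad_targets={"bad_genome"}
)
print(dirty, clean)  # ['contig_1'] ['contig_2', 'contig_3']
```

Keeping the filter separate from the dirty/clean split means the same filtered mappings could feed both the report view and the contig output.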

third, I think I should try running charcoal on all of GTDB r95 now, and see what it turns up 😁

Dec 02 '20 16:12 ctb
