next steps for charcoal - some thoughts
as of the merge of #160, charcoal now performs the following steps -
- find candidate genomes for contamination analysis, based on sourmash shared hashes with GTDB r95
- download genomes and perform mashmap analyses to find alignments
- report and present on alignments
What next?
Well.
Digging into the GenBank contaminated samples, we see the following categories --
hard to interpret
- GCF_001672295.1_genomic.fna.gz (many small alignments)

confused taxonomy (probably not contamination)
- GCF_002154655.1_genomic.fna.gz
- GCF_001683825.1_genomic.fna.gz

bad taxonomy/real contamination
- GCA_003222275.1_genomic.fna.gz
- GCA_003222535.1_genomic.fna.gz
- GCF_001184205.1_genomic.fna.gz
- GCF_000492175.1_genomic.fna.gz
- GCF_000763125.1_genomic.fna.gz
- GCF_001078575.1_genomic.fna.gz

no actual contamination / mispredicted by sourmash (maybe try with nucmer?)
- GCF_900016235.2_genomic.fna.gz
- GCF_001749745.1_genomic.fna.gz
- GCA_001421185.1_genomic.fna.gz
what next?
I'm thinking about another round of processing, after the alignment stage, in order to identify and prioritize actual contamination. It's clear that sourmash is reporting some things for which there is no actual contamination (based on the last category above), and it's clear that detailed genome-to-genome alignment offers a lot of resolution.
the simplest postprocessing thought I can come up with is this: perhaps we should move to prioritizing contigs for removal based on what fraction of the contig aligns to Bad Genomes (genomes with bad taxonomy)? If it's more than (say) 20%, or 50%, or 80%, we can flag it for removal.
We could also filter the alignment view (for both clean and dirty) in this way.
I suspect this will deal with the majority of the confused-taxonomy and bad-prediction cases above.
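To make the fraction-aligned idea concrete, here is a minimal sketch in Python. It is not charcoal's code: the `(contig, start, end)` tuple format, the function names, and the default threshold are all assumptions made for illustration, and it assumes the alignments have already been restricted to hits against the "bad" genomes. Overlapping alignments are merged first so that no base is counted twice.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total bases covered by half-open (start, end) intervals, merging overlaps."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the current run: bank it and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


def flag_contigs(contig_lengths, alignments, threshold=0.5):
    """Return {contig: aligned_fraction} for contigs above the threshold.

    contig_lengths: {contig_name: length_in_bp}, lengths assumed > 0
    alignments: iterable of (contig_name, start, end) on the query contig
                (hypothetical format, already limited to 'bad' genomes)
    """
    by_contig = defaultdict(list)
    for contig, start, end in alignments:
        by_contig[contig].append((start, end))

    flagged = {}
    for contig, length in contig_lengths.items():
        fraction = merged_length(by_contig[contig]) / length
        if fraction > threshold:
            flagged[contig] = fraction
    return flagged


# Toy data, not real results.
lengths = {"contig1": 1000, "contig2": 2000}
hits = [("contig1", 0, 400), ("contig1", 300, 700), ("contig2", 100, 300)]
flagged = flag_contigs(lengths, hits, threshold=0.5)
```

With the toy data, only `contig1` is flagged: its two overlapping hits cover 700 of 1000 bases, while `contig2` has 200 of 2000 covered. Sweeping `threshold` over 0.2, 0.5 and 0.8 would be one way to see how sensitive the flagged set is to the cutoff.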
a few thoughts --
first, this kind of alignment approach could be useful for flagging confused taxonomies, where they are probably not contamination (bacteroides / firmicutes, for example).
second, I'm thinking of three explicit stages:
- stage 1 - find candidate contaminant genomes with sourmash
- stage 2 - download and do genome by genome alignments
- stage 3 - reports using (filtered) alignments; output dirty/clean contigs.
third, I think I should try running charcoal on all of GTDB r95 now, and see what it turns up 😁