charcoal
What database(s) do we want to use for charcoal?
GTDB 25k is all well and good, but it is probably not as sensitive as all of GenBank.
Could we / should we build a "screened" GenBank that includes any GenBank genome with no significant (sourmash gather) matches in GTDB? That would be fairly straightforward to do.
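A rough sketch of that screening step, in case it helps the discussion. It assumes the gather results have already been collected into a plain dict mapping each GenBank genome ID to the list of match fractions it got against GTDB. The function name `screen_genomes`, the dict layout, and the `min_fraction` cutoff of 0.1 are all hypothetical; what counts as "significant" is exactly the thing we would need to decide.

```python
from typing import Dict, List


def screen_genomes(
    gather_hits: Dict[str, List[float]],
    min_fraction: float = 0.1,  # assumed cutoff for a "significant" match
) -> List[str]:
    """Return genome IDs with no GTDB match at or above min_fraction."""
    kept = []
    for genome_id, fractions in sorted(gather_hits.items()):
        # Keep the genome only if every match is below the cutoff.
        # An empty list (no matches at all) is kept as well.
        if all(f < min_fraction for f in fractions):
            kept.append(genome_id)
    return kept


# Made-up example data: genome IDs and fractions are illustrative only.
example_hits = {
    "genome_A": [0.85, 0.02],  # strong GTDB match, so it is screened out
    "genome_B": [],            # no matches, so it is kept
    "genome_C": [0.03],        # only a weak match, so it is kept
}
screened = screen_genomes(example_hits)
print(screened)
```

Keeping the cutoff as a parameter would make it easy to compare how large the screened set gets under different definitions of "significant".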
If we do, should we also run CheckM/GTDB-Tk on those genomes to estimate contamination? Or rather, should we do any further curation at all?
I guess we could have 3 databases: gtdb25k, gtdb140k, and gtdb140k + (genbank - gtdb). The documentation could carry a "buyer beware" note for the third one.