how should we adjust output when using 6-frame-translated signatures?

Open ctb opened this issue 3 years ago • 2 comments

In charcoal we are trying out 6-frame translations to do decontamination: https://github.com/dib-lab/charcoal/pull/120. The gather output that gets reported is pretty lousy because it doesn't adjust for the (large) number of false negatives that come from comparing a 6-frame translation signature to a database constructed with --input-is-protein.
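To make the false-negative problem concrete, here is a toy sketch in plain Python. It does not use sourmash and is not how sourmash sketches or translates sequences. It translates a DNA string in all six frames, collects protein k-mers from every frame, and measures what fraction of them appear in a reference protein. The names `six_frames`, `kmers`, and `fraction_matched` are made up for illustration, and this toy keeps k-mers that contain stop codons (`*`), which real tools may handle differently.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string in frame 0, dropping any trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the six translations: three forward frames, three reverse-complement frames."""
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (dna, reverse_complement)
        for offset in range(3)
    ]


def kmers(protein, k):
    """Return the set of all length-k substrings of a protein string."""
    return {protein[i:i + k] for i in range(len(protein) - k + 1)}


def fraction_matched(dna, reference_protein, k):
    """Fraction of the 6-frame query k-mers that are found in the reference protein."""
    query = set()
    for frame in six_frames(dna):
        query |= kmers(frame, k)
    reference = kmers(reference_protein, k)
    return len(query & reference) / len(query) if query else 0.0


if __name__ == "__main__":
    # Hypothetical coding sequence whose forward frame 0 encodes MAKGELRSHWPD.
    dna = "ATGGCTAAAGGTGAACTGCGTTCTCATTGGCCAGAT"
    protein = translate(dna)
    print(protein)
    print(fraction_matched(dna, protein, k=5))
```

Even though the reference here is exactly the protein the DNA encodes, the matched fraction stays well below 1, because the k-mers from the other five frames dilute the query.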

I wonder if there's anything we can do about this? Seems ...tricky. I'm not even sure we currently track enough information to flag when this is happening!

relevant to #999

ctb, Jul 08 '20 14:07

This continues to be a problem when running 6-frame translated read searches against protein databases for classification. We know the % classified will be incorrect, but I'm not sure we have enough information to produce a "correct" % classification, since we haven't properly evaluated how many k-mers from incorrect ORFs still map to reference databases.

We could:

  • definitively assess how many incorrect-ORF k-mers map at, e.g., k=10. If very few, we could potentially multiply to produce an approximate % classification? If we did this, we would want to warn users of this behavior, so we would need translated sketch info (#2219)
  • run orpheum to find the correct ORF prior to running gather, and compare the results with 6-frame translation
  • ...
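
A rough sketch of what the "multiply" idea in the first option could look like, under strong assumptions that have not been evaluated: every frame contributes about the same number of k-mers, exactly one frame is correct, and incorrect-frame k-mers hit the database at a known rate. The function `rescale_fraction` and the parameter `incorrect_frame_hit_rate` are hypothetical names for illustration, not part of sourmash.

```python
def rescale_fraction(observed, n_frames=6, incorrect_frame_hit_rate=0.0):
    """Estimate the matched fraction of the correct-frame k-mers.

    Assumed model (illustrative only):
      observed = (1 / n) * true + ((n - 1) / n) * e
    where n is the number of frames, true is the fraction of correct-frame
    k-mers that match, and e is the fraction of incorrect-frame k-mers that
    match. Solving for true gives n * observed - (n - 1) * e.
    """
    if not 0.0 <= observed <= 1.0:
        raise ValueError("observed must be between 0 and 1")
    if not 0.0 <= incorrect_frame_hit_rate <= 1.0:
        raise ValueError("incorrect_frame_hit_rate must be between 0 and 1")
    if n_frames < 1:
        raise ValueError("n_frames must be at least 1")
    estimate = n_frames * observed - (n_frames - 1) * incorrect_frame_hit_rate
    # Clamp, since noise or a wrong hit rate can push the estimate out of range.
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    # With no incorrect-frame hits, the observed fraction is scaled by 6.
    print(rescale_fraction(0.1))
    # A nonzero incorrect-frame hit rate lowers the estimate.
    print(rescale_fraction(0.1, incorrect_frame_hit_rate=0.02))
```

If the incorrect-frame hit rate turns out not to be small or stable at the chosen k, this kind of correction would not be trustworthy, which is why the assessment step comes first.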

bluegenes, Aug 17 '22 01:08

...especially a problem for downstream use of gather --> tax, e.g. krona output, where the 'fraction' reported is of the 6-frame translated sketch...

fraction        superkingdom    phylum  class   order   family  genus   species
0.0337919997763739      Eukaryota       Alveolata       Dinophyta       Dinophyceae     Dinophyceae_X   Dinophyceae_XX  Kryptoperidinium
0.030124298839007847    Eukaryota       Alveolata       Dinophyta       Dinophyceae     Dinophyceae_X   Dinophyceae_XX  Alexandrium
0.028228912369629093    Eukaryota       Alveolata       Dinophyta       Dinophyceae     Dinophyceae_X   Dinophyceae_XX  Scrippsiella
0.018354810756415273    Eukaryota       Alveolata       Dinophyta       Dinophyceae     Dinophyceae_X   Dinophyceae_XX  Karenia
0.015011290011988844    Eukaryota       Alveolata       Dinophyta       Dinophyceae     Dinophyceae_X   Suessiales      Symbiodinium

bluegenes, Aug 17 '22 01:08