genome-grist icon indicating copy to clipboard operation
genome-grist copied to clipboard

comparing and evaluating DNA vs protein matches

Open ctb opened this issue 4 years ago • 5 comments

So, I did a thing, and I don't really know how to evaluate it. Thoughts welcome!

summary: DNA and protein

so I gristed a bunch of the Northen data sets, using both DNA searching (against all of genbank, ~700k) and protein searching (against @bluegenes genus-level GTDB databases).

Then I intersected with Genbank taxonomy and did taxonomic aggregation and summarization (#35).

taxonomy report on DNA search

default config, against genbank.

phylum level DNA: dna gathergram-phylum-SRR5855411

genus level DNA: dna gathergram-genus-SRR5855411

species level DNA: dna gathergram-species-SRR5855411

taxonomy report on protein search

config:

outdir: outputs.roux.protein/
sourmash_scaled: 1000
sourmash_compute_ksizes:
- 11
sourmash_sigtype: protein
sourmash_database_glob_pattern: /home/ntpierce/thumper/databases/gtdb95-genus-n0.protein-k11-scaled100.sbt.zip
sourmash_database_ksize: 33
sourmash_database_threshold_bp: 50e3

phylum level protein: prot gathergram-phylum-SRR5855411

genus level protein prot gathergram-genus-SRR5855411

species level protein: prot gathergram-species-SRR5855411

ctb avatar Dec 24 '20 17:12 ctb

so the question becomes, how do we evaluate this? random thoughts and notes.

  • we would expect protein gather on metagenomes to be much more sensitive to divergent reference genomes than DNA gather.
  • with DNA references, it's fairly straightforward to "validate" the sourmash stuff by mapping the reads and showing that, in general, the hash matches and read mapping coverage are consonant.
  • but for protein, validation is trickier because standard mapping approaches don't really exist.
  • this presents a challenge for figuring out if the protein matches are correct! maybe we could use paladin to do so? @taylorreiter thoughts?
  • the difference in content between databases (~10k genus-level proteomes, vs ~700k genbank genomes) precludes straight up matching.
  • the six frame translation and false positives and different k-mer sizes and probably other things preclude comparing the x axis on the protein vs DNA plots
  • I initially thought about taking the specific DNA genomes and checking them out in protein space, but that goes in the wrong direction!
  • we could maybe calculate a sensitivity of some kind by aggregating the DNA matches to the genus level and then looking to see how many new genera there are? not sure that makes sense.

and just to be clear, this is all @bluegenes fault.

ctb avatar Dec 24 '20 17:12 ctb

Ended up running this on podar, with some pretty good results. I've put the specific data sets and notebooks here, for now: https://github.com/ctb/2021-podar-gathertax

ctb avatar Jan 04 '21 01:01 ctb

I think paladin would be good! Paladin is basically a drop in replacement for BWA, but instead of mapping nucleotide reads to to nucleotide genomes/references, it maps nucleotide reads to protein genomes/references. I'm not sure what makes the most sense for getting the protein reference. I'm pretty sure some genbank genomes have it, but to play nice with genome-grist I think running prokka on the nucleotide genome and using the .faa annotated files that result to map against would probably make the most sense.

taylorreiter avatar Jan 16 '21 17:01 taylorreiter

Thanks taylor!

This gels with some thoughts I had the other day - thought process,

  • we really want to know what fraction of a metagenome can be classified at the protein level
  • the false k-mers that come from not knowing the correct frame of the metagenome reads introduce lots of confounders in doing this the way we do with DNA k-mers
  • we could ask "what fraction of the graph can be classified" a la spacegraphcats https://github.com/spacegraphcats/spacegraphcats/issues/293 but if the graph is really fragmented that might also be very incorrect
  • we could ask "what fraction of reads touch on a contig in the graph that has been classified" but now we're talking about reads, not k-mers ;)
  • we could do other tricky things like using reads to "extend" classified/matched contigs to other contigs, but seems messy and slow

so the best is to map reads to a proteome with (e.g.) paladin, and estimate fraction covered that way. We still face the challenge that the proteome is less than 100% of the genome (obviously :) but we could either simply state that or correct for it in our output.

ctb avatar Jan 16 '21 17:01 ctb

@bluegenes suggests that we try lenient mapping in DNA space as well!

ctb avatar Feb 02 '21 18:02 ctb