genome-grist
genome-grist copied to clipboard
comparing and evaluating DNA vs protein matches
So, I did a thing, and I don't really know how to evaluate it. Thoughts welcome!
summary: DNA and protein
so I gristed a bunch of the Northen data sets, using both DNA searching (against all of genbank, ~700k) and protein searching (against @bluegenes genus-level GTDB databases).
Then I intersected with Genbank taxonomy and did taxonomic aggregation and summarization (#35).
taxonomy report on DNA search
default config, against genbank.
phylum level DNA:
genus level DNA:
species level DNA:
taxonomy report on protein search
config:
outdir: outputs.roux.protein/
sourmash_scaled: 1000
sourmash_compute_ksizes:
- 11
sourmash_sigtype: protein
sourmash_database_glob_pattern: /home/ntpierce/thumper/databases/gtdb95-genus-n0.protein-k11-scaled100.sbt.zip
sourmash_database_ksize: 33
sourmash_database_threshold_bp: 50e3
phylum level protein:
genus level protein
species level protein:
so the question becomes, how do we evaluate this? random thoughts and notes.
- we would expect protein gather on metagenomes to be much more sensitive to divergent reference genomes than DNA gather.
- with DNA references, it's fairly straightforward to "validate" the sourmash stuff by mapping the reads and showing that, in general, the hash matches and read mapping coverage are consonant.
- but for protein, validation is trickier because standard mapping approaches don't really exist.
- this presents a challenge for figuring out if the protein matches are correct! maybe we could use paladin to do so? @taylorreiter thoughts?
- the difference in content between databases (~10k genus-level proteomes, vs ~700k genbank genomes) precludes straight up matching.
- the six frame translation and false positives and different k-mer sizes and probably other things preclude comparing the x axis on the protein vs DNA plots
- I initially thought about taking the specific DNA genomes and checking them out in protein space, but that goes in the wrong direction!
- we could maybe calculate a sensitivity of some kind by aggregating the DNA matches to the genus level and then looking to see how many new genera there are? not sure that makes sense.
and just to be clear, this is all @bluegenes fault.
Ended up running this on podar, with some pretty good results. I've put the specific data sets and notebooks here, for now: https://github.com/ctb/2021-podar-gathertax
I think paladin would be good! Paladin is basically a drop in replacement for BWA, but instead of mapping nucleotide reads to to nucleotide genomes/references, it maps nucleotide reads to protein genomes/references. I'm not sure what makes the most sense for getting the protein reference. I'm pretty sure some genbank genomes have it, but to play nice with genome-grist I think running prokka on the nucleotide genome and using the .faa
annotated files that result to map against would probably make the most sense.
Thanks taylor!
This gels with some thoughts I had the other day - thought process,
- we really want to know what fraction of a metagenome can be classified at the protein level
- the false k-mers that come from not knowing the correct frame of the metagenome reads introduce lots of confounders in doing this the way we do with DNA k-mers
- we could ask "what fraction of the graph can be classified" a la spacegraphcats https://github.com/spacegraphcats/spacegraphcats/issues/293 but if the graph is really fragmented that might also be very incorrect
- we could ask "what fraction of reads touch on a contig in the graph that has been classified" but now we're talking about reads, not k-mers ;)
- we could do other tricky things like using reads to "extend" classified/matched contigs to other contigs, but seems messy and slow
so the best is to map reads to a proteome with (e.g.) paladin, and estimate fraction covered that way. We still face the challenge that the proteome is less than 100% of the genome (obviously :) but we could either simply state that or correct for it in our output.
@bluegenes suggests that we try lenient mapping in DNA space as well!