Build and classification parameter suggestions based on data type (aDNA)
Hi again! Another thing I would really love your input on is build and classification parameters for the type of data I work with, which is ancient DNA: the reads are short (average length around 70 bp; the shortest are around 30 bp) and damaged.
In case it helps, here is some more information about what we'd be using ganon for. We are trying to use read-classification tools for multiple aDNA projects, so keep in mind that everything is short and damaged. We're thinking of building a custom database containing our target species (along with human and UniVec_Core), in the hopes of identifying samples to the species level for bones that are morphologically hard to distinguish.

We also want to build a much larger database with ALL the RefSeq genomes (limited to 3 assemblies per species); we will probably split this database into categories, since ganon can classify reads against multiple databases. We were hoping to use this RefSeq database for sediment and coprolite projects. Some examples of what we are trying to find include environmental/host microbial communities and host diet, and for some of the sediment work we want to see whether we can identify the animal community as well.
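For concreteness, here is roughly what I'm picturing for the two builds. This is only a sketch from my reading of the docs, not something I've run yet; in particular I'm assuming `--top` is the right way to cap assemblies per taxon and that `build-custom` accepts a mix of folders and files via `--input`:

```bash
# Custom target DB: target species + human + UniVec_Core
# (flag spellings from my reading of the docs; paths are placeholders)
ganon build-custom \
    --input target_genomes/ GRCh38.fna.gz UniVec_Core.fna.gz \
    --db-prefix targets \
    --taxonomy ncbi \
    --level species \
    --threads 16

# One category of the big RefSeq DB; repeat per organism group
# (assuming --top 3 caps assemblies per species, which I may be misreading)
ganon build \
    --source refseq \
    --organism-group vertebrate_mammalian \
    --top 3 \
    --db-prefix refseq_vert_mam \
    --threads 16
```

Does that look like a reasonable way to split into categories?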
Since the reads are short, it would be nice to use shorter k-mer lengths to maximize the number of k-mers per read that could match reference k-mers. But I can also see how the complexity/uniqueness of k-mers decreases as they get shorter, which would make spurious matches more prevalent. I'm trying to find that balance for the build: do you have suggestions for `--kmer-size`, `--window-size` (the minimizer window), and possibly the maximum false positive rate for a database intended for classifying ancient DNA?
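Here is the back-of-the-envelope math driving my worry, plus the kind of build I'm tempted to try. The window arithmetic is just L - w + 1; the flag names (`--kmer-size`, `--window-size`, `--max-fp`) and the values are my assumptions, to be corrected:

```bash
# Minimizer windows per read of length L (window count = L - w + 1):
#   70 bp read, defaults k=19/w=31 -> 70 - 31 + 1 = 40 windows
#   30 bp read, defaults k=19/w=31 -> 30 - 31 + 1 = 0 windows (unclassifiable?)
#   30 bp read, k=15/w=19          -> 30 - 19 + 1 = 12 windows
# So something like this (values are guesses, not recommendations):
ganon build-custom \
    --input target_genomes/ \
    --db-prefix targets_k15 \
    --kmer-size 15 \
    --window-size 19 \
    --max-fp 0.001 \
    --threads 16
```

Is a stricter false positive rate a sensible way to compensate for the lower k-mer complexity, or does it just blow up the filter size for little gain?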
On the classification side, I'm struggling to balance the cutoff and filter options for similar reasons. Since ancient reads are short, the number of k-mers per read is limited, which also limits the number of k-mers a read can share with a reference. So I need parameters strict enough to filter out spurious matches, but loose enough that short reads can still be classified and that mismatches caused by ancient damage are tolerated. If you can at least point me in the right direction, that would be great! Maybe suggest a range for `--rel-cutoff` and `--rel-filter` that I can test out with some ancient data?
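To make "test out" concrete, this is the sweep I had in mind on a sample with known content; the grid values are just my guesses at a plausible range:

```bash
# Sweep rel-cutoff / rel-filter on a control aDNA sample
for c in 0.2 0.4 0.6 0.8; do
    for f in 0 0.1 0.2; do
        ganon classify \
            --db-prefix targets \
            --single-reads sample_adna.fq.gz \
            --rel-cutoff "$c" \
            --rel-filter "$f" \
            --output-prefix "sweep_c${c}_f${f}" \
            --threads 16
    done
done
```

If the defaults are already close to right for short reads, feel free to say so and save me the compute!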
And for reads with multiple matches, would you use the LCA or the EM algorithm? Is there a reason to prefer one over the other? Also, it seems the EM and LCA approaches to resolving multiple matches are not necessarily mutually exclusive. Is there a way to use them hierarchically, or to apply one and then the other? Would you recommend doing that? Anything else you can think of that I should consider based on my data type would be much appreciated!
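In case it clarifies the question, here is how I imagined chaining them: hierarchy levels to screen host/vector first, then EM reassignment on whatever still has multiple matches. All flag names here (especially for `ganon reassign`) are assumed from the docs, not verified:

```bash
# Hierarchical classification: host DB first, then targets
ganon classify \
    --db-prefix human targets \
    --hierarchy-labels 1_host 2_targets \
    --single-reads sample_adna.fq.gz \
    --output-prefix out \
    --output-all \
    --threads 16

# EM reassignment of multi-matched reads (flags assumed)
ganon reassign \
    --input-prefix out \
    --output-prefix out_em
```

Is that a sensible combination, or does one step make the other redundant?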