khmer
How to export kmer abundances?
Hello there!
I'm interested in using khmer to count and filter kmers from transcriptome datasets, and then use the raw kmer counts across multiple (~10) samples as the "X" matrix for a classification algorithm, e.g. in scikit-learn or tensorflow. How can one combine and extract the kmer counts to, say, an HDF5 or sparse matrix file?
Warmest,
Olga
Hi Olga, I think this is better done with sourmash, actually; sourmash applies a random subsampling algorithm (MinHash) to extract a subset of k-mers, which is much more manageable than using all of them! We have a decent-ish Python API (well, the API is fine, but the docs are a bit underpolished).
If you really want all k-mers, then I can point you towards another set of code, bbhash (see https://github.com/dib-lab/pybbhash and references therein; we just wrote a Python wrapper for it :), which will let you construct a minimal perfect hash function for tracking (and counting) k-mers.
The latter route involves a bit more alpha code but I can give you instructions for getting started. Thoughts on which approach seems more interesting?
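To make the subsampling idea concrete, here is a minimal pure-Python sketch of hash-threshold subsampling: hash every k-mer and keep only those whose hash falls in the lowest 1/`scaled` fraction of the hash space. This is not sourmash's code or API. The function names, the use of `hashlib.blake2b` as the hash, and the `scaled` argument are assumptions for illustration, and reverse-complement handling is left out for brevity.

```python
import hashlib
from collections import Counter

MAX_HASH = 2**64 - 1  # largest value of the 64-bit hash used below


def kmers(seq, k):
    """Yield every overlapping k-mer in seq (forward strand only)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (blake2b is a stand-in hash)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def subsample_counts(seqs, k, scaled):
    """Count only k-mers whose hash is <= MAX_HASH // scaled.

    scaled=1 keeps every k-mer; larger values keep roughly 1/scaled
    of the distinct k-mers, and the same ones in every sample.
    """
    threshold = MAX_HASH // scaled
    counts = Counter()
    for seq in seqs:
        for kmer in kmers(seq, k):
            h = hash_kmer(kmer)
            if h <= threshold:
                counts[h] += 1
    return counts


if __name__ == "__main__":
    reads = ["ACGTACGTTGCA", "TTGCAACGTACG"]  # hypothetical toy reads
    print(len(subsample_counts(reads, k=4, scaled=1)), "hashes kept at scaled=1")
    print(len(subsample_counts(reads, k=4, scaled=4)), "hashes kept at scaled=4")
```

Because the threshold depends only on the hash value, the same k-mers are kept in every sample, so the subsampled counts from different samples line up as columns of one matrix.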
Thanks for the response, I'll try out sourmash for now! I'm not deep enough into the project to want to deal with alpha code :)
Will keep you posted!
though if I have time, is there a way to use pybbhash on existing kmer graphs created by khmer? I'd like to put these 60GB files to good use
Is there a way to use pybbhash on existing kmer graphs created by khmer?
I don't think so. The hashing strategy is quite different between khmer and the MPHF used by pybbhash. :(
@luizirber is there a sourmash-native rust fn we can use for kmer counting with the recent version of sourmash?
Hmm, I guess you could use scaled=1 and the BTree-backed MinHash from https://github.com/dib-lab/sourmash/pull/1045, but this is only exposed to sourmash compute CLI, not in the Python API*. But that would still use a lot of memory for large datasets...
* Sort of. It wouldn't be hard to expose, tho.
a few things --
- pybbhash is much less alpha now - we've been using it in spacegraphcats quite successfully and I added some testing on both pybbhash AND in spacegraphcats. It would be straightforward to use it for k-mer counting, although you'd have to go across the data twice to use it. LMK if you want a code example!
- if you're interested in counting the subset of k-mers, see e.g. https://github.com/dib-lab/sourmash/pull/933 - probably not what you're looking for but thought I'd post it just in case.
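For reference, the two-pass counting pattern mentioned above can be sketched in plain Python. This is only an illustration of the pattern, not pybbhash code: a `dict` stands in for the minimal perfect hash function (it plays the same role, mapping each distinct k-mer to a dense index, but uses far more memory), and the function names and sample layout are hypothetical.

```python
def kmers(seq, k):
    """Yield every overlapping k-mer in seq (forward strand only)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k):
    """Pass 1: give each distinct k-mer a dense column index.

    A real MPHF would be built from the set of distinct k-mers here;
    the dict is a stand-in for it.
    """
    index = {}
    for reads in samples.values():
        for read in reads:
            for kmer in kmers(read, k):
                if kmer not in index:
                    index[kmer] = len(index)
    return index


def count_matrix(samples, k, index):
    """Pass 2: count k-mers per sample, returned as COO triplets.

    Rows follow sorted sample names; columns follow the index.
    """
    rows, cols, vals = [], [], []
    for row, name in enumerate(sorted(samples)):
        counts = {}
        for read in samples[name]:
            for kmer in kmers(read, k):
                col = index[kmer]
                counts[col] = counts.get(col, 0) + 1
        for col, n in sorted(counts.items()):
            rows.append(row)
            cols.append(col)
            vals.append(n)
    return rows, cols, vals


if __name__ == "__main__":
    # Hypothetical toy input: sample name -> list of reads.
    samples = {"a": ["ACGTAC"], "b": ["ACGT", "CGTA"]}
    index = build_index(samples, k=3)
    print(count_matrix(samples, k=3, index=index))
```

If SciPy is available, the triplets can be turned into a samples-by-k-mers sparse matrix with `scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(len(samples), len(index)))`, which is the kind of "X" matrix asked about at the top of the thread.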