How to export kmer abundances?

Open olgabot opened this issue 7 years ago • 7 comments

Hello there! I'm interested in using khmer to count and filter k-mers from transcriptome datasets, and then use the raw k-mer counts across multiple (~10) samples as the "X" matrix for a classification algorithm, e.g. in scikit-learn or TensorFlow. How can one combine and extract the k-mer counts to, say, an HDF5 or sparse matrix file? Warmest, Olga

olgabot avatar May 13 '18 22:05 olgabot

Hi Olga, I think this is better done with sourmash, actually; sourmash applies a random subsampling algorithm (MinHash) to extract a subset of k-mers, which is much more manageable than using all of them! We have a decent-ish Python API (well, the API is fine, but the docs are a bit underpolished).
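
To make the subsampling idea concrete, here is a rough sketch in plain Python. It is not sourmash's code or API: the function names are made up for illustration, `blake2b` stands in for whatever hash sourmash actually uses, the `scaled` parameter is assumed to mean "keep roughly 1 in `scaled` k-mers", and reverse complements are ignored for brevity.

```python
import hashlib
from collections import Counter

MAX_HASH = 2**64 - 1  # the stand-in hash below yields 64-bit unsigned integers


def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (blake2b is an assumed stand-in hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def scaled_sketch(seq, k, scaled):
    """Count only the k-mers whose hash lands in the lowest 1/scaled of hash space."""
    threshold = MAX_HASH // scaled
    counts = Counter()
    for kmer in kmers(seq, k):
        h = hash_kmer(kmer)
        if h <= threshold:
            counts[h] += 1
    return counts


if __name__ == "__main__":
    seq = "ACGTACGTTGCAACGTTAGC"
    print(len(scaled_sketch(seq, 4, 1)), "distinct hashes kept with scaled=1")
    print(len(scaled_sketch(seq, 4, 4)), "distinct hashes kept with scaled=4")
```

Because the keep/drop decision depends only on the hash value, the same k-mers are retained in every sample, so the subsampled counts stay comparable across samples.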

If you really want all k-mers, then I can point you towards another set of code, bbhash (https://github.com/dib-lab/pybbhash and references therein; we just wrote a Python wrapper for it :) ), that will let you construct a minimal perfect hash function for tracking (and counting) k-mers.

The latter route involves a bit more alpha code but I can give you instructions for getting started. Thoughts on which approach seems more interesting?

ctb avatar May 14 '18 13:05 ctb

Thanks for the response, I'll try out sourmash for now! I'm not deep enough into the project to want to deal with alpha code :)

Will keep you posted!

olgabot avatar May 14 '18 17:05 olgabot

though if I have time, is there a way to use pybbhash on existing kmer graphs created by khmer? I'd like to put these 60GB files to good use

olgabot avatar May 14 '18 18:05 olgabot

> Is there a way to use pybbhash on existing kmer graphs created by khmer?

I don't think so. The hashing strategy is quite different between khmer and the MPHF used by pybbhash. :(

standage avatar May 14 '18 18:05 standage

@luizirber is there a sourmash-native rust fn we can use for kmer counting with the recent version of sourmash?

phiweger avatar Nov 17 '20 09:11 phiweger

> @luizirber is there a sourmash-native rust fn we can use for kmer counting with the recent version of sourmash?

Hmm, I guess you could use scaled=1 and the BTree-backed MinHash from https://github.com/dib-lab/sourmash/pull/1045, but this is only exposed in the sourmash compute CLI, not in the Python API*. But that would still use a lot of memory for large datasets...

* Sort of. It wouldn't be hard to expose, tho.

luizirber avatar Nov 19 '20 00:11 luizirber

a few things --

  • pybbhash is much less alpha now - we've been using it in spacegraphcats quite successfully and I added some testing on both pybbhash AND in spacegraphcats. It would be straightforward to use it for k-mer counting, although you'd have to go across the data twice to use it. LMK if you want a code example!
  • if you're interested in counting the subset of k-mers, see e.g. https://github.com/dib-lab/sourmash/pull/933 - probably not what you're looking for but thought I'd post it just in case.
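
For a rough picture of the two-pass approach, here is a sketch in plain Python. It does not use pybbhash: a regular dict stands in for the minimal perfect hash function (a real MPHF gives the same kind of k-mer-to-slot mapping without storing the k-mers themselves), and the function names and the `samples` layout (sample name mapped to a list of reads) are assumptions made up for illustration.

```python
def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k):
    """Pass 1: give each distinct k-mer a unique slot in 0..n-1.

    The dict is only a stand-in for an MPHF built over the distinct k-mers.
    """
    index = {}
    for reads in samples.values():
        for read in reads:
            for kmer in kmers(read, k):
                index.setdefault(kmer, len(index))
    return index


def count_matrix(samples, index, k):
    """Pass 2: fill one row of counts per sample, one column per k-mer slot."""
    names = sorted(samples)
    matrix = [[0] * len(index) for _ in names]
    for row, name in zip(matrix, names):
        for read in samples[name]:
            for kmer in kmers(read, k):
                row[index[kmer]] += 1
    return names, matrix


if __name__ == "__main__":
    samples = {"sample_a": ["ACGTACGT"], "sample_b": ["CGTACGTT", "ACGT"]}
    index = build_index(samples, 4)
    names, matrix = count_matrix(samples, index, 4)
    for name, row in zip(names, matrix):
        print(name, row)
```

The first pass is needed because the slot assignment has to be fixed over the full set of distinct k-mers before any counts can be written. Since all samples share the same columns, the rows can be stacked directly into a samples-by-k-mers matrix for a classifier.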

ctb avatar Nov 20 '20 14:11 ctb