khmer
khmer copied to clipboard
[WIP] Prototype: count k-mers in banded mode in a single pass
This PR is a Python prototype of the idea described in #1800. In brief, the idea is to count k-mers in the input reads and populate N counttables (each with ≈1/N of the k-mers) in a single pass.
This prototype implements a k-mer buffer that, when "full", writes to disk and flushes/resets. Once all the input reads are processed, the temp files are read back from disk 1-by-1 to populate a Counttable for each band.
The prototype works as expected, but the performance is painfully slow on large data sets. It would need a Cython or C++ implementation with threading support to be viable methinks. I would welcome any thoughts. cc @camillescott @betatim @ctb @luizirber
- [ ] Is it mergeable?
- [ ]
make testDid it pass the tests? - [ ]
make clean diff-coverIf it introduces new functionality inscripts/is it tested? - [ ]
make format diff_pylint_report cppcheck doc pydocstyleIs it well formatted? - [ ] Did it change the command-line interface? Only backwards-compatible additions are allowed without a major version increment. Changing file formats also requires a major version number increment.
- [ ] For substantial changes or changes to the command-line interface, is it
documented in
CHANGELOG.md? See keepachangelog for more details. - [ ] Was a spellchecker run on the source code and documentation after changes were made?
- [ ] Do the changes respect streaming IO? (Are they tested for streaming IO?)
curl -L https://osf.io/f5trh/download?version=1 -o reads.fq.gz
sandbox/count-band-single-pass.py \
--ksize 25 --num-bands 4 --buffersize 1e7 --memory 1e7 \
--outfmt "bandtest{}.ct" \
reads.fq.gz
The procedure above will result in the same output as the Python commands below, which do 4 passes over the reads.
for i in range(4):
counts = khmer.Counttable(25, 1e7 / 4, 4)
counts.consume_seqfile_banding('reads.fq.gz', 4, i)
of = 'bandtest{}.ct'.format(i)
counts.save(of)
For my education: why split into bands but then create N counttables? I understand "let's make N bands because we have far too much data anyway" but that doesn't fit with then making N bands. Tim is confused.com
We are investigating some use cases where we compute only on a single band, but there is at least one use case (motivated by kevlar and variant discovery) where we want to compute on each band independently. In this case, the banding provides us a way of randomly sharding the k-mers into N disjoint sets, each of which only requires 1/N of the memory required to store the entire data set.
This PR is looking for an efficient way to compute counts for all bands in a single first pass, rather than N passes.