khmer
[WIP] Enable banding for computing k-mer abundance distributions
Banding stuff has been pretty well battle tested in kevlar, but hasn't made it to the khmer CLI yet. This requires not only adding arguments to support banding, but integrating *table objects into the tooling originally designed only for *graph objects.
- [ ] Is it mergeable?
- [ ] `make test` Did it pass the tests?
- [ ] `make clean diff-cover` If it introduces new functionality in `scripts/` is it tested?
- [ ] `make format diff_pylint_report cppcheck doc pydocstyle` Is it well formatted?
- [ ] Did it change the command-line interface? Only backwards-compatible additions are allowed without a major version increment. Changing file formats also requires a major version number increment.
- [ ] For substantial changes or changes to the command-line interface, is it documented in `CHANGELOG.md`? See keepachangelog for more details.
- [ ] Was a spellchecker run on the source code and documentation after changes were made?
- [ ] Do the changes respect streaming IO? (Are they tested for streaming IO?)
This depends on consume_seqfile_banding_with_reads_parser from #1753.
Some initial benchmarking this weekend: I ran abundance-dist-single.py with and without banding (--banding 8 1) and with and without threading (--threads 4). Banding does improve runtime, but the improvement doesn't appear to stack with the improvement from threading.
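For anyone following along who hasn't seen banding before, here is a rough Python sketch of the idea as I understand it: the hash space is split into N equal intervals ("bands"), and only k-mers whose hash falls in the chosen band get counted, so each pass touches roughly 1/N of the k-mers. This is not khmer's implementation. `band_interval`, `hash_kmer`, and `count_kmers_in_band` are hypothetical names, the hash is a stand-in rather than murmur or cyclic, and the 0-based band index is an assumption made for the sketch (it may not match how `--banding 8 1` numbers bands).

```python
import hashlib

MAX_HASH = 2**64 - 1  # assumption: 64-bit hash values


def band_interval(num_bands, band):
    """Return the half-open hash interval [lo, hi) for a 0-based band index."""
    if not 0 <= band < num_bands:
        raise ValueError("band must be in [0, num_bands)")
    size = MAX_HASH // num_bands
    lo = size * band
    # The last band absorbs the remainder so the whole hash space is covered.
    hi = MAX_HASH + 1 if band == num_bands - 1 else size * (band + 1)
    return lo, hi


def hash_kmer(kmer):
    """Stand-in 64-bit hash; NOT khmer's murmur or cyclic hash."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def count_kmers_in_band(seq, ksize, num_bands, band):
    """Count only the k-mers of seq whose hash lands in the given band."""
    lo, hi = band_interval(num_bands, band)
    counts = {}
    for i in range(len(seq) - ksize + 1):
        kmer = seq[i:i + ksize]
        if lo <= hash_kmer(kmer) < hi:
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts
```

Because the bands are disjoint and together cover the whole hash space, counting each band separately and merging the results should give the same counts as a single unbanded pass.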
Hmm. The cyclic hash is not only faster than the murmur hash (we already knew that), but its speed benefits seem to stack better with threading. The commands I used:
time scripts/abundance-dist-single.py --hash-function cyclic --max-memory-usage 25M --ksize 25 ~/Desktop/kevlar/scratch/nano-trio/proband-reads.fq.gz nobanding-nothreads.abunddist
time scripts/abundance-dist-single.py --hash-function cyclic --banding 8 1 --max-memory-usage 25M --ksize 25 ~/Desktop/kevlar/scratch/nano-trio/proband-reads.fq.gz banding-nothreads.abunddist
time scripts/abundance-dist-single.py --hash-function cyclic --threads 4 --max-memory-usage 25M --ksize 25 ~/Desktop/kevlar/scratch/nano-trio/proband-reads.fq.gz nobanding-threads.abunddist
time scripts/abundance-dist-single.py --hash-function cyclic --threads 4 --banding 8 1 --max-memory-usage 25M --ksize 25 ~/Desktop/kevlar/scratch/nano-trio/proband-reads.fq.gz banding-threads.abunddist
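If it helps with reproducing this, the four runs above could be scripted roughly as follows. This is only a sketch: `build_command` and `benchmark` are hypothetical helper names, the flags are exactly the ones in the commands above (with `--banding` only existing on this branch), and it records wall-clock time only, not the user/sys split that `time` reports.

```python
import subprocess
import time

SCRIPT = "scripts/abundance-dist-single.py"


def build_command(reads, outfile, banding=None, threads=None):
    """Build the argument list for one run, mirroring the commands above."""
    cmd = [SCRIPT, "--hash-function", "cyclic"]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if banding is not None:
        num_bands, band = banding
        cmd += ["--banding", str(num_bands), str(band)]
    cmd += ["--max-memory-usage", "25M", "--ksize", "25", reads, outfile]
    return cmd


def benchmark(reads):
    """Run all four banding/threading combinations; return wall-clock seconds."""
    timings = {}
    for threads in (None, 4):
        for banding in (None, (8, 1)):
            label = "{}-{}".format(
                "banding" if banding else "nobanding",
                "threads" if threads else "nothreads",
            )
            cmd = build_command(reads, label + ".abunddist", banding, threads)
            start = time.perf_counter()
            subprocess.run(cmd, check=True)  # raises if the script fails
            timings[label] = time.perf_counter() - start
    return timings
```

Calling `benchmark` with the path to a reads file, from the root of a khmer checkout, would write the same four `.abunddist` files as the commands above and return a dict of timings keyed by label.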