[WIP] Enable banding for computing k-mer abundance distributions

Open standage opened this issue 8 years ago • 3 comments

Banding stuff has been pretty well battle tested in kevlar, but hasn't made it to the khmer CLI yet. This requires not only adding arguments to support banding, but integrating *table objects into the tooling originally designed only for *graph objects.
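For readers unfamiliar with banding, here is a minimal pure-Python sketch of the idea: split the hash space into equal-width bands and count only the k-mers whose hash lands in the chosen band, so each pass needs a fraction of the memory. This is an illustration under stated assumptions, not khmer's or kevlar's implementation. The helper names (`kmer_hash`, `band_interval`, `count_band`, `abundance_dist`), the BLAKE2 hash, and the 0-based band index are all hypothetical choices made for the example.

```python
import hashlib
from collections import Counter

HASH_BITS = 64  # assumption: 64-bit k-mer hashes
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmer_hash(kmer):
    """Hash the canonical form of a k-mer to a 64-bit integer.

    BLAKE2 is a stand-in here; it is not the hash function khmer uses.
    """
    revcomp = kmer.translate(COMPLEMENT)[::-1]
    canonical = min(kmer, revcomp)
    digest = hashlib.blake2b(canonical.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def band_interval(num_bands, band):
    """Return the half-open hash interval [lo, hi) for a 0-based band."""
    if not 0 <= band < num_bands:
        raise ValueError("band must satisfy 0 <= band < num_bands")
    size = 2**HASH_BITS // num_bands
    lo = band * size
    # The last band absorbs the remainder so the bands cover the whole space.
    hi = 2**HASH_BITS if band == num_bands - 1 else lo + size
    return lo, hi


def count_band(reads, ksize, num_bands, band):
    """Count only the k-mers whose hash falls inside the requested band."""
    lo, hi = band_interval(num_bands, band)
    counts = Counter()
    for read in reads:
        for i in range(len(read) - ksize + 1):
            h = kmer_hash(read[i:i + ksize])
            if lo <= h < hi:
                counts[h] += 1
    return counts


def abundance_dist(counts):
    """Map each abundance to the number of distinct k-mers with that abundance."""
    return Counter(counts.values())
```

Because the bands are disjoint and together cover the whole hash space, summing the per-band abundance distributions in this sketch gives the same distribution as a single unbanded pass (with exact counters, that is; a probabilistic table would add its own error).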

  • [ ] Is it mergeable?
  • [ ] `make test` Did it pass the tests?
  • [ ] `make clean diff-cover` If it introduces new functionality in `scripts/` is it tested?
  • [ ] `make format diff_pylint_report cppcheck doc pydocstyle` Is it well formatted?
  • [ ] Did it change the command-line interface? Only backwards-compatible additions are allowed without a major version increment. Changing file formats also requires a major version number increment.
  • [ ] For substantial changes or changes to the command-line interface, is it documented in CHANGELOG.md? See keepachangelog for more details.
  • [ ] Was a spellchecker run on the source code and documentation after changes were made?
  • [ ] Do the changes respect streaming IO? (Are they tested for streaming IO?)

standage, Aug 31 '17 16:08

This depends on consume_seqfile_banding_with_reads_parser from #1753.

standage, Aug 31 '17 16:08

Some initial benchmarking this weekend: I ran abundance-dist-single.py with & without banding (--banding 8 1) and with & without threading (--threads 4). Banding does improve runtime, but the improvement doesn't appear to stack with the improvement from threading.

standage, Nov 13 '17 21:11

Hmm. The cyclic hash is not only faster than the murmur hash (we already knew that), but its speed benefits seem to stack better with threading. The commands I used:

time scripts/abundance-dist-single.py --hash-function cyclic --max-memory-usage 25M --ksize 25 ~/Desktop/kevlar/scratch/nano-trio/proband-reads.fq.gz nobanding-nothreads.abunddist
time scripts/abundance-dist-single.py --hash-function cyclic --banding 8 1 --max-memory-usage 25M --ksize 25 ~/Desktop/kevlar/scratch/nano-trio/proband-reads.fq.gz banding-nothreads.abunddist
time scripts/abundance-dist-single.py --hash-function cyclic --threads 4 --max-memory-usage 25M --ksize 25 ~/Desktop/kevlar/scratch/nano-trio/proband-reads.fq.gz nobanding-threads.abunddist
time scripts/abundance-dist-single.py --hash-function cyclic --threads 4 --banding 8 1 --max-memory-usage 25M --ksize 25 ~/Desktop/kevlar/scratch/nano-trio/proband-reads.fq.gz banding-threads.abunddist
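The four runs above form a 2x2 grid (banding on/off, threading on/off). A small sketch like the following could generate and time them from one place, which makes it easier to repeat the comparison with other settings. It only reuses the flags shown in the commands above; the function names and the `reads.fq.gz` placeholder are hypothetical, and `time.perf_counter` records wall-clock time only, unlike the shell's `time`.

```python
import itertools
import subprocess
import time

SCRIPT = "scripts/abundance-dist-single.py"


def build_commands(reads, banding=("8", "1"), threads="4"):
    """Build the four benchmark commands, keyed by output label."""
    commands = {}
    for use_threads, use_banding in itertools.product([False, True], repeat=2):
        label = "{}-{}".format(
            "banding" if use_banding else "nobanding",
            "threads" if use_threads else "nothreads",
        )
        cmd = [SCRIPT, "--hash-function", "cyclic"]
        if use_threads:
            cmd += ["--threads", threads]
        if use_banding:
            cmd += ["--banding", *banding]
        cmd += ["--max-memory-usage", "25M", "--ksize", "25"]
        cmd += [reads, label + ".abunddist"]
        commands[label] = cmd
    return commands


def time_commands(commands):
    """Run each command once and return wall-clock seconds per label."""
    timings = {}
    for label, cmd in commands.items():
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        timings[label] = time.perf_counter() - start
    return timings
```

Calling `time_commands(build_commands("reads.fq.gz"))` from the khmer source directory would run all four configurations in turn; repeating each run a few times would give a better sense of the variance.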

standage, Nov 13 '17 22:11