BioSequences.jl

Demultiplexer scales badly with Hamming distance

Open tp2750 opened this issue 4 years ago • 0 comments

Background

I'm using Demultiplexer() to demultiplex nanopore reads. This works well, but the time to construct the demultiplexer grows very quickly as the number of allowed barcode errors (n_max_errors) increases.

Current Behavior

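Each additional allowed error makes construction more than 10 times more expensive, in both time and allocations. For a single 15-nt barcode, construction goes from about 0.4 ms at n_max_errors=1 to about 40 s at n_max_errors=5.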

Desired Behavior

It would be great if constructing the demultiplexer were faster for larger values of n_max_errors.

Steps to reproduce

julia> @time Demultiplexer(LongDNASeq.(["GGAGAAGAAGAAGAA"]), n_max_errors=1, distance=:hamming)
  0.000388 seconds (1.56 k allocations: 162.047 KiB)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 1

julia> @time Demultiplexer(LongDNASeq.(["GGAGAAGAAGAAGAA"]), n_max_errors=2, distance=:hamming)
  0.010063 seconds (50.47 k allocations: 3.590 MiB)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 2

julia> @time Demultiplexer(LongDNASeq.(["GGAGAAGAAGAAGAA"]), n_max_errors=3, distance=:hamming)
  0.193055 seconds (1.08 M allocations: 58.884 MiB)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 3

julia> @time Demultiplexer(LongDNASeq.(["GGAGAAGAAGAAGAA"]), n_max_errors=4, distance=:hamming)
  3.394650 seconds (15.94 M allocations: 734.229 MiB, 10.49% gc time)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 4

julia> @time Demultiplexer(LongDNASeq.(["GGAGAAGAAGAAGAA"]), n_max_errors=5, distance=:hamming)
 39.984839 seconds (169.53 M allocations: 7.118 GiB, 9.05% gc time)
Demultiplexer{LongSequence{DNAAlphabet{4}}}:
  distance: hamming
  number of barcodes: 1
  number of correctable errors: 5
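
A rough back-of-the-envelope count may help put these timings in context. The sketch below assumes (this is not confirmed from the BioSequences.jl source here) that the demultiplexer is built by enumerating every sequence within n_max_errors substitutions of each barcode, using only the four bases A, C, G and T. The helper name hamming_ball_size is hypothetical and not part of the package.

```julia
# Number of sequences within Hamming distance `d` of a barcode of length `n`,
# assuming substitutions are drawn from a 4-letter alphabet (A, C, G, T).
# `hamming_ball_size` is a hypothetical helper, not part of BioSequences.jl.
function hamming_ball_size(n::Integer, d::Integer; alphabet_size::Integer = 4)
    # Choose k mismatch positions, then one of (alphabet_size - 1) other bases at each.
    return sum(binomial(n, k) * (alphabet_size - 1)^k for k in 0:d)
end

# The barcode in this report is 15 nt long.
for d in 1:5
    println("n_max_errors = ", d, ": ", hamming_ball_size(15, d), " sequences")
end
```

For the 15-nt barcode this gives 46, 991, 13276, 123841 and 853570 sequences for n_max_errors = 1 to 5, so each step grows by roughly 7 to 21 times. The measured times above grow somewhat faster than that (roughly 12 to 26 times per step), so under this assumption the count alone would not fully account for the cost.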

My Environment

julia> versioninfo()
Julia Version 1.4.0
Commit b8e9a9ecc6 (2020-03-21 16:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, sandybridge)

julia> Pkg.status("BioSequences")
Status `~/.julia/environments/v1.4/Project.toml`
  [7e6ae17a] BioSequences v2.0.1

tp2750, Jun 01 '20 18:06