
Indexing improvements

Open marcelm opened this issue 1 year ago • 0 comments

See #685

  • [x] Log how long it took to create an index
  • [ ] Log how many adapter sequences were not included in the index (and possibly why)
  • [x] With --debug, do not log all adapters, only the first 10 or so
  • [ ] Relax the criterion for inclusion: For Hamming distance, allow up to three errors (benchmark this)
  • [x] Move indexing to Cython
  • [ ] Encode strings as integers
  • [x] When an N appears in the query, do not fall back to matching each adapter.
  • [ ] edit_environment could stop early when all entries in a row are equal to k
  • [ ] Above a certain threshold, it may make sense not to store all possible strings with edit/Hamming distance <=k in the index, but to split the work. For example, store the strings with k=0 and k=1 in the index, then generate all strings at edit/Hamming distance 1 from each query and look each one up individually, which covers k=2.
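
A minimal sketch of the split idea from the last item, restricted to Hamming distance and equal-length sequences. All names here (`hamming_neighborhood`, `build_index`, `lookup`) are hypothetical and are not cutadapt's API; the index is a plain dict. Handling edit distance would additionally need insertions and deletions in the neighborhood generator.

```python
from itertools import combinations, product

ALPHABET = "ACGT"  # assumption: plain DNA alphabet, no wildcards


def hamming_neighborhood(s, k):
    """Yield every string over ALPHABET within Hamming distance <= k of s."""
    yield s
    for d in range(1, k + 1):
        for positions in combinations(range(len(s)), d):
            # At each chosen position, substitute any character that differs.
            choices = [[c for c in ALPHABET if c != s[p]] for p in positions]
            for replacement in product(*choices):
                chars = list(s)
                for p, c in zip(positions, replacement):
                    chars[p] = c
                yield "".join(chars)


def build_index(adapters, k_index=1):
    """Map each string within distance k_index of an adapter to that adapter."""
    index = {}
    for adapter in adapters:
        for variant in hamming_neighborhood(adapter, k_index):
            index.setdefault(variant, set()).add(adapter)
    return index


def lookup(index, query, k_query=1):
    """Find adapters within distance k_index + k_query of the query.

    Each variant of the query is looked up individually. By the triangle
    inequality, a hit implies distance <= k_index + k_query, and every
    adapter within that distance shares at least one variant with the query.
    """
    hits = set()
    for variant in hamming_neighborhood(query, k_query):
        hits |= index.get(variant, set())
    return hits


adapters = ["ACGTACGT", "TTTTTTTT"]
index = build_index(adapters, k_index=1)
print(lookup(index, "AAGTACGA"))  # query with two mismatches to ACGTACGT
```

The trade-off: for an adapter of length 8, the index holds 25 strings per adapter at k=1 instead of 277 at k=2, at the cost of 25 dictionary lookups per query instead of one.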

marcelm • Mar 30 '23 07:03