cutadapt
Indexing improvements
See #685
- [x] Log how long it took to create an index
- [ ] Log how many adapter sequences were not included in the index (and possibly why)
- [x] With `--debug`, do not log all adapters, only the first 10 or so
- [ ] Relax the criterion for inclusion: For Hamming distance, allow up to three errors (benchmark this)
- [x] Move indexing to Cython
- [ ] Encode strings as integers
- [x] When an `N` appears in the query, do not fall back to matching each adapter.
- [ ] `edit_environment` could stop early when all entries in a row are equal to k
- [ ] Above a certain threshold, it may make sense to not store all possible strings that have edit/hamming distance <=k in the index, but to split this up. For example, store those with k=0 and k=1 in the index and then generate those with edit/hamming distance 1 from each query, then look each up individually, which will then get us to k=2.
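
To make the last item concrete, here is a minimal sketch of the split lookup for Hamming distance only. It is not cutadapt's implementation: the names `hamming_neighbors`, `build_index` and `lookup` are hypothetical, and the sketch assumes a plain `ACGT` alphabet and adapters of the same length as the query. The index stores each adapter plus its distance-1 variants (k=0 and k=1); at query time, the query and its distance-1 variants are looked up, which finds every adapter within distance 2.

```python
ALPHABET = "ACGT"  # assumption: plain DNA alphabet, no wildcards


def hamming_neighbors(s, alphabet=ALPHABET):
    """Yield s itself (distance 0) and every string at Hamming distance 1."""
    yield s
    for i, original in enumerate(s):
        for c in alphabet:
            if c != original:
                yield s[:i] + c + s[i + 1:]


def build_index(adapters):
    """Map each adapter and each of its distance-1 variants to the adapter(s)."""
    index = {}
    for adapter in adapters:
        for variant in hamming_neighbors(adapter):
            index.setdefault(variant, set()).add(adapter)
    return index


def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(x != y for x, y in zip(a, b))


def lookup(index, query):
    """Return {adapter: distance} for all adapters within distance 2 of query.

    Each variant of the query is within distance 1 of the query, and each
    index key is within distance 1 of its adapter, so a hit implies a total
    distance of at most 2.
    """
    candidates = set()
    for variant in hamming_neighbors(query):
        candidates |= index.get(variant, set())
    return {adapter: hamming(query, adapter) for adapter in candidates}


adapters = ["ACGTAC", "TTTTTT"]
index = build_index(adapters)
print(lookup(index, "AGGTAA"))  # two mismatches against ACGTAC
```

The trade-off is that the index grows like a k=1 index (1 + 3n entries per adapter of length n) instead of a k=2 index, while each query costs 1 + 3n dictionary lookups instead of one. Whether that pays off would need benchmarking, as would an extension to edit distance.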