Should EveryKmer and SpacedKmers be merged?
I think there could be an argument to merge them into one iterator. We may need to benchmark, but, The current iteration implementation for EveryKmer when the kmer-longseq alphabets match is:
@inline function Base.iterate(it::EveryKmer{Kmer{A,K,N},LongSequence{A}}, state) where {A,K,N}
i, fwkmer = state
i += 1
if i > it.stop
return nothing
else
bps = BioSequences.bits_per_symbol(A()) # Based on type info, should constant fold.
bits = UInt64(BioSequences.extract_encoded_element(it.seq, i))
kmer = leftshift_carry(fwkmer, bps, bits)
pos = i - K + 1
return (pos, Kmer{A,K,N}(kmer)), (i, kmer)
end
end
For SpacedKmers
it is:
@inline function Base.iterate(it::SpacedKmers{Kmer{A,K,N},LongSequence{A}}, state) where {A,K,N}
i, kmer = state
filled = it.filled
i += it.increment
for _ in filled:K-1
if i > it.stop
return nothing
else
bps = BioSequences.bits_per_symbol(A()) # Based on type info, should constant fold.
bits = UInt64(BioSequences.extract_encoded_element(it.seq, i))
kmer = leftshift_carry(kmer, bps, bits)
i += 1
end
end
pos = i - K + 1
return (pos, Kmer{A,K,N}(kmer)), (i, kmer)
end
These are almost identical, and indeed for the case where step were allowed to be 1, the outer for loop would do one pass, as filled would be K-1.
This would slash the code and number of types in half.
I think having EveryKmer with an optional space argument or something (that just defaults to 1) would be fine :shrug: . I'm all for simplifying
It's just occurred to me in the Refactor of SpacedKmers the code above has a minor bug I need to fix.
I'll have to benchmark to make sure that in practical uses there isn't some hidden performance benefit to using EveryKmer, like simpler iteration methods being easier to optimise.
I'm sceptical you can make the SpacedKmerIterator be as optimal as the EveryKmerIterator, but maybe you can.
If you can, then it's a clear complexity reduction which is nice and you should go for it.
What is the proposed interface? Still have EveryKmerIterator, just have it alias SpacedKmer with a space of 1? That sounds good. We can even add a docstring saying that the type is not guaranteed to be aliased in the future and may become its own type.