Kmers.jl icon indicating copy to clipboard operation
Kmers.jl copied to clipboard

Should EveryKmer and SpacedKmers be merged?

Open TransGirlCodes opened this issue 3 years ago • 4 comments

I think there could be an argument to merge them into one iterator. We may need to benchmark, but, The current iteration implementation for EveryKmer when the kmer-longseq alphabets match is:

@inline function Base.iterate(it::EveryKmer{Kmer{A,K,N},LongSequence{A}}, state) where {A,K,N}
    i, fwkmer = state
    i += 1
    if i > it.stop
        return nothing
    else
        bps = BioSequences.bits_per_symbol(A()) # Based on type info, should constant fold.
        bits = UInt64(BioSequences.extract_encoded_element(it.seq, i))
        kmer = leftshift_carry(fwkmer, bps, bits)
        pos = i - K + 1
        return (pos, Kmer{A,K,N}(kmer)), (i, kmer)
    end
end

For SpacedKmers

it is:

@inline function Base.iterate(it::SpacedKmers{Kmer{A,K,N},LongSequence{A}}, state) where {A,K,N}
    i, kmer = state
    filled = it.filled
    i += it.increment
    
    for _ in filled:K-1
        if i > it.stop
            return nothing
        else
            bps = BioSequences.bits_per_symbol(A()) # Based on type info, should constant fold.
            bits = UInt64(BioSequences.extract_encoded_element(it.seq, i))
            kmer = leftshift_carry(kmer, bps, bits)
            i += 1
        end
    end
    pos = i - K + 1
    return (pos, Kmer{A,K,N}(kmer)), (i, kmer)
end

These are almost identical, and indeed for the case where step were allowed to be 1, the outer for loop would do one pass, as filled would be K-1.

This would slash the code and number of types in half.

TransGirlCodes avatar Jun 08 '22 15:06 TransGirlCodes

I think having EveryKmer with an optional space argument or something (that just defaults to 1) would be fine :shrug: . I'm all for simplifying

kescobo avatar Jun 08 '22 15:06 kescobo

It's just occurred to me in the Refactor of SpacedKmers the code above has a minor bug I need to fix.

TransGirlCodes avatar Jun 09 '22 10:06 TransGirlCodes

I'll have to benchmark to make sure that in practical uses there isn't some hidden performance benefit to using EveryKmer, like simpler iteration methods being easier to optimise.

TransGirlCodes avatar Jun 09 '22 11:06 TransGirlCodes

I'm sceptical you can make the SpacedKmerIterator be as optimal as the EveryKmerIterator, but maybe you can. If you can, then it's a clear complexity reduction which is nice and you should go for it.

What is the proposed interface? Still have EveryKmerIterator, just have it alias SpacedKmer with a space of 1? That sounds good. We can even add a docstring saying that the type is not guaranteed to be aliased in the future and may become its own type.

jakobnissen avatar Jun 15 '22 07:06 jakobnissen