Kmers.jl

Proposal: Allow zero-length kmers

Open jakobnissen opened this issue 3 years ago • 5 comments

Currently, one can't make 0-length kmers:

julia> DNAKmer(dna"")
ERROR: ArgumentError: Bad kmer parameterisation. K must be greater than 0.
Stacktrace:
 [1] checkmer(#unused#::Type{DNAKmer{0, 0}})
   @ Kmers ~/.julia/packages/Kmers/7SNBQ/src/kmer.jl:414

I'm not sure I get the rationale for that. Sure, length 0 kmers are a little weird, but in general, containers in Julia can be length 0. That is, we have length 0 LongSequence, LongSubSeq, Vector, Set, Tuple, etc. I think it would be nicer to just allow it.
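As a minimal sketch of that point, the snippet below builds empty instances of a few Base containers and checks their lengths. LongSequence and LongSubSeq are left out only so that it runs without any packages installed; the variable names are illustrative.

```julia
# Empty instances of several Base containers
# (BioSequences types are omitted so no packages are needed).
empties = (Int[], Set{Int}(), (), "", Dict{Int,Int}())

# Each container reports length 0 and counts as empty.
lengths = map(length, empties)
all_empty = all(isempty, empties)

println(lengths)
println(all_empty)
```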

jakobnissen, Jun 10 '22 07:06

Hmm, but I think of a Kmer as closer to Char than String, and you can't have an empty Char. How would you iterate over the 0-length kmers of a sequence, for example?

kescobo, Jun 10 '22 14:06

I guess it depends on whether you consider Kmers as just LongSequences, i.e. ordered containers of BioSymbols that are optimised for a specific purpose. Or almost more of a BioSymbol itself, indeed you can consider a LongSequence as a container of kmers, as well as nucleotides. I think of them kinda as both, to be honest. As for iterating over 0-length kmers, I wonder if we can take inspiration from the Julia ecosystem: patterns for iterating over substrings, views of an array, or similar?

TransGirlCodes, Jun 11 '22 11:06

> Or almost more of a BioSymbol itself, indeed you can consider a LongSequence as a container of kmers, as well as nucleotides. I think of them kinda as both,

Actually, maybe the best comparison with the string ecosystem is a regular expression rather than a Char.

julia> st = "hello banana"
"hello banana"

julia> findall(r"ba", st)
1-element Vector{UnitRange{Int64}}:
 7:8

julia> findall(r"", st)
13-element Vector{UnitRange{Int64}}:
 1:0
 2:1
 3:2
 4:3
 5:4
 6:5
 7:6
 8:7
 9:8
 10:9
 11:10
 12:11
 13:12

I suppose this would argue in favor of 0-mers :shrug:, but I don't really like it. I would have thought r"" would throw an error...
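The 13 empty matches are consistent with the usual window count: a sequence of length n has n - k + 1 windows of length k, so k = 0 gives n + 1. A minimal sketch on plain strings, where `windows` is a hypothetical helper and not part of Kmers.jl, and the input is assumed to be ASCII:

```julia
# Hypothetical helper, not part of Kmers.jl: every length-k window of a string.
# Assumes ASCII input, so byte indices and character indices coincide.
function windows(s::AbstractString, k::Integer)
    k >= 0 || throw(ArgumentError("k must be non-negative"))
    # For k = 0 each range i:(i - 1) is empty, which yields an empty string.
    return [s[i:(i + k - 1)] for i in 1:(length(s) - k + 1)]
end

st = "hello banana"
two_mers = windows(st, 2)    # 12 - 2 + 1 = 11 windows
zero_mers = windows(st, 0)   # 12 - 0 + 1 = 13 windows, all empty

println(length(two_mers))
println(length(zero_mers))
println(all(isempty, zero_mers))
```

Under this reading, iterating the 0-mers of a sequence is well defined: it visits one empty window per position, plus one past the end.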

kescobo, Jun 11 '22 16:06

I definitely see Kmer as "just another BioSequence", and I see the characteristics of Kmers, namely their immutability and fixed length, as essentially implementation details. That is, if it were possible to produce equally efficient code using LongSequence, I don't know why I would ever use Kmer.

My analogy is that BioSequence is like AbstractVector, LongSequence is like Vector, LongSubSeq is like SubArray{T, 1} and Kmer is like StaticVector. Though not literally, of course: as we decided, BioSequence is not actually an AbstractVector.

Or, if you will, it corresponds to AbstractString, String, SubString, and InlineString, respectively.

That is, the different sequence types differ only in computer-sciency implementation details, like whether they are stack-allocated or not. IMO they should not be "biologically" different and, when possible, they should behave identically to each other, so that one can write generic code that takes a BioSequence and then plug in whatever subtype one wants.
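A rough illustration of that kind of generic code, using the string analogy so that it runs without BioSequences installed. The function `gc_count` is hypothetical, and treating String and SubString as stand-ins for LongSequence and LongSubSeq is an assumption made only for this sketch:

```julia
# Hypothetical generic function, written against the abstract type only.
# Strings stand in for BioSequence subtypes here (an assumption for illustration).
gc_count(seq::AbstractString) = count(c -> c in ('G', 'C'), seq)

whole = "ATGCGC"                 # String, playing the role of LongSequence
part = SubString(whole, 3, 6)    # SubString, playing the role of LongSubSeq
nothing_here = ""                # the empty case needs no special handling

println(gc_count(whole))
println(gc_count(part))
println(gc_count(nothing_here))
```

The same method body serves all three inputs, and the empty input simply counts zero matches.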

If kmers were Char-like, in my opinion, that would mean they were "atomic" primitive values, i.e. they did not contain elements (other than themselves, possibly).

jakobnissen, Jun 13 '22 07:06

I definitely disagree with myself from 3 days ago about the Char thing.

StaticVector is a good analogy too, I suppose. In any case, there are enough analogous types that support an empty form that I think we should probably allow 0-mers for consistency, even if it makes me grumpy :shrug:

kescobo, Jun 13 '22 13:06