`seq` and `Kmer` are impractical for everyday use
Off the top of my head:
- [ ] When loading `seq` via `bio.FASTA`, comparisons often fail because `s'a' != s'A'` (and most FASTAs are soft-masked and thus contain loads of lowercase letters). One has to work around this by doing `seq = seq(str(seq).upper())`.
- [ ] `seq = str` does not work
- [ ] `seq1 + seq1` does not work
- [ ] `seq1 + str1` does not work
- [ ] How do you get a k-mer from a sequence? `k = Kmer[20](s)`?
- [ ] How do you get a sequence from a k-mer? I can get a string via `str(k)`, but not a sequence (`seq(k)` fails).
- [ ] Many slicing operators do not work on `seq`s and `Kmer`s, greatly reducing their usability.
This is because a sequence is essentially just a string internally right now. Maybe we should have stricter requirements on what can be included in a sequence (i.e. just uppercase IUPAC characters? -- that would require converting when we read sequence data from disk).
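To make the "stricter requirements" idea concrete, here is a rough sketch in plain Python (not Seq, and not how `seq` is implemented). `StrictSeq`, `from_raw`, and the exact alphabet are hypothetical names and choices for illustration only; the alphabet assumes the IUPAC nucleotide codes without gap characters.

```python
# Hypothetical sketch in plain Python; not Seq's actual implementation.
# Assumed alphabet: uppercase IUPAC nucleotide codes, no gap characters.
IUPAC_UPPER = frozenset("ACGTURYSWKMBDHVN")


class StrictSeq:
    """Sequence type that only admits uppercase IUPAC nucleotide codes."""

    def __init__(self, data: str):
        bad = set(data) - IUPAC_UPPER
        if bad:
            raise ValueError(f"non-IUPAC or lowercase characters: {sorted(bad)}")
        self._data = data

    @classmethod
    def from_raw(cls, raw: str) -> "StrictSeq":
        # Normalize once at load time (e.g. for soft-masked FASTA records),
        # so later comparisons never see lowercase letters.
        return cls(raw.upper())

    def __eq__(self, other):
        if not isinstance(other, StrictSeq):
            return NotImplemented
        return self._data == other._data

    def __hash__(self):
        return hash(self._data)

    def __str__(self):
        return self._data
```

With something like this, `StrictSeq.from_raw("acgtNN") == StrictSeq("ACGTNN")` holds, at the cost of losing the soft-masking information unless it is stored separately.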
TBH I don't think `+` and `=` should be overloaded for `seq` + `str` -- they are different types and should be treated differently IMO. If this is really needed, then I think an explicit `seq1 + seq(str2)` is better -- just my opinion. `seq1 + seq2` is something we could support pretty easily.
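For illustration, the behaviour I have in mind looks roughly like this in plain Python (again a sketch, not Seq code; `DemoSeq` is a made-up stand-in for `seq`): concatenating two sequences works, mixing in a `str` is a type error, and the explicit conversion is the supported path.

```python
# Hypothetical stand-in for `seq`, written in plain Python for illustration.
class DemoSeq:
    def __init__(self, data: str):
        self._data = data

    def __add__(self, other):
        # Only sequence + sequence is defined; anything else is rejected.
        if isinstance(other, DemoSeq):
            return DemoSeq(self._data + other._data)
        return NotImplemented

    def __eq__(self, other):
        if not isinstance(other, DemoSeq):
            return NotImplemented
        return self._data == other._data

    def __hash__(self):
        return hash(self._data)

    def __str__(self):
        return self._data


seq1 = DemoSeq("ACGT")
str2 = "TTAA"

joined = seq1 + DemoSeq(str2)  # explicit conversion: fine

try:
    seq1 + str2  # implicit mixing: raises TypeError
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

The explicit form keeps it obvious at the call site where a plain string enters sequence land, which is also where any validation would happen.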
`k = Kmer[20](s)` is the right way to get a k-mer from a sequence. `seq(k)` to get a sequence from a k-mer is something we should probably add too.
We can also support more slices on `seq`. On `Kmer` it's a lot harder since slices change the type: e.g. `k[:3]` is of type `Kmer[3]` and `k[:4]` is of type `Kmer[4]` -- not sure what the best way to handle this is. Longer-term I'd prefer to unify k-mer types into a single type and have the compiler deduce and optimize cases where the k-mer length is constant.
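A rough plain-Python sketch of what a unified k-mer type could look like, with the length carried as a runtime value rather than a type parameter. `DynKmer` and its 2-bit packing are assumptions for illustration, not the current `Kmer[k]` implementation, and the sketch only handles `A`, `C`, `G`, `T`.

```python
# Hypothetical unified k-mer type: k is stored at runtime, so every slice
# has the same type regardless of its length. Not Seq's actual Kmer[k].
class DynKmer:
    _ENC = {"A": 0, "C": 1, "G": 2, "T": 3}
    _DEC = "ACGT"

    def __init__(self, bases: str):
        self.k = len(bases)
        self.bits = 0
        for b in bases:
            # Pack each base into 2 bits, first base in the highest bits.
            self.bits = (self.bits << 2) | self._ENC[b]

    def __len__(self):
        return self.k

    def __str__(self):
        return "".join(
            self._DEC[(self.bits >> (2 * (self.k - 1 - i))) & 3]
            for i in range(self.k)
        )

    def __getitem__(self, idx):
        if isinstance(idx, slice):
            # Slicing yields another DynKmer, just with a different k.
            return DynKmer(str(self)[idx])
        return str(self)[idx]


k = DynKmer("ACGTACGT")
a = k[:3]
b = k[:4]
```

Here `a` and `b` share one type even though their lengths differ. The trade-off is that the length check moves from compile time to run time unless the compiler can prove `k` is constant.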