`seq` and `Kmer` are impractical for everyday use

Open inumanag opened this issue 5 years ago • 1 comment

Off the top of my head:

[ ] When loading a seq via bio.FASTA, comparisons often fail because s'a' != s'A' (most FASTA files are soft-masked and thus contain many lowercase letters). One has to work around this with seq = seq(str(seq).upper()).
[ ] seq = str does not work
[ ] seq1 + seq1 does not work
[ ] seq1 + str1 does not work
[ ] How do you get a k-mer from a sequence? k = Kmer[20](s)?
[ ] How do you get a sequence from a k-mer? I can get a string via str(k), but not a sequence (seq(k) fails).
[ ] Many slicing operators do not work on seqs and Kmers, which greatly reduces their usability.

Mar 24 '20 20:03 inumanag

This is because a sequence is essentially just a string internally right now. Maybe we should have stricter requirements on what can be included in a sequence (i.e. just uppercase IUPAC characters? -- that would require converting when we read sequence data from disk).
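
To make the "convert on read" idea concrete, here is a minimal plain-Python sketch, not Seq code and not how bio.FASTA works today. The helper name `normalize_sequence` and the decision to reject anything outside the IUPAC nucleotide alphabet are assumptions made for illustration only.

```python
# Hypothetical helper: illustrates the idea only, not part of Seq or bio.FASTA.
# IUPAC nucleotide codes: the four bases, U, and the ambiguity codes.
IUPAC_NUCLEOTIDES = frozenset("ACGTURYSWKMBDHVN")


def normalize_sequence(raw: str) -> str:
    """Uppercase a possibly soft-masked sequence and reject non-IUPAC characters."""
    upper = raw.upper()
    invalid = sorted(set(upper) - IUPAC_NUCLEOTIDES)
    if invalid:
        raise ValueError(f"non-IUPAC characters in sequence: {invalid}")
    return upper
```

With normalization done once at load time, a soft-masked read such as `acgtN` and its unmasked form `ACGTN` would compare equal afterwards, at the cost of losing the masking information.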

TBH I don't think + and = should be overloaded for seq+str -- they are different types and they should be treated differently IMO. If this is really needed then I think an explicit seq1 + seq(str2) is better -- just my opinion. seq1 + seq2 is something we could support pretty easily.
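
As a rough illustration of that design choice, the plain-Python toy below supports seq + seq but refuses seq + str, so the caller has to convert explicitly. The class name `Seq` and its `bases` field are hypothetical stand-ins, not Seq's actual `seq` type.

```python
class Seq:
    """Toy stand-in for a sequence type (hypothetical, not Seq's real `seq`)."""

    def __init__(self, bases: str):
        self.bases = bases

    def __add__(self, other):
        if isinstance(other, Seq):
            return Seq(self.bases + other.bases)
        # Returning NotImplemented makes Python raise TypeError for Seq + str.
        return NotImplemented

    def __eq__(self, other):
        if isinstance(other, Seq):
            return self.bases == other.bases
        return NotImplemented

    def __hash__(self):
        return hash(self.bases)

    def __repr__(self):
        return f"Seq({self.bases!r})"
```

Under this sketch, `Seq("AC") + Seq("GT")` works, `Seq("AC") + "GT"` raises `TypeError`, and `Seq("AC") + Seq("GT")` written with an explicit conversion of the string operand is the supported spelling.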

k = Kmer[20](s) is right for that. seq(k) to get a sequence from a k-mer is something we should probably add too.

We can also support more slices on seq. On Kmer it's a lot harder since slices change the type: e.g. k[:3] is of type Kmer[3] and k[:4] is of type Kmer[4] -- not sure what the best way to handle this is. Longer-term I'd prefer to unify k-mer types into a single type and have the compiler deduce and optimize cases where the k-mer length is constant.
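
For illustration, a unified k-mer type could look roughly like the plain-Python toy below, where the length is a runtime field instead of a type parameter, so every slice has the same type. The class, the 2-bit packing, and the restriction to A/C/G/T are assumptions of this sketch, not Seq's implementation.

```python
class Kmer:
    """Toy k-mer with a runtime length (hypothetical; Seq uses Kmer[k] types)."""

    _ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
    _DECODE = "ACGT"

    def __init__(self, bases: str):
        self.k = len(bases)
        self.bits = 0
        for base in bases:
            # Pack each base into two bits, first base in the highest bits.
            self.bits = (self.bits << 2) | self._ENCODE[base]

    def __len__(self):
        return self.k

    def __str__(self):
        return "".join(
            self._DECODE[(self.bits >> (2 * (self.k - 1 - i))) & 3]
            for i in range(self.k)
        )

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice is another Kmer; only its runtime length differs.
            return Kmer(str(self)[index])
        return str(self)[index]
```

Here `k[:3]` and `k[:4]` are both plain `Kmer` values of lengths 3 and 4. The fixed-length specialization mentioned above would then be a compiler optimization rather than something visible in the type.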

Mar 25 '20 14:03 arshajii