BioSequences.jl
BioSequences.jl copied to clipboard
How should one search for ORFs?
CC @camilogarciabotero, ref. #297
How would a user go about findings ORFs with BioSequences? To do this, they need to be able to do several fundamental operations that BioSequences currently doesn't support:
-
BioRegex
doesn't support mostfind*
functions, likefindnext
andfindall
. - There is no way to search for an amino acid in a nucseq, given a translation table.
- There is no way to search only in-frame.
It's possible this needs to be added in Kmers.jl, since Kmers contain the RNACodon type, the CodonSet type, and reverse translation. Kmers.jl is currently in an incomplete state, but I'm slowly working rewriting it.
So, my current proposal is:
- Implement the
find*
functions forBioRegex
. - In Kmers, allow searching for a CodonSet in a nucseq.
Maybe we should include a convenience function in Kmers that is something like this:
function find_next(s::BioSequence, codons::CodonSet, from::Index)
v = view(s, from:lastindex(s))
for (step, kmer) in enumerate(SpacedKmers{RNAAlphabet{2}, 3, 3}(v))
kmer ∈codons && return 3 * (step-1) + from
end
nothing
end
We should definitely support find*
, and the idea of in-frame searching would indeed be cool.
Maybe we should include a convenience function in Kmers
findnext
is alread in base, is there a reason we woudn't extend that?
Sorry, I wasn't being clear. We should definitely extend the find*
methods for bioregex, regardless of this issue.
To search for stop codons or amino acids in a nucleotide sequence, I think we need the Codon
, CodonSet
reverse translation, and SpacedKmers
functionality from Kmers.jl
, so we might want to implement the whole "search for codon thing" there.
I don't super love the distinction - really, codons and reverse translation has little to do with kmers and should maybe belong in this package, but implementation wise it's just soooo much nicer to have codons be implemented as 3-mers.
I mean, codons seem like 3-mers to me :shrug: - actually, they might be the O.G. Kmer :laughing:
I suppose my questions would be
- Is Kmers.jl going to be sufficiently low-level that it makes sense for this package to depend on it?
- Is there a way to do a naive implementation of search w/o Kmers, in a way that can be made more efficient / ergonomic long-term once Kmers.jl is done?
Thanks for opening the issue out of the PR, I did have my doubts about the implementation and where to put it. Finding that there were other predicates in BioSequences
looked to me like it was a suitable place, but it indeed is very idiosyncratic for broad use.
This hasprematurestop
method came up first (now I'm using it somewhere else...) from the ORF library ( GeneFinder.jl
) I've been working on. And I did have these exact issues you mentioned when trying to implement a findorf
method:
BioRegex doesn't support most find* functions, like findnext and findall. There is no way to search for an amino acid in a nucseq, given a translation table. There is no way to search only in-frame.
The implementation I ended up with was a little cumbersome: It first creates an iterator of a BioRegex
that match a complex ORF:
function locationiterator(sequence::LongNucOrView{N}; alternative_start::Bool = false) where N
regorf = alternative_start ? biore"DTG(?:[N]{3})*?T(AG|AA|GA)"dna : biore"ATG(?:[N]{3})*?T(AG|AA|GA)"dna
finder(x) = findfirst(regorf, sequence, first(x) + 1) # + 3
itr = takewhile(!isnothing, iterated(finder, findfirst(regorf, sequence)))
return itr
end
And then traverse the sequence looking for the iterator match in both strands of the sequence and stores the ORFs that match the BioRegex
:
function findorfs(sequence::LongNucOrView{N}; alternative_start::Bool = false, min_len::Int64 = 6) where N
orfs = Vector{ORF}()
reversedseq = reverse_complement(sequence)
seqlen = length(sequence)
frames = Dict(0 => 3, 1 => 1, 2 => 2)
for strand in ('+', '-')
seq = strand == '-' ? reversedseq : sequence
@inbounds for location in locationiterator(seq; alternative_start)
if length(location) >= min_len
frame = strand == '+' ? frames[location.start % 3] : frames[(seqlen - location.stop + 1) % 3]
push!(orfs, ORF(strand == '+' ? location : (seqlen - location.stop + 1):(seqlen - location.start + 1), strand, frame))
end
end
end
return sort(orfs)
end
It takes care of the frame issue by simply calculating the module of the start position of each match (location.start % 3
). It is now working fine for most ORF cases. But it allocates tones of memory. The idea to have find_next
and the CodonSet
as well as a SpacedKmer
would definitely improve the way in which this implementation could go.
Another alternative I was thinking about (but I lack the expertise) is to create a better Regex
and create a finite automaton that parses it and then stores the matches (I actually don't know if it is viable). But the problem is that the BioRegex
uses non-deterministic expression (*?
) that operates lazily to capture even the smallest ORF possible (M*
).
Please take a look here: https://github.com/Roleren/ORFik/blob/master/src/findOrfsFasta.cpp
Its well tested implementation and very fast at the same time.