BioSequences.jl icon indicating copy to clipboard operation
BioSequences.jl copied to clipboard

How should one search for ORFs?

Open jakobnissen opened this issue 1 year ago • 5 comments

CC @camilogarciabotero, ref. #297

How would a user go about findings ORFs with BioSequences? To do this, they need to be able to do several fundamental operations that BioSequences currently doesn't support:

  • BioRegex doesn't support most find* functions, like findnext and findall.
  • There is no way to search for an amino acid in a nucseq, given a translation table.
  • There is no way to search only in-frame.

It's possible this needs to be added in Kmers.jl, since Kmers contain the RNACodon type, the CodonSet type, and reverse translation. Kmers.jl is currently in an incomplete state, but I'm slowly working rewriting it.

So, my current proposal is:

  • Implement the find* functions for BioRegex.
  • In Kmers, allow searching for a CodonSet in a nucseq.

Maybe we should include a convenience function in Kmers that is something like this:

function find_next(s::BioSequence, codons::CodonSet, from::Index)
    v = view(s, from:lastindex(s))
    for (step, kmer) in enumerate(SpacedKmers{RNAAlphabet{2}, 3, 3}(v))
        kmer ∈codons && return 3 * (step-1) + from
    end
    nothing
end

jakobnissen avatar Oct 23 '23 09:10 jakobnissen

We should definitely support find*, and the idea of in-frame searching would indeed be cool.

Maybe we should include a convenience function in Kmers

findnext is alread in base, is there a reason we woudn't extend that?

kescobo avatar Oct 23 '23 13:10 kescobo

Sorry, I wasn't being clear. We should definitely extend the find* methods for bioregex, regardless of this issue. To search for stop codons or amino acids in a nucleotide sequence, I think we need the Codon, CodonSet reverse translation, and SpacedKmers functionality from Kmers.jl, so we might want to implement the whole "search for codon thing" there. I don't super love the distinction - really, codons and reverse translation has little to do with kmers and should maybe belong in this package, but implementation wise it's just soooo much nicer to have codons be implemented as 3-mers.

jakobnissen avatar Oct 23 '23 13:10 jakobnissen

I mean, codons seem like 3-mers to me :shrug: - actually, they might be the O.G. Kmer :laughing:

I suppose my questions would be

  1. Is Kmers.jl going to be sufficiently low-level that it makes sense for this package to depend on it?
  2. Is there a way to do a naive implementation of search w/o Kmers, in a way that can be made more efficient / ergonomic long-term once Kmers.jl is done?

kescobo avatar Oct 23 '23 13:10 kescobo

Thanks for opening the issue out of the PR, I did have my doubts about the implementation and where to put it. Finding that there were other predicates in BioSequences looked to me like it was a suitable place, but it indeed is very idiosyncratic for broad use.

This hasprematurestop method came up first (now I'm using it somewhere else...) from the ORF library ( GeneFinder.jl) I've been working on. And I did have these exact issues you mentioned when trying to implement a findorf method:

BioRegex doesn't support most find* functions, like findnext and findall. There is no way to search for an amino acid in a nucseq, given a translation table. There is no way to search only in-frame.

The implementation I ended up with was a little cumbersome: It first creates an iterator of a BioRegex that match a complex ORF:

function locationiterator(sequence::LongNucOrView{N}; alternative_start::Bool = false) where N
    regorf = alternative_start ? biore"DTG(?:[N]{3})*?T(AG|AA|GA)"dna : biore"ATG(?:[N]{3})*?T(AG|AA|GA)"dna
    finder(x) = findfirst(regorf, sequence, first(x) + 1) # + 3
    itr = takewhile(!isnothing, iterated(finder, findfirst(regorf, sequence)))
    return itr
end

And then traverse the sequence looking for the iterator match in both strands of the sequence and stores the ORFs that match the BioRegex:

function findorfs(sequence::LongNucOrView{N}; alternative_start::Bool = false, min_len::Int64 = 6) where N
    orfs = Vector{ORF}()
    reversedseq = reverse_complement(sequence)
    seqlen = length(sequence)

    frames = Dict(0 => 3, 1 => 1, 2 => 2)

    for strand in ('+', '-')
        seq = strand == '-' ? reversedseq : sequence

        @inbounds for location in locationiterator(seq; alternative_start)
            if length(location) >= min_len
                frame = strand == '+' ? frames[location.start % 3] : frames[(seqlen - location.stop + 1) % 3]
                push!(orfs, ORF(strand == '+' ? location : (seqlen - location.stop + 1):(seqlen - location.start + 1), strand, frame))
            end
        end
    end
    return sort(orfs)
end

It takes care of the frame issue by simply calculating the module of the start position of each match (location.start % 3). It is now working fine for most ORF cases. But it allocates tones of memory. The idea to have find_next and the CodonSet as well as a SpacedKmer would definitely improve the way in which this implementation could go.

Another alternative I was thinking about (but I lack the expertise) is to create a better Regex and create a finite automaton that parses it and then stores the matches (I actually don't know if it is viable). But the problem is that the BioRegex uses non-deterministic expression (*?) that operates lazily to capture even the smallest ORF possible (M*).

camilogarciabotero avatar Oct 23 '23 14:10 camilogarciabotero

Please take a look here: https://github.com/Roleren/ORFik/blob/master/src/findOrfsFasta.cpp

Its well tested implementation and very fast at the same time.

JokingHero avatar Dec 11 '23 09:12 JokingHero