edlib icon indicating copy to clipboard operation
edlib copied to clipboard

Allow constrained pairwise alignment

Open rob-p opened this issue 7 years ago • 6 comments

Hi,

I'm very interested in using edlib in some of the projects we're working on. One feature that would be great to have is the ability to build an alignment around a set of fixed constraints. Specifically, I'm interested in the case where I have one or more seeds (exact matches) known to be shared between the query and the target, and I'm interested in the optimal alignment given that it contains these seeds. How difficult would a capability like this be to add to edlib?

Thanks! Rob

rob-p avatar Feb 21 '17 18:02 rob-p

Hi @rob-p, I am glad that you find Edlib useful :)! I would to hear more about the projects and how would Edlib help you, that is always super interesting to hear.

Ok, let me make sure that I understood you well. Seeds are defined with their exact positions in both query and target? For e.g., if I have query q and target t, and I am given a seed s of length ls which starts in q at position sq and starts in t at position st, then we are looking for the optimal alignment such that q[sq:sq+ls] is perfectly aligned (only matches) with t[st:st+ls]?

Illustration on dynamic programming matrix: alignment-seeds

In this example red would be normal optimal alignment, without seeds, while green would be optimal alignment that passes through seed s (blue).

Of course, there may be multiple seeds. Is this what we are talking about? For each seed, we know its length and exact position in both query and target?

Martinsos avatar Feb 21 '17 21:02 Martinsos

Hi @Martinsos,

Well, I'm interested in potentially adding some alignment functionality to our transcriptome mapping tool, and potentially using that information downstream in our transcript quantification tool. We have some ideas how to make use of the selective full alignment of some (but not all) sequenced reads.

Your understanding of what I was describing is exactly right. Given the query and target, there will be an array of exact matches (potentially of different sizes and potentially along different diagonals). Each match will represent an exactly shared substring between the query and the reference, and we'd be interested in an alignment that is constrained to contain these matches.

rob-p avatar Feb 22 '17 21:02 rob-p

Those are cool tools, I will make sure to give them a better look and I hope you will find Edlib useful for them :).

Regarding the feature, I believe that the best way to find such optimal alignment that includes given seeds is to just split query and target on pieces regarding seeds and then align each of those (query_piece, target_piece) pairs independently and combine results.

Example - let's say that we have two seeds, s1 and s2 of some length, then we can represent query and target as:

t = t1 s1 t2 s2 t3
q = q1 s1 q2 s2 q3

Now, we would call edlib align() for each of pairs: [(q1, t1), (q2, t2), (q3, t3)]. Total edit distance is just sum of those 3 edit distances, while total optimal alignment path is concatenation of those three with inserted correct number of matches for each seed.

If I was not clear regarding this explanation please ask me to elaborate!

Taking into consideration how simple this is to implement, I feel like it may be better not to add it as a feature to Edlib, as it would make it more complicated while not giving a lot of value. However, I may not be correct - what do you think?

I forgot to ask, we are talking about global alignment here (NW in edlib), correct? Not prefix (SHW) or infix (HW)?

Martinsos avatar Feb 22 '17 21:02 Martinsos

Let me think a little bit about the splitting claim. But to answer your question about the alignment type, I'm thinking both about global alignment and alignment allowing for soft clipping (which I believe is infix in edlib terminology).

I also had another, slightly-unrelated, feature request / performance thought. Since my applications deal mostly with the alignment of many short sequence query pairs rather than some smaller number of very large alignments, it seems that there are a few non-alignment-related things that might end up taking quite a bit of time. Specifically, these are the repeated allocation and deallocation of the internal alignment objects (having some kind of buffer, or the ability to pass in memory to be used for the alignment task would be an ideal way around this) and the potentially repeated transformation of the query and target strings into arrays of integers (I assume the internal implementation makes it impossible to perform the alignment directly on e.g. ASCII encoded strings). I realize that addressing these might not really be within the ideal scope of edlib, which seems geared more toward the efficient pairwise alignment of long sequences. However, it would be great to have access to edlib's efficient internal algorithms for the task of aligning many short reads (or parts thereof).

rob-p avatar Feb 22 '17 22:02 rob-p

For global, splitting is simple -> just split and run global align on each piece.

For infix, splitting is a little bit more complicated, but still doable: you would use prefix align for first piece, global for those in the middle, and the last you would first inverse and then use prefix align on it. Edit distance is again just sum of edit distances of pieces, alignment path is just a little bit tricky because you have to inverse the alignment path of last piece before concatenating them, but that is also not hard actually.

For prefix, it would be prefix align for first piece and global for the rest of pieces.

It actually all just comes down to calling edlib.align() multiple times and then combining results. But I see how this may be a problem if one does not know that this can be done this way, so adding it as a feature may make sense hm (I would implement it the same way as I described)! How useful do you think this feature would be, is alignment with seeds something often needed in your opinion?

For starters, I could give you a hand (if needed) with writing a piece of code to handle the alignment with seeds, and if that turns out useful I could add it to Edlib as a feature (which demands some more work since it has to work for all three alignment methods).

Wow this is a long answer :D :potato: . Let me think a little bit about the rest of your question (edlib for short reads) and I will answer it somewhat later if that is ok. In the mean time, when you say shorter reads, what lengths would that be, approx.?

Martinsos avatar Feb 22 '17 22:02 Martinsos

Regarding alignment of many short sequences -> is it many short pairs (q1, t1), (q2, t2) or many short queries all on the same target (q1, t), (q2, t), and are both query and target short, or just query?

You are right that Edlib is less efficient for short sequences (< 100bp) than for the longer ones.

Also, I agree that "housekeeping jobs" (allocation/delocation, input preparation, ...) certainly take bigger portion of time for short sequences, and that it could be possible to reduce total time spent on housekeeping in case when we have many short sequences if all those short sequences are processed in a batch.
What I am not sure about is how much time is housekeeping actually taking - it may be so little that it is not worth optimizing it. I should certainly do some profiling on this, to see what is happening.

Another thing is that Edlib's algorithm by its nature is always about to loose part of its "power" if query is shorter than 64 (on 64-bit computers). This happens because Edlib aligns query to target in chunks of 64 residues from the query - so it takes first 64 residues from the query, does some calculation with the target, moves to the second 64 residues from the query and so on. When calculating those 64 at once, big speedup is achieved because of parallelization - while some other algorithm processes 1 residues, Edlib processes 64 (of course reduced with some factor because of housekeeping, but ideally 64).
Because of this, I would say that the best approach to improving Edlib's speed for shorter sequences is to actually implement another algorithm, that is more appropriate for short sequences, and then have Edlib internally decide which one to use based on the sequence size.

This is something I would certainly like to do in the future, but since it is a lot of work it may take some time for me to get to it. I just opened an issue for it (#77).

Martinsos avatar Feb 23 '17 08:02 Martinsos