strobealign Chaining

This PR introduces collinear chaining as the new default method for mapping/aligning in Strobealign, replacing NAMs. NAMs are still supported and can be enabled using the --nams flag.

It adds new CLI parameters to fine-tune the chaining behavior: --nams Use NAMs instead of collinear chaining -H [INT] Chaining look-back heuristic (default: 50) --gd=[FLOAT] Diagonal gap cost (default: 0.1) --gl=[FLOAT] Gap length cost (default: 0.05) --vp=[FLOAT] Best chain score threshold (default: 0.7) --sg=[INT] Skip distance allowed on the reference

I should add that the CI pipeline currently fails because the output format and alignments produced by collinear chaining differ from those of NAMs, causing test assertions to break, and should be handled at some point.

Next Steps

[ ] Update test suite to reflect new chaining behavior
[ ] Document changes usage in CHANGES.md
[x] Benchmark performances comparing chaining vs NAMs
[ ] Update the names of some structs, since right now the chaining algorithm is reporting NAMs.
[x] Fix the reporting of chains that cover the same region of a reference

Jun 04 '25 09:06 NicolasBuchin

Awesome! Let’s get this into shape so it can be merged.

I’m looking at the failing tests at the moment. When I run tests/run.sh, I get:

Failure running 'diff tests/phix.se.sam phix.se.sam'

When I look at the diff, I can see that the alignments are actually all identical before and after but that the mapping quality changes. Whereas before, we had 42 alignments with quality 60 and 3 with quality 0, afterwards we get 24 with quality 60 and 21 with quality zero. Alignment quality 0 means that there are multiple equally good alignments. This may be an indication that multiple chains are found that result in the same alignment. Either we need to detect these duplicates and remove them or the chaining algorithm needs to be changed to not produce them in the first place. The latter would be preferable because computing alignments is expensive and these duplicates seem to happen quite often.

Jun 10 '25 19:06 marcelm

You are right, we do report chains that cover the same region on a reference.

I fixed it, by preventing reporting chains with an already used anchor. The tests should pass now.

Jun 11 '25 13:06 NicolasBuchin

You are right, we do report chains that cover the same region on a reference.

I fixed it, by preventing reporting chains with an already used anchor.

Nice, good that was a small fix.

The tests should pass now.

They don’t but the failure is somewhere else (in the PAF output). Don’t you run the tests locally? You should do that before every push.

Jun 12 '25 07:06 marcelm

Nice, good that was a small fix.

Communicated with Nicolas offline. The fix does not guarantee that the highest scoring chain is chosen. For example, if we have anchors [a_1,a_2,a_3,a_4,a_5,a_6] with the DP score [1,2,3,3,5,4], the current solution will report [a_3,a_4,a_6] (with score 4; over reporting threshold), but will then skip [a_3,a_4,a_5] with score 5. Nicolas is looking into solutions. One hack is to always start with reporting the optimal chain solution (we should have the DP index), then do a pass reporting all other non overlapping solutions.

Jun 12 '25 07:06 ksahlin

I have now rebased this PR on top of main and cleaned up the commit history as much as I could. I also ran all the tests for each commit and set the Is-new-baseline: yes trailer if necessary.

It is unfortunate that GitHub only shows my avatar next to the commits that Nicolas has authored, making it appear as if I were the author (maybe it is because Nicolas uses the default avatar). However, the commits themselves are correctly attributed to Nicolas as author (while I am the committer).

There are some unaddressed review comments that I will look into next before merging the PR.

The original history for this PR is in the branch chaining-original. I’ll delete that some time after merging if no one needs it.

Sep 01 '25 11:09 marcelm

very nice!

Sep 01 '25 12:09 ksahlin

I think this deserves to be merged now.

The chaining code as it is here is what we are running in our current benchmarks, so it makes sense to mark that somehow. I’ll open separate, smaller PRs for the few remaining issues.

@NicolasBuchin Great work, congratulations!

Sep 01 '25 13:09 marcelm

Great stuff @NicolasBuchin!

Sep 02 '25 04:09 ksahlin