WFA icon indicating copy to clipboard operation
WFA copied to clipboard

The `static-band` heuristic returns different alignments when it shouldn't with both `gap-affine2p-wfa` and `gap-affine-wfa`

Open AndreaGuarracino opened this issue 9 months ago • 1 comments

Using gap-affine2p-wfa with sequences in x.zip, I get the following optimal exact alignment

align_benchmark -a gap-affine2p-wfa --affine2p-penalties 0,3,4,2,24,1 --wfa-memory high -i x.seq --wfa-heuristic none --output o 2> /dev/null && cat o
-1414   2X1M1I3M3X1M1X1M106D6M1X4M2X1M1X3M1X3M1X1M1X1M1X1M1X1M3X1M1X1M1X1M1X1M1X9M1X1M1X12M1X9M1X10M1X3M1X7M4I292M2I7250M1X284M9D113M2X220M1X1536M1X1498M1X1469M1X3988M1X2560M2D1323M1X180M1X5316M1X552M3D247M2D7819M1X111M1X3M1X14M1X6M1X30M1X8M1X5M1X7M1X1M1X17M1X4M2X2M1X22M1X11M1X5M2X15M1X1M1X1M1X14M1X4M1X13M1X26M1X24M1X7M1X5M1X30M2X2M2X8M1X12M1X10M1X103M1X44M1X26M1X28M2X5M1D2M1X15M1X3M2X4M1X14M1X35M1X4M1X5M1X5M1X1M2X14M1X9M1X5M1X5M1X3M1X4M1X1M1X10M1X3M1X21M1X17M1X1M1I7M1X5M1X6M1X4M1X4M3I2M1X2M1X6M1X8M1X14M1X15M1X5M1X4M1X5M1X14M1X2M1X22M1X8M3X4M1X30M1X24M1X8M1X19M1X6M1X13M1X20M1X19M1X18M1X9M1X24M1X7M1X2M1X9M2X32M1X21M1X30M1X18M1X11M1X20M1X3M1X7M1X27M1X23M1X1M3X22M1X4M1X24M1X13M699I2X

which has min_diag, max_diag = (-116, 587), so I tried

align_benchmark -a gap-affine2p-wfa --affine2p-penalties 0,3,4,2,24,1 --wfa-memory high -i x.seq --wfa-heuristic banded-static --wfa-heuristic-parameters -116,587 --output o 2> /dev/null && cat o
-1414   2X1M1I3M3X1M1X1M106D6M1X4M2X1M1X3M1X3M1X1M1X1M1X1M1X1M3X1M1X1M1X1M1X1M1X9M1X1M1X12M1X9M1X10M1X3M1X7M4I292M2I7250M1X284M9D113M2X220M1X1536M1X1498M1X1469M1X3988M1X2560M2D1323M1X180M1X5316M1X552M3D247M2D7819M1X111M1X3M1X14M1X6M1X30M1X8M1X5M1X7M1X1M1X17M1X4M2X2M1X22M1X11M1X5M2X15M1X1M1X1M1X14M1X4M1X13M1X26M1X24M1X7M1X5M1X30M2X2M2X8M1X12M1X10M1X103M1X44M1X26M1X28M2X5M1D2M1X15M1X3M2X4M1X14M1X35M1X4M1X5M1X5M1X1M2X14M1X9M1X5M1X5M1X3M1X4M1X1M1X10M1X3M1X21M1X17M1X1M1I7M1X5M1X6M1X4M1X4M3I2M1X2M1X6M1X8M1X14M1X15M1X5M1X4M1X5M1X14M1X2M1X22M1X8M3X4M1X30M1X24M1X8M1X19M1X6M1X13M1X20M1X19M1X18M1X9M1X24M1X7M1X2M1X9M2X32M1X21M1X30M1X18M1X11M1X20M1X3M1X7M1X27M1X23M1X1M3X22M1X4M1X24M1X13M1X699I1X

Getting a different alignment, but with the same score. was expecting the same CIGAR string.

To get the correct alignment, I had to run min_diag-3,max_diag+0

align_benchmark -a gap-affine2p-wfa --affine2p-penalties 0,3,4,2,24,1 --wfa-memory high -i x.seq --wfa-heuristic banded-static --wfa-heuristic-parameters -119,587 --output o 2> /dev/null && cat o
-1414   2X1M1I3M3X1M1X1M106D6M1X4M2X1M1X3M1X3M1X1M1X1M1X1M1X1M3X1M1X1M1X1M1X1M1X9M1X1M1X12M1X9M1X10M1X3M1X7M4I292M2I7250M1X284M9D113M2X220M1X1536M1X1498M1X1469M1X3988M1X2560M2D1323M1X180M1X5316M1X552M3D247M2D7819M1X111M1X3M1X14M1X6M1X30M1X8M1X5M1X7M1X1M1X17M1X4M2X2M1X22M1X11M1X5M2X15M1X1M1X1M1X14M1X4M1X13M1X26M1X24M1X7M1X5M1X30M2X2M2X8M1X12M1X10M1X103M1X44M1X26M1X28M2X5M1D2M1X15M1X3M2X4M1X14M1X35M1X4M1X5M1X5M1X1M2X14M1X9M1X5M1X5M1X3M1X4M1X1M1X10M1X3M1X21M1X17M1X1M1I7M1X5M1X6M1X4M1X4M3I2M1X2M1X6M1X8M1X14M1X15M1X5M1X4M1X5M1X14M1X2M1X22M1X8M3X4M1X30M1X24M1X8M1X19M1X6M1X13M1X20M1X19M1X18M1X9M1X24M1X7M1X2M1X9M2X32M1X21M1X30M1X18M1X11M1X20M1X3M1X7M1X27M1X23M1X1M3X22M1X4M1X24M1X13M699I2X

Similar story with sequences in y.zip

align_benchmark -a gap-affine2p-wfa --affine2p-penalties 0,3,4,2,24,1 --wfa-memory high -i y.seq --wfa-heuristic none --output o 2> /dev/null && cat o
-738    514M1D366M1X104M1X92M1X197M1X261M3I664M1X126M1X17M1X572M1X525M1X234M5D320M1X584M1X221M1X64M1X231M1X971M1X586M1X324M1X305M1X344M1X38M1X14M1X109M1X283M1X211M1X1011M1X77M1X33M1X136M1X1190M1X779M1X302M1X420M1X539M1X317M1X2M1X782M1X425M1X496M1X35M1X723M1X146M1X77M1X454M1X16M2D8M1X146M2X14M1X208M158I15M1X6M1X4M2X5M1X2M1X3M2X3M2X1M1X22M1I2M1X6M2X2M1X1M1X7M3X6M2X1M1X4M1X10M2X25M1D6M5D10M1X1M2I4M2X1M235D1M

min_diag, max_diag = (-85, 154)

# min_diag, max_diag <-- NOPE: different CIGAR string and different score
align_benchmark -a gap-affine2p-wfa --affine2p-penalties 0,3,4,2,24,1 --wfa-memory high -i y.seq --wfa-heuristic banded-static --wfa-heuristic-parameters -85,154 --output o 2> /dev/null && cat o
-740    514M1D366M1X104M1X92M1X197M1X261M3I664M1X126M1X17M1X572M1X525M1X234M5D320M1X584M1X221M1X64M1X231M1X971M1X586M1X324M1X305M1X344M1X38M1X14M1X109M1X283M1X211M1X1011M1X77M1X33M1X136M1X1190M1X779M1X302M1X420M1X539M1X317M1X2M1X782M1X425M1X496M1X35M1X723M1X146M1X77M1X454M1X16M2D8M1X146M2X14M1X208M158I15M1X6M1X4M2X5M1X2M1X3M2X3M2X1M1X22M1I2M1X6M2X2M1X1M1X7M3X6M2X1M1X4M1X10M2X25M1D6M5D10M228D1M4X6M5D1M

# min_diag - 1, max_diag + 1 <-- NOPE
# min_diag - 2, max_diag + 2 <-- NOPE
#...
# min_diag - 47, max_diag + 47 worked in getting the same CIGAR string
./WFA2-lib/bin/align_benchmark -a gap-affine2p-wfa --affine2p-penalties 0,3,4,2,24,1 --wfa-memory high -i y.seq --wfa-heuristic banded-static --wfa-heuristic-parameters -132,201 --output o 2> /dev/null && cat o
-738    514M1D366M1X104M1X92M1X197M1X261M3I664M1X126M1X17M1X572M1X525M1X234M5D320M1X584M1X221M1X64M1X231M1X971M1X586M1X324M1X305M1X344M1X38M1X14M1X109M1X283M1X211M1X1011M1X77M1X33M1X136M1X1190M1X779M1X302M1X420M1X539M1X317M1X2M1X782M1X425M1X496M1X35M1X723M1X146M1X77M1X454M1X16M2D8M1X146M2X14M1X208M158I15M1X6M1X4M2X5M1X2M1X3M2X3M2X1M1X22M1I2M1X6M2X2M1X1M1X7M3X6M2X1M1X4M1X10M2X25M1D6M5D10M1X1M2I4M2X1M235D1M

I also get this behaviour with gap-affine-wfa with sequences in and z.zip.

AndreaGuarracino avatar Mar 04 '25 13:03 AndreaGuarracino

We've been able to reproduce the issue and are investigating the root cause of the problem. In the meantime, I've created a patch that provides a temporary fix while we work on a more definitive solution:

cd WFA2-lib/
curl https://gist.githubusercontent.com/quim0/36a7f1a1c0f52d396e61eec94408cc46/raw/02b118bee2b9b3c6e690ae82f22650b07c719ad5/gistfile1.txt > fix.patch
git apply fix.patch

quim0 avatar Mar 06 '25 13:03 quim0

@AndreaGuarracino Can you check whether this issue is fully resolved by the last commit (without the temporal patch)?

smarco avatar Apr 23 '25 13:04 smarco

I still have problems. I've tested over 5 million cases, and I report here 233 failing tests.

  • penalties: 5 (mismatch), 8 (gap_open1), 2 (gap_ext1), 24 (gap_open2), 1 (gap_ext2)
  • CIGAR from tracepoints is the CIGAR string obtained with WFA2-lib without heuristics.
  • CIGAR from single_band_tracepoints is the CIGAR string obtained with WFA2-lib and a symmetric static band -X-1,+X+1 (X is the last number of the single_band_tracepoints row)
  • CIGAR from double_band_tracepoints is the CIGAR string obtained with WFA2-lib and a not symmetric static band -X-1,+Y+1 (X and Y are the last numbers of the double_band_tracepoints row).
  • seqa and seqb are the sequences that triggered the bug.

wfa2lib-static-band-bugs.report.zip

AndreaGuarracino avatar Jun 04 '25 15:06 AndreaGuarracino

A not nice temporary solution to avoid the above bugs is to increase the band not by -X-1,+X+1 or -X-1,+Y+1 but by -X-108,+X+108 or -X-108,+Y+108.

AndreaGuarracino avatar Jun 04 '25 15:06 AndreaGuarracino

Is there any super good news about this?

AndreaGuarracino avatar Sep 03 '25 17:09 AndreaGuarracino