WFA
WFA copied to clipboard
The `static-band` heuristic returns different alignments when it shouldn't with both `gap-affine2p-wfa` and `gap-affine-wfa`
Using gap-affine2p-wfa with sequences in x.zip, I get the following optimal exact alignment
align_benchmark -a gap-affine2p-wfa --affine2p-penalties 0,3,4,2,24,1 --wfa-memory high -i x.seq --wfa-heuristic none --output o 2> /dev/null && cat o
-1414 2X1M1I3M3X1M1X1M106D6M1X4M2X1M1X3M1X3M1X1M1X1M1X1M1X1M3X1M1X1M1X1M1X1M1X9M1X1M1X12M1X9M1X10M1X3M1X7M4I292M2I7250M1X284M9D113M2X220M1X1536M1X1498M1X1469M1X3988M1X2560M2D1323M1X180M1X5316M1X552M3D247M2D7819M1X111M1X3M1X14M1X6M1X30M1X8M1X5M1X7M1X1M1X17M1X4M2X2M1X22M1X11M1X5M2X15M1X1M1X1M1X14M1X4M1X13M1X26M1X24M1X7M1X5M1X30M2X2M2X8M1X12M1X10M1X103M1X44M1X26M1X28M2X5M1D2M1X15M1X3M2X4M1X14M1X35M1X4M1X5M1X5M1X1M2X14M1X9M1X5M1X5M1X3M1X4M1X1M1X10M1X3M1X21M1X17M1X1M1I7M1X5M1X6M1X4M1X4M3I2M1X2M1X6M1X8M1X14M1X15M1X5M1X4M1X5M1X14M1X2M1X22M1X8M3X4M1X30M1X24M1X8M1X19M1X6M1X13M1X20M1X19M1X18M1X9M1X24M1X7M1X2M1X9M2X32M1X21M1X30M1X18M1X11M1X20M1X3M1X7M1X27M1X23M1X1M3X22M1X4M1X24M1X13M699I2X
which has min_diag, max_diag = (-116, 587), so I tried
align_benchmark -a gap-affine2p-wfa --affine2p-penalties 0,3,4,2,24,1 --wfa-memory high -i x.seq --wfa-heuristic banded-static --wfa-heuristic-parameters -116,587 --output o 2> /dev/null && cat o
-1414 2X1M1I3M3X1M1X1M106D6M1X4M2X1M1X3M1X3M1X1M1X1M1X1M1X1M3X1M1X1M1X1M1X1M1X9M1X1M1X12M1X9M1X10M1X3M1X7M4I292M2I7250M1X284M9D113M2X220M1X1536M1X1498M1X1469M1X3988M1X2560M2D1323M1X180M1X5316M1X552M3D247M2D7819M1X111M1X3M1X14M1X6M1X30M1X8M1X5M1X7M1X1M1X17M1X4M2X2M1X22M1X11M1X5M2X15M1X1M1X1M1X14M1X4M1X13M1X26M1X24M1X7M1X5M1X30M2X2M2X8M1X12M1X10M1X103M1X44M1X26M1X28M2X5M1D2M1X15M1X3M2X4M1X14M1X35M1X4M1X5M1X5M1X1M2X14M1X9M1X5M1X5M1X3M1X4M1X1M1X10M1X3M1X21M1X17M1X1M1I7M1X5M1X6M1X4M1X4M3I2M1X2M1X6M1X8M1X14M1X15M1X5M1X4M1X5M1X14M1X2M1X22M1X8M3X4M1X30M1X24M1X8M1X19M1X6M1X13M1X20M1X19M1X18M1X9M1X24M1X7M1X2M1X9M2X32M1X21M1X30M1X18M1X11M1X20M1X3M1X7M1X27M1X23M1X1M3X22M1X4M1X24M1X13M1X699I1X
Getting a different alignment, but with the same score. was expecting the same CIGAR string.
To get the correct alignment, I had to run min_diag-3,max_diag+0
align_benchmark -a gap-affine2p-wfa --affine2p-penalties 0,3,4,2,24,1 --wfa-memory high -i x.seq --wfa-heuristic banded-static --wfa-heuristic-parameters -119,587 --output o 2> /dev/null && cat o
-1414 2X1M1I3M3X1M1X1M106D6M1X4M2X1M1X3M1X3M1X1M1X1M1X1M1X1M3X1M1X1M1X1M1X1M1X9M1X1M1X12M1X9M1X10M1X3M1X7M4I292M2I7250M1X284M9D113M2X220M1X1536M1X1498M1X1469M1X3988M1X2560M2D1323M1X180M1X5316M1X552M3D247M2D7819M1X111M1X3M1X14M1X6M1X30M1X8M1X5M1X7M1X1M1X17M1X4M2X2M1X22M1X11M1X5M2X15M1X1M1X1M1X14M1X4M1X13M1X26M1X24M1X7M1X5M1X30M2X2M2X8M1X12M1X10M1X103M1X44M1X26M1X28M2X5M1D2M1X15M1X3M2X4M1X14M1X35M1X4M1X5M1X5M1X1M2X14M1X9M1X5M1X5M1X3M1X4M1X1M1X10M1X3M1X21M1X17M1X1M1I7M1X5M1X6M1X4M1X4M3I2M1X2M1X6M1X8M1X14M1X15M1X5M1X4M1X5M1X14M1X2M1X22M1X8M3X4M1X30M1X24M1X8M1X19M1X6M1X13M1X20M1X19M1X18M1X9M1X24M1X7M1X2M1X9M2X32M1X21M1X30M1X18M1X11M1X20M1X3M1X7M1X27M1X23M1X1M3X22M1X4M1X24M1X13M699I2X
Similar story with sequences in y.zip
align_benchmark -a gap-affine2p-wfa --affine2p-penalties 0,3,4,2,24,1 --wfa-memory high -i y.seq --wfa-heuristic none --output o 2> /dev/null && cat o
-738 514M1D366M1X104M1X92M1X197M1X261M3I664M1X126M1X17M1X572M1X525M1X234M5D320M1X584M1X221M1X64M1X231M1X971M1X586M1X324M1X305M1X344M1X38M1X14M1X109M1X283M1X211M1X1011M1X77M1X33M1X136M1X1190M1X779M1X302M1X420M1X539M1X317M1X2M1X782M1X425M1X496M1X35M1X723M1X146M1X77M1X454M1X16M2D8M1X146M2X14M1X208M158I15M1X6M1X4M2X5M1X2M1X3M2X3M2X1M1X22M1I2M1X6M2X2M1X1M1X7M3X6M2X1M1X4M1X10M2X25M1D6M5D10M1X1M2I4M2X1M235D1M
min_diag, max_diag = (-85, 154)
# min_diag, max_diag <-- NOPE: different CIGAR string and different score
align_benchmark -a gap-affine2p-wfa --affine2p-penalties 0,3,4,2,24,1 --wfa-memory high -i y.seq --wfa-heuristic banded-static --wfa-heuristic-parameters -85,154 --output o 2> /dev/null && cat o
-740 514M1D366M1X104M1X92M1X197M1X261M3I664M1X126M1X17M1X572M1X525M1X234M5D320M1X584M1X221M1X64M1X231M1X971M1X586M1X324M1X305M1X344M1X38M1X14M1X109M1X283M1X211M1X1011M1X77M1X33M1X136M1X1190M1X779M1X302M1X420M1X539M1X317M1X2M1X782M1X425M1X496M1X35M1X723M1X146M1X77M1X454M1X16M2D8M1X146M2X14M1X208M158I15M1X6M1X4M2X5M1X2M1X3M2X3M2X1M1X22M1I2M1X6M2X2M1X1M1X7M3X6M2X1M1X4M1X10M2X25M1D6M5D10M228D1M4X6M5D1M
# min_diag - 1, max_diag + 1 <-- NOPE
# min_diag - 2, max_diag + 2 <-- NOPE
#...
# min_diag - 47, max_diag + 47 worked in getting the same CIGAR string
./WFA2-lib/bin/align_benchmark -a gap-affine2p-wfa --affine2p-penalties 0,3,4,2,24,1 --wfa-memory high -i y.seq --wfa-heuristic banded-static --wfa-heuristic-parameters -132,201 --output o 2> /dev/null && cat o
-738 514M1D366M1X104M1X92M1X197M1X261M3I664M1X126M1X17M1X572M1X525M1X234M5D320M1X584M1X221M1X64M1X231M1X971M1X586M1X324M1X305M1X344M1X38M1X14M1X109M1X283M1X211M1X1011M1X77M1X33M1X136M1X1190M1X779M1X302M1X420M1X539M1X317M1X2M1X782M1X425M1X496M1X35M1X723M1X146M1X77M1X454M1X16M2D8M1X146M2X14M1X208M158I15M1X6M1X4M2X5M1X2M1X3M2X3M2X1M1X22M1I2M1X6M2X2M1X1M1X7M3X6M2X1M1X4M1X10M2X25M1D6M5D10M1X1M2I4M2X1M235D1M
I also get this behaviour with gap-affine-wfa with sequences in and z.zip.
We've been able to reproduce the issue and are investigating the root cause of the problem. In the meantime, I've created a patch that provides a temporary fix while we work on a more definitive solution:
cd WFA2-lib/
curl https://gist.githubusercontent.com/quim0/36a7f1a1c0f52d396e61eec94408cc46/raw/02b118bee2b9b3c6e690ae82f22650b07c719ad5/gistfile1.txt > fix.patch
git apply fix.patch
@AndreaGuarracino Can you check whether this issue is fully resolved by the last commit (without the temporal patch)?
I still have problems. I've tested over 5 million cases, and I report here 233 failing tests.
- penalties: 5 (mismatch), 8 (gap_open1), 2 (gap_ext1), 24 (gap_open2), 1 (gap_ext2)
CIGAR from tracepointsis the CIGAR string obtained with WFA2-lib without heuristics.CIGAR from single_band_tracepointsis the CIGAR string obtained with WFA2-lib and a symmetric static band -X-1,+X+1 (X is the last number of thesingle_band_tracepointsrow)CIGAR from double_band_tracepointsis the CIGAR string obtained with WFA2-lib and a not symmetric static band -X-1,+Y+1 (X and Y are the last numbers of thedouble_band_tracepointsrow).seqaandseqbare the sequences that triggered the bug.
A not nice temporary solution to avoid the above bugs is to increase the band not by -X-1,+X+1 or -X-1,+Y+1 but by -X-108,+X+108 or -X-108,+Y+108.
Is there any super good news about this?