strobealign
Optimize parameters (again)
Here are suggested new indexing parameters for all read lengths.
This supersedes #397.
I ran the optimization script for both v0.12.0 (commit 6fd4c5d) and multi-context seeds (commit c4a7f61).
Differences to #397:
- "accuracy slack" is set to 0.1: The accuracy of a single dataset may drop 0.1 percentage points below the baseline without being excluded from further consideration. This is intended to avoid running into local maxima.
- Optimization criterion is regular accuracy, not score-based accuracy.
Command used:
./search.py -c ${commit} -x --accuracy-slack 0.1 --mapping-rate-slack 1 -r ${read_length}
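As a rough illustration of the "accuracy slack" idea described above, the exclusion rule can be sketched as follows. This is only a sketch under assumptions: the function name `within_slack`, the dictionary layout, and the example accuracies are hypothetical and not taken from `search.py`.

```python
def within_slack(baseline, candidate, slack=0.1):
    """Return True if no dataset's accuracy drops more than `slack`
    percentage points below its baseline accuracy.

    `baseline` and `candidate` map dataset names to accuracy in percent.
    (Hypothetical data layout, chosen for illustration only.)
    """
    return all(candidate[name] >= baseline[name] - slack for name in baseline)


# Made-up accuracies, not results from the benchmark
baseline = {"drosophila": 90.0, "CHM13": 80.0}
kept = within_slack(baseline, {"drosophila": 89.95, "CHM13": 80.3})
dropped = within_slack(baseline, {"drosophila": 89.8, "CHM13": 85.0})
print(kept, dropped)
```

With a slack of 0.1, the first candidate stays in consideration although one dataset gets slightly worse, while the second is excluded because one dataset drops by 0.2 points, regardless of the gain on the other.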
Suggested changes
Parameters are given as a tuple $(k, s, l, u)$.
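For readers unfamiliar with the tuple, the following sketch shows how $l$ and $u$ could translate into a strobe sampling window. It assumes the convention that $l$ and $u$ are lower and upper offsets from $k/(k-s+1)$, counted in syncmers, with the lower bound clamped to 1. The thread itself does not spell this out, and the function name `strobe_window` is made up for illustration.

```python
def strobe_window(k, s, l, u):
    """Derive (w_min, w_max) from a (k, s, l, u) tuple.

    Assumption: l and u are offsets from k // (k - s + 1), and the
    lower window bound is never smaller than 1.
    """
    base = k // (k - s + 1)
    w_min = max(1, base + l)
    w_max = base + u
    return w_min, w_max


print(strobe_window(20, 16, -2, 2))
print(strobe_window(16, 12, -2, 0))
```

Under this assumption, a negative $l$ or $u$ simply moves the window closer to the first strobe.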
I did not mechanically pick the settings that optimize mapping-only accuracy, but made sure that they also work well for extension alignment mode. Many parameter settings are found that are essentially equally good, so it was possible for me to find settings that work equally well for v0.12.0 and multi-context seeds, except for read lengths 100 and 150.
Readl. | Before | Suggestion | Comment |
---|---|---|---|
50 | (18, 14, -2, 1) | (16, 12, -2, 0) | |
75 | (20, 16, -3, 2) | (20, 16, -3, -1) | alternative: (21, 17, -3, -1) |
100 | (20, 16, -2, 2) | (16, 12, 1, 3) | for v0.12. Alternative (17, 13, 1, 3) is very similar |
100 | (20, 16, -2, 2) | (18, 14, 1, 3) | for multi-context seeds |
125 | (20, 16, -1, 4) | - | not measured |
150 | (20, 16, 1, 7) | (20, 16, 2, 5) | for v0.12. Reduces ext. alignment SE accuracy slightly; alternative (20, 16, 2, 8) would not (but improve mapping-only PE accuracy much less) |
150 | (20, 16, 1, 7) | (22, 18, 3, 5) | for multi-context seeds. Reduces ext. alignment SE accuracy slightly; alternative (23, 19, 2, 7) would not (but improve mapping-only PE accuracy a bit less) |
200 | (22, 18, 2, 12) | (24, 20, 4, 12) | |
300 | (22, 18, 2, 12) | (24, 20, 5, 13) | |
500 | (23, 17, 2, 12) | (25, 19, 7, 13) | |
We only have canonical read length 250. Using the interpolated parameters (24, 20, 5, 12) or (24, 20, 4, 12) gives ok results for read lengths 200 and 300.
The script was run in a mode where it optimizes mapping-only accuracy. I am currently running it to optimize extension-alignment accuracy. In theory, the results could be different. So far, for the read lengths that are finished (currently 50, 75, 100), they are not.
Details for v0.12
This shows how mapping-only and extension-alignment accuracy change for the suggested parameters.
Readlen. | kslu | maponly SE | maponly PE | extalign SE | extalign PE |
---|---|---|---|---|---|
50 | (16, 12, -2, 0) | +0.7657 | +1.1146 | +0.9158 | +0.2027 |
75 | (20, 16, -3, -1) | -0.0090 | +0.1043 | +0.0296 | +0.0170 |
75 | (21, 17, -3, -1) | +0.0397 | +0.0744 | -0.0139 | +0.0229 |
100 | (16, 12, 1, 3) | +0.6626 | +0.4397 | +0.2958 | +0.1274 |
100 | (17, 13, 1, 3) | +0.6701 | +0.4101 | +0.2421 | +0.1311 |
150 | (20, 16, 2, 5) | -0.0016 | +0.0917 | -0.0119 | +0.0357 |
150 | (20, 16, 2, 8) | +0.1089 | +0.0357 | +0.0204 | +0.0241 |
200 | (24, 20, 4, 12) | +0.0516 | +0.0533 | +0.0041 | +0.0295 |
200 | (24, 20, 5, 12) | +0.0150 | +0.0496 | ||
300 | (24, 20, 4, 12) | +0.1591 | +0.0674 | ||
300 | (24, 20, 5, 12) | +0.1725 | +0.0729 | ||
300 | (24, 20, 5, 13) | +0.2264 | +0.0809 | +0.0438 | +0.0315 |
500 | (25, 19, 7, 13) | +0.2737 | +0.1441 | +0.0520 | +0.0306 |
More details
Details have been shortened because GitHub’s maximum comment size was reached.
# v0.12.0

## Read length 50: Weighted SE/PE results - mapping-only

parameters | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|
(16, 12, -2, 0) | 63.0373 | 74.6368 | +0.7657 | +1.1146 | 96.585 | 96.573 | +0.457 | +0.443 | pareto |
(16, 12, -2, 1) | 62.9708 | 74.5278 | +0.6992 | +1.0056 | 96.693 | 96.680 | +0.564 | +0.551 | |
(16, 12, -2, 2) | 62.9749 | 74.5250 | +0.7033 | +1.0028 | 96.694 | 96.682 | +0.565 | +0.552 | |
(17, 13, -2, 0) | 62.7352 | 74.2179 | +0.4637 | +0.6957 | 96.297 | 96.293 | +0.169 | +0.163 | |
(17, 13, -2, 1) | 62.6388 | 74.0388 | +0.3673 | +0.5166 | 96.426 | 96.421 | +0.297 | +0.291 | |
(17, 13, -2, 2) | 62.6360 | 74.0215 | +0.3645 | +0.4993 | 96.430 | 96.426 | +0.302 | +0.296 | |
(18, 14, -2, 0) | 62.4612 | 73.7873 | +0.1897 | +0.2651 | 95.983 | 95.983 | -0.145 | -0.146 | |
(18, 14, -2, 1) | 62.2715 | 73.5222 | -0.0000 | +0.0000 | 96.128 | 96.130 | +0.000 | +0.000 | ***** |
(18, 14, -2, 2) | 62.2643 | 73.4951 | -0.0072 | -0.0270 | 96.137 | 96.139 | +0.009 | +0.009 | |

## Read length 50: Weighted SE/PE results - with extension alignment

parameters | sacc_se | sacc_pe | diff_se | diff_pe | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(16, 12, -2, 0) | 92.8490 | 97.3130 | +0.4439 | -0.0220 | 66.0610 | 80.8919 | +0.9158 | +0.2027 | 96.585 | 99.455 | +0.457 | -0.056 | pareto |
(18, 14, -2, 1) | 92.4051 | 97.3350 | +0.0000 | -0.0000 | 65.1452 | 80.6892 | +0.0000 | +0.0000 | 96.128 | 99.511 | +0.000 | +0.000 | |

## Read length 75: Weighted SE/PE results - mapping-only

parameters | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|
(20, 16, -3, -1) | 71.5283 | 82.4506 | -0.0090 | +0.1043 | 98.817 | 98.823 | -0.182 | -0.181 | pareto |
(21, 17, -3, -1) | 71.5770 | 82.4207 | +0.0397 | +0.0744 | 98.747 | 98.750 | -0.252 | -0.254 | pareto |
(20, 16, -3, 0) | 71.5360 | 82.3733 | -0.0013 | +0.0269 | 98.979 | 98.984 | -0.020 | -0.020 | |
(20, 16, -3, 2) | 71.5373 | 82.3463 | -0.0000 | +0.0000 | 98.999 | 99.004 | +0.000 | +0.000 | ***** |
(20, 16, -3, 3) | 71.5373 | 82.3463 | -0.0000 | +0.0000 | 98.999 | 99.004 | +0.000 | +0.000 | |
(20, 16, -3, 1) | 71.5378 | 82.3412 | +0.0004 | -0.0051 | 98.999 | 99.004 | -0.000 | -0.000 | |

## Read length 75: Weighted SE/PE results - with extension alignment

parameters | sacc_se | sacc_pe | diff_se | diff_pe | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(20, 16, -3, -1) | 95.6523 | 98.4026 | -0.1678 | -0.0217 | 74.8279 | 86.7332 | +0.0296 | +0.0170 | 98.817 | 99.761 | -0.182 | -0.029 | pareto |
(21, 17, -3, -1) | 95.6500 | 98.4203 | -0.1701 | -0.0040 | 74.7843 | 86.7391 | -0.0139 | +0.0229 | 98.747 | 99.754 | -0.252 | -0.036 | pareto |
(20, 16, -3, 2) | 95.8202 | 98.4242 | +0.0000 | +0.0000 | 74.7982 | 86.7161 | +0.0000 | +0.0000 | 98.999 | 99.790 | +0.000 | +0.000 | ***** |

## Read length 100: Weighted SE/PE results - mapping-only

parameters | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|
(16, 12, 1, 3) | 77.3307 | 86.5965 | +0.6626 | +0.4397 | 99.181 | 99.180 | -0.301 | -0.301 | pareto |
(17, 13, 1, 3) | 77.3382 | 86.5670 | +0.6701 | +0.4101 | 99.118 | 99.120 | -0.364 | -0.361 | pareto |
(18, 14, 1, 3) | 77.3075 | 86.5272 | +0.6393 | +0.3704 | 99.042 | 99.040 | -0.440 | -0.442 | |
(17, 13, 0, 3) | 76.9034 | 86.3236 | +0.2352 | +0.1668 | 99.387 | 99.387 | -0.095 | -0.094 | |
(18, 14, 0, 3) | 76.9109 | 86.3185 | +0.2428 | +0.1617 | 99.344 | 99.342 | -0.138 | -0.140 | |
(16, 12, 0, 3) | 76.8668 | 86.3232 | +0.1986 | +0.1664 | 99.446 | 99.446 | -0.036 | -0.035 | |
(18, 14, 0, 2) | 76.8087 | 86.3316 | +0.1405 | +0.1747 | 99.250 | 99.250 | -0.232 | -0.232 | |
(17, 13, 0, 2) | 76.7686 | 86.3275 | +0.1004 | +0.1707 | 99.292 | 99.293 | -0.190 | -0.188 | |
(19, 15, 0, 3) | 76.8939 | 86.2880 | +0.2257 | +0.1311 | 99.283 | 99.285 | -0.199 | -0.196 | |
(19, 15, 0, 2) | 76.7953 | 86.2902 | +0.1272 | +0.1333 | 99.202 | 99.201 | -0.280 | -0.280 | |
(16, 12, 0, 2) | 76.7280 | 86.3065 | +0.0598 | +0.1496 | 99.347 | 99.347 | -0.135 | -0.134 | |
(20, 16, -2, 2) | 76.6682 | 86.1568 | +0.0000 | +0.0000 | 99.482 | 99.481 | +0.000 | +0.000 | ***** |
(20, 16, -2, 3) | 76.6916 | 86.1485 | +0.0234 | -0.0083 | 99.491 | 99.490 | +0.009 | +0.008 | |
(21, 17, -2, 1) | 76.5986 | 86.1625 | -0.0696 | +0.0057 | 99.393 | 99.392 | -0.089 | -0.089 | |

## Read length 100: Weighted SE/PE results - with extension alignment

parameters | sacc_se | sacc_pe | diff_se | diff_pe | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(16, 12, 1, 3) | 96.8270 | 98.8985 | +0.0900 | +0.0661 | 80.2273 | 89.7221 | +0.2958 | +0.1274 | 99.181 | 99.770 | -0.301 | -0.036 | pareto |
(17, 13, 1, 3) | 96.8201 | 98.9200 | +0.0831 | +0.0876 | 80.1736 | 89.7258 | +0.2421 | +0.1311 | 99.118 | 99.769 | -0.364 | -0.036 | pareto |
(20, 16, -2, 2) | 96.7371 | 98.8324 | +0.0000 | +0.0000 | 79.9315 | 89.5947 | +0.0000 | +0.0000 | 99.482 | 99.805 | +0.000 | +0.000 | ***** |

## Read length 150: Weighted SE/PE results - mapping-only

parameters | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|
(21, 17, 2, 6) | 83.8433 | 90.4057 | +0.0664 | +0.0783 | 99.662 | 99.662 | -0.065 | -0.064 | pareto |
(20, 16, 2, 5) | 83.7752 | 90.4192 | -0.0016 | +0.0917 | 99.666 | 99.668 | -0.061 | -0.059 | pareto |
(20, 16, 2, 6) | 83.8460 | 90.3985 | +0.0692 | +0.0711 | 99.685 | 99.686 | -0.042 | -0.041 | pareto |
(21, 17, 2, 5) | 83.7807 | 90.4081 | +0.0038 | +0.0806 | 99.645 | 99.643 | -0.082 | -0.084 | pareto |
(20, 16, 3, 6) | 83.8200 | 90.3907 | +0.0432 | +0.0633 | 99.629 | 99.627 | -0.098 | -0.100 | |
(21, 17, 2, 7) | 83.8663 | 90.3741 | +0.0895 | +0.0466 | 99.672 | 99.672 | -0.055 | -0.055 | pareto |
(20, 16, 2, 8) | 83.8858 | 90.3632 | +0.1089 | +0.0357 | 99.699 | 99.698 | -0.028 | -0.029 | pareto |
(20, 16, 2, 7) | 83.8592 | 90.3694 | +0.0824 | +0.0420 | 99.695 | 99.695 | -0.032 | -0.032 | |
(20, 16, 3, 5) | 83.7589 | 90.3940 | -0.0179 | +0.0666 | 99.584 | 99.582 | -0.143 | -0.145 | |
(19, 15, 3, 7) | 83.7862 | 90.3850 | +0.0093 | +0.0575 | 99.709 | 99.708 | -0.018 | -0.019 | |
(22, 18, 2, 7) | 83.8621 | 90.3577 | +0.0852 | +0.0302 | 99.644 | 99.641 | -0.083 | -0.086 | |
(21, 17, 2, 8) | 83.8798 | 90.3507 | +0.1030 | +0.0232 | 99.682 | 99.680 | -0.045 | -0.046 | |
(20, 16, 3, 7) | 83.8378 | 90.3611 | +0.0609 | +0.0337 | 99.648 | 99.647 | -0.079 | -0.080 | |
(19, 15, 4, 7) | 83.7849 | 90.3693 | +0.0081 | +0.0418 | 99.649 | 99.648 | -0.078 | -0.079 | |
(19, 15, 4, 8) | 83.8282 | 90.3539 | +0.0513 | +0.0264 | 99.671 | 99.670 | -0.056 | -0.056 | |
(19, 15, 3, 8) | 83.8379 | 90.3502 | +0.0610 | +0.0228 | 99.717 | 99.716 | -0.010 | -0.011 | |
(20, 16, 3, 8) | 83.8600 | 90.3425 | +0.0832 | +0.0150 | 99.661 | 99.659 | -0.067 | -0.068 | |
(19, 15, 4, 6) | 83.7073 | 90.3794 | -0.0696 | +0.0520 | 99.611 | 99.610 | -0.116 | -0.117 | |
(21, 17, 1, 7) | 83.7966 | 90.3424 | +0.0197 | +0.0149 | 99.712 | 99.711 | -0.015 | -0.015 | |
(18, 14, 4, 8) | 83.7598 | 90.3319 | -0.0170 | +0.0044 | 99.693 | 99.690 | -0.034 | -0.036 | |
(20, 16, 1, 7) | 83.7768 | 90.3275 | +0.0000 | +0.0000 | 99.727 | 99.727 | +0.000 | +0.000 | ***** |
(20, 16, 1, 8) | 83.8208 | 90.3148 | +0.0440 | -0.0127 | 99.730 | 99.729 | +0.003 | +0.002 | |
(21, 17, 1, 8) | 83.8313 | 90.3107 | +0.0545 | -0.0167 | 99.715 | 99.714 | -0.012 | -0.013 | |
(18, 14, 3, 8) | 83.7611 | 90.3143 | -0.0157 | -0.0131 | 99.728 | 99.727 | +0.001 | +0.000 | |
(22, 18, 1, 6) | 83.7280 | 90.3213 | -0.0488 | -0.0061 | 99.681 | 99.680 | -0.046 | -0.046 | |
(22, 18, 1, 7) | 83.7775 | 90.3080 | +0.0007 | -0.0194 | 99.686 | 99.684 | -0.041 | -0.043 | |
(19, 15, 2, 8) | 83.7611 | 90.3010 | -0.0158 | -0.0264 | 99.741 | 99.740 | +0.014 | +0.013 | |
(18, 14, 5, 8) | 83.6952 | 90.2884 | -0.0816 | -0.0391 | 99.633 | 99.630 | -0.094 | -0.096 | |

## Read length 150: Weighted SE/PE results - with extension alignment

parameters | sacc_se | sacc_pe | diff_se | diff_pe | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(21, 17, 2, 5) | 97.8650 | 99.2698 | -0.0823 | +0.0274 | 86.1799 | 92.3607 | -0.0529 | +0.0468 | 99.645 | 99.780 | -0.082 | -0.001 | pareto |
(20, 16, 2, 5) | 97.8731 | 99.2638 | -0.0742 | +0.0214 | 86.2209 | 92.3496 | -0.0119 | +0.0357 | 99.666 | 99.780 | -0.061 | -0.001 | pareto |
(21, 17, 2, 7) | 97.9767 | 99.2781 | +0.0294 | +0.0357 | 86.2365 | 92.3424 | +0.0037 | +0.0285 | 99.672 | 99.781 | -0.055 | -0.000 | pareto |
(20, 16, 2, 8) | 98.0105 | 99.2648 | +0.0632 | +0.0225 | 86.2533 | 92.3380 | +0.0204 | +0.0241 | 99.699 | 99.781 | -0.028 | -0.000 | pareto |
(21, 17, 2, 6) | 97.9248 | 99.2729 | -0.0225 | +0.0305 | 86.2055 | 92.3472 | -0.0273 | +0.0333 | 99.662 | 99.781 | -0.065 | -0.000 | |
(20, 16, 2, 6) | 97.9302 | 99.2642 | -0.0171 | +0.0218 | 86.2404 | 92.3337 | +0.0076 | +0.0198 | 99.685 | 99.781 | -0.042 | -0.000 | |
(20, 16, 1, 7) | 97.9473 | 99.2424 | -0.0000 | -0.0000 | 86.2328 | 92.3139 | +0.0000 | +0.0000 | 99.727 | 99.781 | +0.000 | +0.000 | ***** |

## Read length 200: Weighted SE/PE results - mapping-only

parameters | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|
(24, 20, 4, 11) | 87.5395 | 91.8639 | +0.0401 | +0.0623 | 99.720 | 99.717 | -0.025 | -0.025 | pareto |
(24, 20, 3, 12) | 87.5548 | 91.8545 | +0.0554 | +0.0528 | 99.732 | 99.730 | -0.013 | -0.013 | pareto |
(24, 20, 4, 12) | 87.5510 | 91.8549 | +0.0516 | +0.0533 | 99.721 | 99.718 | -0.024 | -0.024 | pareto |
(24, 20, 4, 10) | 87.4988 | 91.8656 | -0.0006 | +0.0640 | 99.719 | 99.716 | -0.026 | -0.027 | pareto |
(24, 20, 3, 10) | 87.5058 | 91.8626 | +0.0065 | +0.0610 | 99.730 | 99.729 | -0.014 | -0.014 | |
(24, 20, 3, 11) | 87.5212 | 91.8571 | +0.0218 | +0.0555 | 99.732 | 99.730 | -0.013 | -0.013 | |
(24, 20, 3, 13) | 87.5738 | 91.8424 | +0.0744 | +0.0408 | 99.733 | 99.731 | -0.012 | -0.012 | pareto |
(23, 19, 3, 12) | 87.5488 | 91.8472 | +0.0494 | +0.0455 | 99.737 | 99.735 | -0.008 | -0.008 | |
(23, 19, 3, 10) | 87.4930 | 91.8580 | -0.0064 | +0.0563 | 99.736 | 99.734 | -0.009 | -0.009 | |
... |
(22, 18, 2, 12) | 87.4994 | 91.8016 | +0.0000 | +0.0000 | 99.745 | 99.743 | +0.000 | +0.000 | ***** |

## Read length 200: Weighted SE/PE results - with extension alignment

parameters | sacc_se | sacc_pe | diff_se | diff_pe | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(24, 20, 4, 12) | 98.4782 | 99.3803 | +0.0502 | +0.0442 | 89.4550 | 93.1223 | +0.0041 | +0.0295 | 99.721 | 99.750 | -0.024 | +0.000 | pareto |
(24, 20, 3, 12) | 98.4726 | 99.3639 | +0.0446 | +0.0279 | 89.4582 | 93.1198 | +0.0074 | +0.0270 | 99.732 | 99.750 | -0.013 | +0.000 | pareto |
(24, 20, 4, 11) | 98.4497 | 99.3765 | +0.0216 | +0.0404 | 89.4347 | 93.1223 | -0.0161 | +0.0295 | 99.720 | 99.750 | -0.025 | +0.000 | |
(24, 20, 3, 13) | 98.4901 | 99.3666 | +0.0620 | +0.0306 | 89.4578 | 93.1104 | +0.0070 | +0.0176 | 99.733 | 99.750 | -0.012 | +0.000 | |
(24, 20, 4, 10) | 98.4217 | 99.3768 | -0.0063 | +0.0407 | 89.4191 | 93.1175 | -0.0318 | +0.0246 | 99.719 | 99.750 | -0.026 | +0.000 | |
(22, 18, 2, 12) | 98.4280 | 99.3361 | +0.0000 | -0.0000 | 89.4508 | 93.0928 | +0.0000 | +0.0000 | 99.745 | 99.750 | +0.000 | +0.000 | ***** |

## Read length 300: Weighted SE/PE results - mapping-only

parameters | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|
(24, 20, 6, 13) | 91.0242 | 94.7012 | +0.2182 | +0.0960 | 99.692 | 99.690 | -0.001 | -0.001 | pareto |
(24, 20, 7, 13) | 91.0230 | 94.6993 | +0.2171 | +0.0940 | 99.691 | 99.690 | -0.001 | -0.001 | |
(24, 20, 8, 13) | 90.9953 | 94.7001 | +0.1893 | +0.0948 | 99.690 | 99.689 | -0.002 | -0.002 | |
(24, 20, 5, 13) | 91.0323 | 94.6862 | +0.2264 | +0.0809 | 99.692 | 99.691 | -0.000 | -0.000 | pareto |
(23, 19, 6, 13) | 91.0143 | 94.6817 | +0.2084 | +0.0764 | 99.692 | 99.690 | -0.000 | -0.000 | |
(24, 20, 6, 12) | 90.9737 | 94.6912 | +0.1678 | +0.0859 | 99.691 | 99.690 | -0.001 | -0.001 | |
... |
(22, 18, 2, 12) | 90.8059 | 94.6053 | +0.0000 | +0.0000 | 99.692 | 99.691 | +0.000 | +0.000 | ***** |

## Read length 300: Weighted SE/PE results - with extension alignment

parameters | sacc_se | sacc_pe | diff_se | diff_pe | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(24, 20, 5, 13) | 98.7497 | 99.4555 | +0.0807 | +0.0234 | 92.4811 | 95.6001 | +0.0438 | +0.0315 | 99.692 | 99.691 | -0.000 | +0.000 | pareto |
(24, 20, 6, 13) | 98.7494 | 99.4608 | +0.0804 | +0.0287 | 92.4793 | 95.6004 | +0.0419 | +0.0319 | 99.692 | 99.691 | -0.001 | -0.000 | pareto |
(22, 18, 2, 12) | 98.6690 | 99.4322 | +0.0000 | +0.0000 | 92.4373 | 95.5686 | -0.0000 | +0.0000 | 99.692 | 99.691 | +0.000 | +0.000 | ***** |

## Read length 500: Weighted SE/PE results - mapping-only

parameters | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|
(25, 19, 8, 13) | 93.5670 | 95.5009 | +0.2708 | +0.1493 | 99.578 | 99.574 | -0.000 | -0.000 | pareto |
(25, 19, 7, 13) | 93.5699 | 95.4957 | +0.2737 | +0.1441 | 99.578 | 99.574 | -0.000 | -0.000 | pareto |
(25, 19, 6, 13) | 93.5695 | 95.4906 | +0.2733 | +0.1390 | 99.578 | 99.574 | -0.000 | +0.000 | |
(25, 19, 7, 12) | 93.5153 | 95.4898 | +0.2191 | +0.1382 | 99.578 | 99.574 | -0.000 | -0.000 | |
... |
(23, 17, 2, 12) | 93.2962 | 95.3516 | +0.0000 | +0.0000 | 99.578 | 99.574 | +0.000 | +0.000 | ***** |

## Read length 500: Weighted SE/PE results - with extension alignment

parameters | sacc_se | sacc_pe | diff_se | diff_pe | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(25, 19, 8, 13) | 98.9691 | 99.3465 | +0.0594 | +0.0285 | 94.7018 | 96.0529 | +0.0562 | +0.0347 | 99.578 | 99.574 | -0.000 | +0.000 | pareto |
(25, 19, 7, 13) | 98.9757 | 99.3490 | +0.0660 | +0.0310 | 94.6976 | 96.0488 | +0.0520 | +0.0306 | 99.578 | 99.574 | -0.000 | -0.000 | |
(23, 17, 2, 12) | 98.9097 | 99.3180 | +0.0000 | +0.0000 | 94.6456 | 96.0182 | +0.0000 | +0.0000 | 99.578 | 99.574 | +0.000 | +0.000 | ***** |

# Multi-context seeds (c4a7f61)

## Read length 50: Weighted SE/PE results - mapping-only

parameters | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|
(16, 12, -2, 0) | 63.2349 | 74.9216 | +0.3466 | +0.6877 | 97.123 | 97.112 | +0.146 | +0.136 | pareto |
(16, 12, -2, 1) | 63.2007 | 74.8490 | +0.3124 | +0.6150 | 97.252 | 97.244 | +0.275 | +0.268 | |
(16, 12, -2, 2) | 63.1981 | 74.8449 | +0.3097 | +0.6110 | 97.254 | 97.246 | +0.277 | +0.270 | |
(17, 13, -2, 0) | 63.1308 | 74.6860 | +0.2425 | +0.4520 | 96.977 | 96.969 | -0.000 | -0.007 | |
(17, 13, -2, 1) | 63.0462 | 74.5605 | +0.1578 | +0.3265 | 97.151 | 97.143 | +0.174 | +0.168 | |
(17, 13, -2, 2) | 63.0538 | 74.5557 | +0.1654 | +0.3218 | 97.158 | 97.150 | +0.181 | +0.174 | |
(18, 14, -2, 0) | 62.9883 | 74.4344 | +0.0999 | +0.2004 | 96.756 | 96.754 | -0.220 | -0.222 | |
(18, 14, -2, 1) | 62.8884 | 74.2339 | +0.0000 | +0.0000 | 96.977 | 96.976 | +0.000 | +0.000 | ***** |
(18, 14, -2, 2) | 62.8892 | 74.2215 | +0.0008 | -0.0124 | 96.992 | 96.992 | +0.015 | +0.016 | |

## Read length 50: Weighted SE/PE results - with extension alignment

parameters | sacc_se | sacc_pe | diff_se | diff_pe | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(16, 12, -2, 0) | 93.1696 | 97.3440 | +0.0728 | -0.0574 | 66.3488 | 80.9302 | +0.5248 | +0.1651 | 97.123 | 99.487 | +0.146 | -0.077 | pareto |
(18, 14, -2, 1) | 93.0968 | 97.4014 | +0.0000 | +0.0000 | 65.8240 | 80.7651 | +0.0000 | +0.0000 | 96.977 | 99.565 | +0.000 | +0.000 | ***** |

## Read length 75: Weighted SE/PE results - mapping-only

parameters | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|
(21, 17, -3, -1) | 71.7299 | 82.5870 | +0.0342 | +0.0823 | 98.887 | 98.888 | -0.258 | -0.260 | pareto |
(20, 16, -3, 0) | 71.6998 | 82.5136 | +0.0042 | +0.0089 | 99.119 | 99.123 | -0.025 | -0.025 | |
(20, 16, -3, 1) | 71.6971 | 82.5044 | +0.0015 | -0.0003 | 99.144 | 99.148 | -0.000 | -0.000 | |
(20, 16, -3, 2) | 71.6957 | 82.5047 | +0.0000 | -0.0000 | 99.145 | 99.149 | +0.000 | +0.000 | ***** |
(20, 16, -3, 3) | 71.6957 | 82.5047 | +0.0000 | -0.0000 | 99.145 | 99.149 | +0.000 | +0.000 | |

## Read length 75: Weighted SE/PE results - with extension alignment

parameters | sacc_se | sacc_pe | diff_se | diff_pe | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(21, 17, -3, -1) | 95.7962 | 98.4268 | -0.1689 | -0.0038 | 74.9228 | 86.7618 | -0.0016 | +0.0307 | 98.887 | 99.756 | -0.258 | -0.036 | pareto |
(20, 16, -3, 2) | 95.9650 | 98.4305 | +0.0000 | +0.0000 | 74.9245 | 86.7311 | +0.0000 | +0.0000 | 99.145 | 99.792 | +0.000 | -0.000 | ***** pareto |

## Read length 100: Weighted SE/PE results - mapping-only

parameters | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|
(18, 14, 1, 3) | 77.4674 | 86.7038 | +0.6795 | +0.4186 | 99.235 | 99.238 | -0.367 | -0.363 | pareto |
(17, 13, 1, 3) | 77.4442 | 86.6853 | +0.6563 | +0.4001 | 99.266 | 99.267 | -0.335 | -0.335 | |
(19, 15, 1, 3) | 77.4468 | 86.6761 | +0.6589 | +0.3909 | 99.221 | 99.220 | -0.380 | -0.382 | |
(16, 12, 1, 3) | 77.3820 | 86.6520 | +0.5941 | +0.3668 | 99.289 | 99.288 | -0.312 | -0.314 | |
(20, 16, 0, 3) | 77.4297 | 86.6071 | +0.6418 | +0.3219 | 99.199 | 99.195 | -0.402 | -0.407 | |
(20, 16, 0, 2) | 77.4049 | 86.6042 | +0.6170 | +0.3190 | 99.164 | 99.159 | -0.437 | -0.442 | |
(19, 15, 0, 3) | 77.0658 | 86.4753 | +0.2779 | +0.1900 | 99.471 | 99.472 | -0.131 | -0.129 | |
(21, 17, -1, 1) | 77.0216 | 86.4653 | +0.2337 | +0.1801 | 99.312 | 99.308 | -0.289 | -0.293 | |
(20, 16, -1, 3) | 77.1043 | 86.4446 | +0.3164 | +0.1594 | 99.451 | 99.452 | -0.151 | -0.149 | |
(20, 16, -1, 1) | 76.9934 | 86.4722 | +0.2056 | +0.1870 | 99.345 | 99.349 | -0.256 | -0.253 | |
(20, 16, -1, 2) | 77.0752 | 86.4512 | +0.2874 | +0.1660 | 99.431 | 99.434 | -0.171 | -0.168 | |
(18, 14, 0, 3) | 77.0269 | 86.4572 | +0.2390 | +0.1720 | 99.484 | 99.484 | -0.118 | -0.118 | |
(19, 15, 0, 2) | 76.9685 | 86.4657 | +0.1807 | +0.1805 | 99.389 | 99.387 | -0.213 | -0.215 | |
(18, 14, 0, 2) | 76.9113 | 86.4563 | +0.1234 | +0.1711 | 99.393 | 99.394 | -0.208 | -0.208 | |
(17, 13, 0, 3) | 76.9784 | 86.4225 | +0.1905 | +0.1373 | 99.495 | 99.497 | -0.107 | -0.105 | |
(17, 13, 0, 2) | 76.8298 | 86.4074 | +0.0419 | +0.1222 | 99.402 | 99.404 | -0.199 | -0.198 | |
(16, 12, 0, 3) | 76.9166 | 86.3725 | +0.1287 | +0.0873 | 99.525 | 99.525 | -0.076 | -0.076 | |
(22, 18, -2, 3) | 76.8749 | 86.3023 | +0.0871 | +0.0171 | 99.561 | 99.563 | -0.040 | -0.039 | |
(22, 18, -2, 1) | 76.7769 | 86.3194 | -0.0110 | +0.0342 | 99.512 | 99.512 | -0.089 | -0.090 | |
(22, 18, -2, 2) | 76.8542 | 86.2984 | +0.0663 | +0.0132 | 99.553 | 99.554 | -0.048 | -0.048 | |
(21, 17, -2, 2) | 76.8275 | 86.3046 | +0.0397 | +0.0194 | 99.580 | 99.579 | -0.022 | -0.023 | |
(21, 17, -2, 3) | 76.8509 | 86.2957 | +0.0631 | +0.0105 | 99.588 | 99.587 | -0.013 | -0.014 | |
(21, 17, -2, 1) | 76.7558 | 86.3089 | -0.0320 | +0.0237 | 99.536 | 99.534 | -0.065 | -0.068 | |
(20, 16, -2, 3) | 76.8415 | 86.2799 | +0.0536 | -0.0053 | 99.611 | 99.611 | +0.010 | +0.009 | |
(20, 16, -2, 2) | 76.7879 | 86.2852 | +0.0000 | +0.0000 | 99.601 | 99.602 | +0.000 | +0.000 | ***** |

## Read length 100: Weighted SE/PE results - with extension alignment

parameters | sacc_se | sacc_pe | diff_se | diff_pe | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(18, 14, 1, 3) | 96.9435 | 98.9435 | +0.0754 | +0.1069 | 80.2780 | 89.7660 | +0.2204 | +0.1587 | 99.235 | 99.773 | -0.367 | -0.033 | pareto |
(20, 16, -2, 2) | 96.8680 | 98.8366 | +0.0000 | +0.0000 | 80.0575 | 89.6073 | +0.0000 | +0.0000 | 99.601 | 99.806 | +0.000 | +0.000 | ***** |

## Read length 150: Weighted SE/PE results - mapping-only

parameters | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|
(23, 19, 3, 6) | 83.9188 | 90.4801 | +0.1618 | +0.1330 | 99.690 | 99.689 | -0.072 | -0.072 | pareto |
(23, 19, 3, 5) | 83.8571 | 90.4898 | +0.1000 | +0.1427 | 99.653 | 99.653 | -0.109 | -0.109 | pareto |
(22, 18, 3, 5) | 83.8277 | 90.4916 | +0.0707 | +0.1445 | 99.662 | 99.664 | -0.099 | -0.098 | pareto |
(22, 18, 3, 6) | 83.8811 | 90.4733 | +0.1240 | +0.1263 | 99.700 | 99.700 | -0.062 | -0.061 | |
(23, 19, 2, 5) | 83.8464 | 90.4809 | +0.0894 | +0.1338 | 99.710 | 99.709 | -0.052 | -0.053 | |
(23, 19, 3, 7) | 83.9240 | 90.4585 | +0.1670 | +0.1115 | 99.710 | 99.709 | -0.052 | -0.053 | pareto |
(21, 17, 3, 5) | 83.8062 | 90.4857 | +0.0491 | +0.1387 | 99.669 | 99.669 | -0.093 | -0.092 | |
(23, 19, 2, 6) | 83.8871 | 90.4654 | +0.1301 | +0.1184 | 99.725 | 99.725 | -0.037 | -0.037 | |
(23, 19, 2, 7) | 83.9296 | 90.4526 | +0.1725 | +0.1055 | 99.734 | 99.734 | -0.028 | -0.028 | pareto |
(22, 18, 2, 5) | 83.8371 | 90.4757 | +0.0800 | +0.1286 | 99.714 | 99.712 | -0.048 | -0.049 | |
(21, 17, 2, 6) | 83.8632 | 90.4686 | +0.1062 | +0.1215 | 99.734 | 99.734 | -0.027 | -0.028 | |
(22, 18, 2, 6) | 83.8924 | 90.4575 | +0.1354 | +0.1104 | 99.731 | 99.730 | -0.031 | -0.032 | |
(21, 17, 3, 6) | 83.8703 | 90.4606 | +0.1133 | +0.1135 | 99.711 | 99.709 | -0.051 | -0.052 | |
(21, 17, 3, 7) | 83.8881 | 90.4484 | +0.1311 | +0.1014 | 99.726 | 99.725 | -0.036 | -0.037 | |
(22, 18, 2, 7) | 83.9069 | 90.4414 | +0.1498 | +0.0943 | 99.740 | 99.740 | -0.022 | -0.022 | |
(20, 16, 3, 6) | 83.8398 | 90.4549 | +0.0828 | +0.1079 | 99.717 | 99.716 | -0.045 | -0.046 | |
(22, 18, 3, 7) | 83.9017 | 90.4377 | +0.1446 | +0.0907 | 99.720 | 99.720 | -0.042 | -0.042 | |
(23, 19, 2, 8) | 83.9332 | 90.4283 | +0.1761 | +0.0813 | 99.740 | 99.739 | -0.022 | -0.022 | pareto |
(23, 19, 3, 8) | 83.9324 | 90.4264 | +0.1754 | +0.0793 | 99.721 | 99.720 | -0.041 | -0.042 | |
... |
(20, 16, 1, 7) | 83.7570 | 90.3470 | +0.0000 | +0.0000 | 99.762 | 99.762 | -0.000 | +0.000 | ***** |

## Read length 150: Weighted SE/PE results - with extension alignment

parameters | sacc_se | sacc_pe | diff_se | diff_pe | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(22, 18, 3, 5) | 97.9170 | 99.3012 | -0.0674 | +0.0578 | 86.2434 | 92.3854 | -0.0286 | +0.0711 | 99.662 | 99.779 | -0.099 | -0.002 | pareto |
(23, 19, 3, 5) | 97.9233 | 99.3057 | -0.0610 | +0.0623 | 86.2171 | 92.3813 | -0.0549 | +0.0670 | 99.653 | 99.779 | -0.109 | -0.002 | |
(23, 19, 3, 6) | 97.9852 | 99.3068 | +0.0008 | +0.0634 | 86.2630 | 92.3627 | -0.0091 | +0.0484 | 99.690 | 99.780 | -0.072 | -0.001 | pareto |
(23, 19, 3, 7) | 98.0339 | 99.3086 | +0.0496 | +0.0652 | 86.2616 | 92.3630 | -0.0105 | +0.0487 | 99.710 | 99.781 | -0.052 | -0.000 | pareto |
(23, 19, 2, 7) | 98.0662 | 99.2976 | +0.0819 | +0.0542 | 86.2846 | 92.3562 | +0.0126 | +0.0419 | 99.734 | 99.781 | -0.028 | -0.000 | pareto |
(23, 19, 2, 8) | 98.1032 | 99.2958 | +0.1189 | +0.0524 | 86.3059 | 92.3469 | +0.0339 | +0.0326 | 99.740 | 99.781 | -0.022 | -0.000 | pareto |
(20, 16, 1, 7) | 97.9843 | 99.2434 | +0.0000 | +0.0000 | 86.2720 | 92.3143 | +0.0000 | +0.0000 | 99.762 | 99.781 | -0.000 | +0.000 | ***** |

## Read length 200: Weighted SE/PE results - mapping-only

parameters | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|
(24, 20, 4, 12) | 87.4547 | 91.8541 | +0.0733 | +0.0944 | 99.747 | 99.745 | -0.003 | -0.003 | pareto |
(24, 20, 5, 10) | 87.4016 | 91.8602 | +0.0203 | +0.1006 | 99.741 | 99.739 | -0.009 | -0.009 | pareto |
(24, 20, 5, 11) | 87.4243 | 91.8544 | +0.0430 | +0.0947 | 99.743 | 99.741 | -0.007 | -0.007 | pareto |
(24, 20, 4, 11) | 87.4371 | 91.8494 | +0.0557 | +0.0897 | 99.746 | 99.744 | -0.004 | -0.004 | |
(24, 20, 4, 10) | 87.4239 | 91.8470 | +0.0426 | +0.0874 | 99.745 | 99.744 | -0.005 | -0.005 | |
(24, 20, 5, 12) | 87.4302 | 91.8424 | +0.0489 | +0.0827 | 99.744 | 99.742 | -0.006 | -0.006 | |
(24, 20, 4, 13) | 87.4852 | 91.8282 | +0.1038 | +0.0685 | 99.748 | 99.746 | -0.002 | -0.002 | pareto |
(24, 20, 5, 13) | 87.4570 | 91.8312 | +0.0757 | +0.0716 | 99.745 | 99.743 | -0.005 | -0.005 | pareto |
(24, 20, 3, 13) | 87.4898 | 91.8222 | +0.1084 | +0.0626 | 99.749 | 99.747 | -0.001 | -0.001 | pareto |
(24, 20, 3, 10) | 87.4065 | 91.8406 | +0.0252 | +0.0809 | 99.747 | 99.746 | -0.003 | -0.003 | |
... |
(22, 18, 2, 12) | 87.3813 | 91.7597 | +0.0000 | +0.0000 | 99.750 | 99.748 | +0.000 | +0.000 | ***** |

## Read length 200: Weighted SE/PE results - with extension alignment

parameters | sacc_se | sacc_pe | diff_se | diff_pe | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(24, 20, 4, 12) | 98.5003 | 99.3799 | +0.0720 | +0.0447 | 89.4840 | 93.1277 | +0.0491 | +0.0333 | 99.747 | 99.750 | -0.003 | +0.000 | pareto |
(24, 20, 5, 13) | 98.5112 | 99.3841 | +0.0828 | +0.0489 | 89.4901 | 93.1235 | +0.0551 | +0.0291 | 99.745 | 99.750 | -0.005 | +0.000 | pareto |
(24, 20, 5, 11) | 98.4640 | 99.3801 | +0.0356 | +0.0449 | 89.4556 | 93.1223 | +0.0206 | +0.0279 | 99.743 | 99.750 | -0.007 | +0.000 | |
(24, 20, 4, 13) | 98.5208 | 99.3765 | +0.0924 | +0.0413 | 89.4810 | 93.1150 | +0.0461 | +0.0207 | 99.748 | 99.750 | -0.002 | +0.000 | |
(24, 20, 5, 10) | 98.4288 | 99.3803 | +0.0004 | +0.0451 | 89.4413 | 93.1245 | +0.0064 | +0.0302 | 99.741 | 99.750 | -0.009 | +0.000 | |
(24, 20, 3, 13) | 98.5043 | 99.3654 | +0.0759 | +0.0302 | 89.4898 | 93.1114 | +0.0548 | +0.0170 | 99.749 | 99.750 | -0.001 | +0.000 | |
(22, 18, 2, 12) | 98.4284 | 99.3352 | +0.0000 | +0.0000 | 89.4350 | 93.0943 | +0.0000 | +0.0000 | 99.750 | 99.750 | +0.000 | +0.000 | ***** |

## Read length 300: Weighted SE/PE results - mapping-only

parameters | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|
(24, 20, 6, 13) | 90.7976 | 94.5880 | +0.1863 | +0.0741 | 99.692 | 99.691 | -0.000 | -0.000 | pareto |
(24, 20, 5, 13) | 90.8041 | 94.5827 | +0.1927 | +0.0688 | 99.692 | 99.691 | +0.000 | +0.000 | pareto |
(24, 20, 7, 13) | 90.7785 | 94.5875 | +0.1671 | +0.0735 | 99.692 | 99.691 | -0.000 | -0.000 | |
(24, 20, 6, 12) | 90.7392 | 94.5910 | +0.1279 | +0.0770 | 99.692 | 99.690 | -0.000 | -0.000 | pareto |
(24, 20, 4, 13) | 90.7948 | 94.5756 | +0.1835 | +0.0616 | 99.692 | 99.691 | +0.000 | +0.000 | |
(24, 20, 4, 12) | 90.7514 | 94.5846 | +0.1400 | +0.0707 | 99.692 | 99.691 | -0.000 | -0.000 | |
(24, 20, 5, 12) | 90.7636 | 94.5773 | +0.1523 | +0.0633 | 99.692 | 99.691 | +0.000 | -0.000 | |
(24, 20, 3, 13) | 90.7737 | 94.5729 | +0.1624 | +0.0590 | 99.692 | 99.691 | +0.000 | +0.000 | |
... |
(22, 18, 2, 12) | 90.6113 | 94.5139 | +0.0000 | +0.0000 | 99.692 | 99.691 | +0.000 | +0.000 | ***** |

## Read length 300: Weighted SE/PE results - with extension alignment

parameters | sacc_se | sacc_pe | diff_se | diff_pe | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(24, 20, 6, 13) | 98.7377 | 99.4604 | +0.0772 | +0.0287 | 92.4746 | 95.6028 | +0.0431 | +0.0364 | 99.692 | 99.691 | -0.000 | -0.000 | pareto |
(24, 20, 5, 13) | 98.7374 | 99.4547 | +0.0769 | +0.0229 | 92.4670 | 95.5950 | +0.0355 | +0.0286 | 99.692 | 99.691 | +0.000 | +0.000 | |
(24, 20, 6, 12) | 98.7202 | 99.4553 | +0.0596 | +0.0235 | 92.4535 | 95.5976 | +0.0220 | +0.0312 | 99.692 | 99.691 | -0.000 | -0.000 | |
(22, 18, 2, 12) | 98.6606 | 99.4318 | +0.0000 | +0.0000 | 92.4316 | 95.5664 | +0.0000 | +0.0000 | 99.692 | 99.691 | +0.000 | +0.000 | ***** |

## Read length 500: Weighted SE/PE results - mapping-only

parameters | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|
(25, 19, 7, 13) | 93.2384 | 95.3643 | +0.1550 | +0.1071 | 99.578 | 99.574 | -0.000 | -0.000 | pareto |
(25, 19, 6, 13) | 93.2549 | 95.3558 | +0.1714 | +0.0986 | 99.578 | 99.574 | -0.000 | +0.000 | pareto |
(25, 19, 5, 13) | 93.2581 | 95.3468 | +0.1747 | +0.0896 | 99.578 | 99.574 | +0.000 | +0.000 | pareto |
(25, 19, 7, 12) | 93.1965 | 95.3606 | +0.1131 | +0.1034 | 99.578 | 99.574 | -0.000 | -0.000 | |
(25, 19, 6, 12) | 93.2142 | 95.3496 | +0.1307 | +0.0924 | 99.578 | 99.574 | -0.000 | +0.000 | |
(25, 19, 4, 13) | 93.2463 | 95.3391 | +0.1628 | +0.0819 | 99.578 | 99.574 | +0.000 | +0.000 | |
... |
(23, 17, 2, 12) | 93.0834 | 95.2572 | +0.0000 | +0.0000 | 99.578 | 99.574 | +0.000 | +0.000 | ***** |

## Read length 500: Weighted SE/PE results - with extension alignment

parameters | sacc_se | sacc_pe | diff_se | diff_pe | acc_se | acc_pe | diff_se | diff_pe | mprt_se | mprt_pe | diff_se | diff_pe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
(25, 19, 7, 13) | 98.9583 | 99.3693 | +0.0592 | +0.0267 | 94.6867 | 96.0819 | +0.0416 | +0.0251 | 99.578 | 99.574 | -0.000 | +0.000 | pareto |
(25, 19, 5, 13) | 98.9473 | 99.3601 | +0.0482 | +0.0175 | 94.6765 | 96.0713 | +0.0313 | +0.0145 | 99.578 | 99.574 | +0.000 | +0.000 | |
(25, 19, 6, 13) | 98.9524 | 99.3623 | +0.0533 | +0.0197 | 94.6751 | 96.0669 | +0.0299 | +0.0101 | 99.578 | 99.574 | -0.000 | +0.000 | |
(23, 17, 2, 12) | 98.8991 | 99.3426 | +0.0000 | +0.0000 | 94.6456 | 96.0182 | +0.0000 | +0.0000 | 99.578 | 99.574 | +0.000 | +0.000 | ***** |
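The `pareto` marker in the results above can be read as flagging settings that no other listed setting beats on both SE and PE accuracy at once. That reading is an assumption on my part, not something the script output states. A minimal sketch of such a filter, using the v0.12.0 mapping-only numbers for read length 75 (the function name `pareto_front` is hypothetical):

```python
def pareto_front(results):
    """Return the parameter tuples that are not dominated.

    `results` maps a (k, s, l, u) tuple to (acc_se, acc_pe). A setting is
    dominated if another one is at least as good on both accuracies and
    strictly better on at least one of them.
    """
    front = []
    for params, (se, pe) in results.items():
        dominated = any(
            o_se >= se and o_pe >= pe and (o_se > se or o_pe > pe)
            for other, (o_se, o_pe) in results.items()
            if other != params
        )
        if not dominated:
            front.append(params)
    return front


# acc_se, acc_pe for v0.12.0, read length 75, mapping-only
read_length_75 = {
    (20, 16, -3, -1): (71.5283, 82.4506),
    (21, 17, -3, -1): (71.5770, 82.4207),
    (20, 16, -3, 0): (71.5360, 82.3733),
    (20, 16, -3, 2): (71.5373, 82.3463),
    (20, 16, -3, 3): (71.5373, 82.3463),
    (20, 16, -3, 1): (71.5378, 82.3412),
}
print(pareto_front(read_length_75))
```

For this table, the sketch selects (20, 16, -3, -1) and (21, 17, -3, -1), which are the two rows carrying the `pareto` marker.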
Great, I think we could go with your suggested parameter changes in this issue for a benchmark between current hashing and multi-context hashing.
It is interesting that many of the read lengths have the same parameter combination; I am not sure if this is a sign of something bad (e.g., overfitting the design to data, underutilization of partial hits, or underevaluation). Regardless, I think it serves its purpose for now. We are thinking about asymmetrical seeds, which are more important now and may alter things slightly.
(Note: in further evaluations, we should probably log how many times we successfully used a 'partial hit' rather than the full hit in the new hashing scheme. Here, 'successfully' is a bit vague and could have several meanings, such as simply finding a partial hit, or the partial hit being used to build a higher-scoring NAM or pair of NAMs.)
I have added two branches to the repository, each with a single new commit that switches to the optimized parameters:
- v0.12.0-optimized-parameters is on top of v0.12.0.
- mcs-optimized-parameters is on top of Ivan’s multi-context-seeds branch.
For completeness, I picked (20, 16, 1, 4) for canonical read length 125 for both branches, but this should not be relevant as the test datasets don’t include that read length.
I also noticed that v0.12.0 still has canonical read length 300, so I left it that way and did not use the interpolated parameters as I had originally suggested.
It would be possible to apply these changes on top of v0.13.0, but since I benchmarked v0.12.0 and there have been very few changes since then that affect accuracy, I suggest we stick to v0.12.0.
I have started a benchmark of the two commits.
> For completeness, I picked (20, 16, 1, 4) for canonical read length 125 for both branches, but this should not be relevant as the test datasets don’t include that read length.

The evaluation does include read length 125. The full set of read lengths is ["50", "75", "91", "100", "111", "125", "136", "150", "176", "200", "250", "300", "500"], chosen to test the 'worst case' for some of the parameter ranges.
> It would be possible to apply these changes on top of v0.13.0, but since I benchmarked v0.12.0 and there have been very few changes since then that affect accuracy, I suggest we stick to v0.12.0.

It's great to compare these two commits as a checkpoint to see where we are. However, I am afraid this might not be the last benchmark I do between the two seeding variants. The larger goal before an eventual merge of mcs would be to get rid of the redundant NAMs that cause redundant extension calls (particularly visible in the mcs branch). Ivan is now exploring the asymmetrical version of mcs and checking whether my comment at https://github.com/ksahlin/strobealign/pull/405#issuecomment-2001902240 holds. If my guess is correct, it would be nice to benchmark two asymmetrical versions against each other.
Evaluation is ready (see attached plots). All results are for PE alignment, symmetric seeds. Main points:
Accuracy
- Extension-based accuracy is nearly identical between the two seeds.
- Mapping-only accuracy is slightly better for mcs for short reads (see particularly drosophila and CHM13), and slightly worse for longer reads. Notable here is the dip at read length 111 for our current seeds. Another notable issue is that mcs are strictly worse for longer seeds. I do not expect (or accept) this.
Percent mapped
- mcs beats current seeds in almost all cases and with quite a big margin, which is nice to see.
Runtime
- It seems mcs are faster more often than not for short reads - nice! Possibly because of fewer rescue extensions.
- mcs are consistently quite substantially slower than current seeds for the longest reads. Ivan and I believe that this is because more mapping sites are tried with extension due to more matches (coming from partial matches). If 'chaining'/scoring of NAMs is implemented well, I do not see a reason for accepting this. Using asymmetric seeds would lead to better NAM merging, hence scoring, and would take care of this (according to @marcelm's analysis).
Overall:
- mcs offer some clear advantages in mapping (in mapped percentage, accuracy, and time) for short reads, but are currently slightly stifled by NAM scoring/chaining, leading to lower accuracy and slower runtime on longer reads. It will be interesting to see whether this can be solved with asymmetric seeds. If this last issue is ironed out, I think we have a strong case for using mcs as the new strategy.
- Evaluation does not include SE alignment, but all evidence points to mcs being even better (relatively) on SE data.
@Itolstoganov
accuracy_plot_cut_at_80.pdf percentage_aligned_plot.pdf time_plot.pdf