Potential issues in using same reference length for large libraries of variable read length?
Hello, thanks for the work on developing this tool! I was wondering the potential issues which could arise from using a pre-built reference for different read length mapping (say mapping a avg 250 on r150 index) for a large library? I do not want to rebuild the index each time to increase speed, however keeping all the indexes on storage for each different size is somewhat inconvenient due to the reference library size.
Due to this, I wanted to potentially use a single reference group (r150) for the entire library , if it doesn't decrease accuracy too much.
Hi @Rridley7 ,
The seeds at the mapping step must match the indexing step in order to get reasonable mapping results. So if your index was created with -r 150, then also specify -r 150 at the mapping step.
As for the accuracy decrease of 'forcing' r150 seeds from, say, 250nt reads, we haven't benchmarked how much accuracy/speed one lose (or gain). I think using a different seed to what has been optimized for has more impact on very short reads (say <75nt).
If I had to guess for your particular case (150 instead of 250), you may lose some percentage in speed (perhaps 5-15%?). As for accuracy, 150 uses shorter seeds than 250. Shorter seeds are more sensitive, but also more repetitive. So it is unclear whether accuracy would go up or down.
I ran a small experiment on our fruit fly test and CHM13 datasets where I used different -r settings to map reads of different actual lengths.
The current stable release 0.15.0 shows quite different behavior than the development version, so I’m reporting both results.
The summary is that the accuracy is still ok if you use a setting for -r that is shorter than the actual read length. (In line with what Kristoffer said.)
Mapping accuracy for v0.15.0
fruit fly
| actual read length | -r 50 |
-r 100 |
-r 150 |
-r 250 |
|---|---|---|---|---|
| 50 | 77.5834 | 66.8407 | 18.3421 | 2.5468 |
| 100 | 87.607 | 86.0051 | 82.3383 | 77.44 |
| 150 | 89.9249 | 89.5392 | 88.4716 | 87.3284 |
| 200 | 90.9968 | 90.9468 | 90.6826 | 90.2266 |
| 300 | 92.1635 | 92.1796 | 92.1974 | 92.2093 |
CHM13
| actual read length | -r 50 |
-r 100 |
-r 150 |
-r 250 |
|---|---|---|---|---|
| 50 | 76.3266 | 66.2691 | 18.2674 | 2.5353 |
| 100 | 88.3019 | 86.9611 | 83.5238 | 78.5511 |
| 150 | 90.8824 | 90.6501 | 89.6729 | 88.5661 |
| 200 | 91.8716 | 91.9708 | 91.8055 | 91.356 |
| 300 | 92.7614 | 92.8975 | 93.0066 | 92.9967 |
Mapping accuracy for main (4fafa73)
Fruit fly
| actual read length | -r 50 |
-r 100 |
-r 150 |
-r 250 |
|---|---|---|---|---|
| 50 | 83.9064 | 82.9692 | 61.5121 | 18.616 |
| 100 | 88.2011 | 88.0476 | 87.7574 | 87.3018 |
| 150 | 90.0482 | 90.0127 | 89.9331 | 89.823 |
| 200 | 90.9765 | 90.9996 | 91.0171 | 90.9722 |
| 300 | 92.1399 | 92.1811 | 92.2672 | 92.2523 |
CHM13
| actual read length | -r 50 |
-r 100 |
-r 150 |
-r 250 |
|---|---|---|---|---|
| 50 | 81.7377 | 80.3887 | 56.2612 | 16.966 |
| 100 | 88.7917 | 88.7451 | 88.1207 | 87.119 |
| 150 | 90.9775 | 91.0592 | 90.9505 | 90.7392 |
| 200 | 91.8551 | 92.0196 | 92.0511 | 91.9847 |
| 300 | 92.7327 | 92.9054 | 93.017 | 93.0309 |
I think the main observation here, which confirms what Kristoffer said, is that you lose very little accuracy when you choose an -r setting that is shorter than the actual read length. Especially if you want to also map short reads (50 nt), then you should not choose an -r setting that is much larger than the actual read length.
And of course overall accuracy is better for short reads in main, so maybe we should make a release soon.
Nice that you got an analysis on this done so quickly.
Yes, I am fine with making a new release as soon as possible.
@Rridley7 Note that above numbers are in single-end mapping only mode (PAF file), the extension alignment mode (SAM file) which is the default mode, will not yield as drastic differences, although there will still be improvements in latest main.