strobealign icon indicating copy to clipboard operation
strobealign copied to clipboard

Potential issues in using same reference length for large libraries of variable read length?

Open Rridley7 opened this issue 9 months ago • 3 comments

Hello, thanks for the work on developing this tool! I was wondering the potential issues which could arise from using a pre-built reference for different read length mapping (say mapping a avg 250 on r150 index) for a large library? I do not want to rebuild the index each time to increase speed, however keeping all the indexes on storage for each different size is somewhat inconvenient due to the reference library size.

Due to this, I wanted to potentially use a single reference group (r150) for the entire library , if it doesn't decrease accuracy too much.

Rridley7 avatar Apr 08 '25 21:04 Rridley7

Hi @Rridley7 ,

The seeds at the mapping step must match the indexing step in order to get reasonable mapping results. So if your index was created with -r 150, then also specify -r 150 at the mapping step.

As for the accuracy decrease of 'forcing' r150 seeds from, say, 250nt reads, we haven't benchmarked how much accuracy/speed one lose (or gain). I think using a different seed to what has been optimized for has more impact on very short reads (say <75nt).

If I had to guess for your particular case (150 instead of 250), you may lose some percentage in speed (perhaps 5-15%?). As for accuracy, 150 uses shorter seeds than 250. Shorter seeds are more sensitive, but also more repetitive. So it is unclear whether accuracy would go up or down.

ksahlin avatar Apr 09 '25 09:04 ksahlin

I ran a small experiment on our fruit fly test and CHM13 datasets where I used different -r settings to map reads of different actual lengths.

The current stable release 0.15.0 shows quite different behavior than the development version, so I’m reporting both results.

The summary is that the accuracy is still ok if you use a setting for -r that is shorter than the actual read length. (In line with what Kristoffer said.)

Mapping accuracy for v0.15.0

fruit fly

actual read length -r 50 -r 100 -r 150 -r 250
50 77.5834 66.8407 18.3421 2.5468
100 87.607 86.0051 82.3383 77.44
150 89.9249 89.5392 88.4716 87.3284
200 90.9968 90.9468 90.6826 90.2266
300 92.1635 92.1796 92.1974 92.2093

CHM13

actual read length -r 50 -r 100 -r 150 -r 250
50 76.3266 66.2691 18.2674 2.5353
100 88.3019 86.9611 83.5238 78.5511
150 90.8824 90.6501 89.6729 88.5661
200 91.8716 91.9708 91.8055 91.356
300 92.7614 92.8975 93.0066 92.9967

Mapping accuracy for main (4fafa73)

Fruit fly

actual read length -r 50 -r 100 -r 150 -r 250
50 83.9064 82.9692 61.5121 18.616
100 88.2011 88.0476 87.7574 87.3018
150 90.0482 90.0127 89.9331 89.823
200 90.9765 90.9996 91.0171 90.9722
300 92.1399 92.1811 92.2672 92.2523

CHM13

actual read length -r 50 -r 100 -r 150 -r 250
50 81.7377 80.3887 56.2612 16.966
100 88.7917 88.7451 88.1207 87.119
150 90.9775 91.0592 90.9505 90.7392
200 91.8551 92.0196 92.0511 91.9847
300 92.7327 92.9054 93.017 93.0309

I think the main observation here, which confirms what Kristoffer said, is that you lose very little accuracy when you choose an -r setting that is shorter than the actual read length. Especially if you want to also map short reads (50 nt), then you should not choose an -r setting that is much larger than the actual read length.

And of course overall accuracy is better for short reads in main, so maybe we should make a release soon.

marcelm avatar Apr 09 '25 10:04 marcelm

Nice that you got an analysis on this done so quickly.

Yes, I am fine with making a new release as soon as possible.

@Rridley7 Note that above numbers are in single-end mapping only mode (PAF file), the extension alignment mode (SAM file) which is the default mode, will not yield as drastic differences, although there will still be improvements in latest main.

ksahlin avatar Apr 09 '25 14:04 ksahlin