Hello, thanks for the work on developing this tool! I was wondering the potential issues which could arise from using a pre-built reference for different read length mapping (say mapping a avg 250 on r150 index) for a large library? I do not want to rebuild the index each time to increase speed, however keeping all the indexes on storage for each different size is somewhat inconvenient due to the reference library size.

Due to this, I wanted to potentially use a single reference group (r150) for the entire library , if it doesn't decrease accuracy too much.

Apr 08 '25 21:04 Rridley7

Hi @Rridley7 ,

The seeds at the mapping step must match the indexing step in order to get reasonable mapping results. So if your index was created with -r 150, then also specify -r 150 at the mapping step.

As for the accuracy decrease of 'forcing' r150 seeds from, say, 250nt reads, we haven't benchmarked how much accuracy/speed one lose (or gain). I think using a different seed to what has been optimized for has more impact on very short reads (say <75nt).

If I had to guess for your particular case (150 instead of 250), you may lose some percentage in speed (perhaps 5-15%?). As for accuracy, 150 uses shorter seeds than 250. Shorter seeds are more sensitive, but also more repetitive. So it is unclear whether accuracy would go up or down.

Apr 09 '25 09:04 ksahlin

I ran a small experiment on our fruit fly test and CHM13 datasets where I used different -r settings to map reads of different actual lengths.

The current stable release 0.15.0 shows quite different behavior than the development version, so I’m reporting both results.

The summary is that the accuracy is still ok if you use a setting for -r that is shorter than the actual read length. (In line with what Kristoffer said.)

Mapping accuracy for v0.15.0

fruit fly

actual read length	`-r 50`	`-r 100`	`-r 150`	`-r 250`
50	77.5834	66.8407	18.3421	2.5468
100	87.607	86.0051	82.3383	77.44
150	89.9249	89.5392	88.4716	87.3284
200	90.9968	90.9468	90.6826	90.2266
300	92.1635	92.1796	92.1974	92.2093

CHM13

actual read length	`-r 50`	`-r 100`	`-r 150`	`-r 250`
50	76.3266	66.2691	18.2674	2.5353
100	88.3019	86.9611	83.5238	78.5511
150	90.8824	90.6501	89.6729	88.5661
200	91.8716	91.9708	91.8055	91.356
300	92.7614	92.8975	93.0066	92.9967

Mapping accuracy for main (4fafa73)

Fruit fly

actual read length	`-r 50`	`-r 100`	`-r 150`	`-r 250`
50	83.9064	82.9692	61.5121	18.616
100	88.2011	88.0476	87.7574	87.3018
150	90.0482	90.0127	89.9331	89.823
200	90.9765	90.9996	91.0171	90.9722
300	92.1399	92.1811	92.2672	92.2523

CHM13

actual read length	`-r 50`	`-r 100`	`-r 150`	`-r 250`
50	81.7377	80.3887	56.2612	16.966
100	88.7917	88.7451	88.1207	87.119
150	90.9775	91.0592	90.9505	90.7392
200	91.8551	92.0196	92.0511	91.9847
300	92.7327	92.9054	93.017	93.0309

I think the main observation here, which confirms what Kristoffer said, is that you lose very little accuracy when you choose an -r setting that is shorter than the actual read length. Especially if you want to also map short reads (50 nt), then you should not choose an -r setting that is much larger than the actual read length.

And of course overall accuracy is better for short reads in main, so maybe we should make a release soon.

Apr 09 '25 10:04 marcelm

Nice that you got an analysis on this done so quickly.

Yes, I am fine with making a new release as soon as possible.

@Rridley7 Note that above numbers are in single-end mapping only mode (PAF file), the extension alignment mode (SAM file) which is the default mode, will not yield as drastic differences, although there will still be improvements in latest main.

Apr 09 '25 14:04 ksahlin

Potential issues in using same reference length for large libraries of variable read length?

Mapping accuracy for v0.15.0

fruit fly

CHM13

Mapping accuracy for main (4fafa73)

Fruit fly

CHM13