strobealign
strobealign copied to clipboard
All secondary alignments lost
In the following data: data.tar.gz
there is exactly one single-end short read in input.fasta and two sets of haplotypes, full.fa.gz and subset.fa.gz.
Running Strobealign v0.13 with the subset of haplotypes:
strobealign --no-progress -M 1000 -N 1000 -S 0.8 -f 0 -k 15 --eqx -r 150 -t 1 subset.fa.gz input.fasta > subset.sam
produces 90 alignments (one alignment per each haplotype). However, when running with a superset of haplotypes
strobealign --no-progress -M 1000 -N 1000 -S 0.8 -f 0 -k 15 --eqx -r 150 -t 1 full.fa.gz input.fasta > full.sam
only produces two alignments. Changing k-mer size to 13 solves the issue in this particular case but I was wondering if there is a bug in the code that leads to prematurely discarding some alignments.
If you need, I can send more reads from the same locus, that produce the same bug (only one/two alignments are found instead of 200+).
Thank you in advance, Timofey
Hi @tprodanov ,
It seems -k 13 does not solve this issue for me. Instead, changing -S to 0.55 or lower fixes the problem. For example, default parameter value (0.5) works, but 0.6 higher has the bug.
So we need to look into the -S parameter in more detail.
Thanks for reporting!
Best, Kristoffer
I only have access to machine with an M1 chip until end of next week so I can't do any further testing on the exact issue, i.e., why not all the locations would have the same number of hits. I think this is a crucial starting place to investigate: https://github.com/ksahlin/strobealign/blob/3a97f6b817824235a36ca8f9c5710ee533d3ad56/src/aln.cpp#L120C1-L121C65
I will have a look later.
Thank you! I looked a bit into this part of the code, and I think the problem is in the number of generated hits, but I did not go there yet. With the full reference panel best location has 9 hits and most have 5 hits, while with the subset panel best has 19 and most have 18 hits.
But also I guess I misunderstood the meaning of -S, maybe help message can be changed from "Try candidate sites with mapping score at least S of maximum mapping score" to something like "Try candidate sites with at least S minimizer hits compared to the best alignment".
Thanks for looking into this.
With the full reference panel best location has 9 hits and most have 5 hits, while with the subset panel best has 19 and most have 18 hits
It's strange that the mapping sites wouldn't all have the same number of hits if they are all identical and perfect. I will figure it out once I get to do some debugging.
But also I guess I misunderstood the meaning of -S, maybe help message can be changed from "Try candidate sites with mapping score at least S of maximum mapping score" to something like "Try candidate sites with at least S minimizer hits compared to the best alignment".
Yes. We have been trying different measures (score/hits) for cutoff, so the description to this parameter should be updated as you suggest.
I have looked into this. A detailed answer follows below for documentation.
TLDR
Heuristics are overriding user set parameter -f. It can be fixed by setting -R 3 or more, but we should revise repetitive thresholds in strobealign so that -f doesn't get overridden. A fix will come.
Suggested fixes based on this issue
-S: Should say "Try candidate sites with at least S seed hits compared to the best alignment".-f: Thefind_nam_rescuefunction should use a seed repetitive limit based on-fas a rescue_cutoff instead of R*filter_cutoff, if user specified-fimplies a higher limit. This should be fixed here https://github.com/ksahlin/strobealign/blob/3a97f6b817824235a36ca8f9c5710ee533d3ad56/src/nam.cpp#L231 . Probably most conveniently fixed by saving the cutoff value induced by-fhere https://github.com/ksahlin/strobealign/blob/3a97f6b817824235a36ca8f9c5710ee533d3ad56/src/index.cpp#L237 as a new entry 'rescue_cutoff' into the structureIndexCreationStatistics-RRemove this parameter? The parameter is too involved, probably rarely used, and would not be needed if we instead use-fas suggested in previous point.
Explaining the behaviour seen in the issue
First, here comes an explanation as to why two locations got 9 hits and the remaining got 5 hits. The behavior stems from the fact that strobemer seeds are not "fully" canonical (explanation at bottom), together with seed abundance cutoff heuristics in strobealign.
The read has 36 (18 fw 18 rc) seeds. Nine of those seeds have an abundance of 2 (see bottom), because two genome copies contain an inversion where the seeds are not masked. The remaining 27 seeds have abundance around 214-216 something.
Now, the second confusion is that the user set -f 0 is overridden. Strobealign uses an internal threshold 100 to redirect seed finding to a function find_nams_rescue - a more efficient function used when a read is considered to have a large fraction of seeds classified as repetitive (above filter_cutoff).
filter_cutoff = std::min(100U, filter_cutoff); // limit upper cutoff for normal NAM finding - use rescue mode instead
However, the find_nams_rescue function ignores the -f value. Instead, the function uses Rfilter_cutoff (-R is a rescue level parameter, default: 2). This function will collect all seeds below abundance Rfilter_cutoff` (2*100 for this data) and at most 5 seeds over this value if it has not already collected 5 seeds. Hence, all locations except the two inversion sites get 5 seeds, and the inversion sites get 9 seeds.
While we cannot easily fix a full canonicity of the seeds at the moment (we have been considering variants), strobealign should at least not override user preferences on -f. It is important to point out that when user set parameters are abided, all locations get 18-19 seeds.
Limited canonicity
Consider three consecutive syncmers (A,B,C). In strobemer seeding, in forward direction, strobe A may be paired with strobe C, creating strobemer (A, C), but in reverse direction, strobe C may be paired with strobe B, creating (C, B). So even though we design the hash computation so that strobemers (A, C) and (C,A) hash to the same value, it's not fully canonical because we have (A, C) and (C, B) as seeds in the different orientations. Seeds in the reference are computed in forward direction only, so e.g., the (C, B) seed might be unique to the two inversion locations. In a world without repetitiveness masking, this is not a problem (all hits in both directions match). However, if we mask all (A, C) because they are too repetitive it can create bias.
Thank you for looking into that! I will use the -R 3 fix for now then!
Great, and with more references just scale up R, where R*100 should be a value higher than the number of references to be sure.
Strobealign has another hard limit of 1000 (if ((rh.count > rescue_cutoff && cnt >= 5) || rh.count > 1000) {)which we should also consider making a user parameter for highly redundant references as a quick fix until better indexing solutions for highly redundant databases are explored.