strobealign icon indicating copy to clipboard operation
strobealign copied to clipboard

Adressing Issue #75 with some initial refactoring, both functions now…

Open PaulPyl opened this issue 1 year ago • 5 comments

… call a helper that returns a reference to a vetor of alignments and align_PE simply returns the best (by SW score as it did before) while align_SE_secondary_hits does it's usual thing with the vector.

I am relatively certain that this will do the same thing as before, however since we now have unit tests, adding some unit tests that check that alignments look as they should with test data could be really good.

PaulPyl avatar Sep 15 '22 10:09 PaulPyl

Oh great, that’s a lot of duplicated code we can get rid of! The tests don’t pass, though, so something is different. It appears to be different mapping qualities. Here’s how I usually run the tests locally:

export PATH=$PWD/build:$PATH
make -C build && tests/run.sh

The first line is currently necessary; however, I think I should rewrite the run.sh script so that it does not require strobealign to be in the PATH.

Since this is just a refactor, we definitely need a test that verifies that the output is the same as before when running with -N 2. The easiest way to do this would be to add yet more strobealign invocations to tests/run.sh. Feel free to do so, but we I think that the number of files in tests/ is getting too big too fast, and we need to find a better way for testing. I’ll open a separate issue about this.

BTW, feel free to use shorter branch names. IMO, they don’t need to be very descriptive because the name disappears once the PR is merged. I think it is more valuable to be able to type them in on the command line without resorting to copy&paste.

marcelm avatar Sep 15 '22 13:09 marcelm

I will keep on working on this next week, I am off tomorrow and Monday. But there is a get_alignment and get_MAPQ function used by align_PE that I want to use for this, since it would make sense. However right now using those gives me the wrong mapping qualities so I will have to do some more digging first. I think this might overlap with the issue about MAPQ being 0 sometimes (#52).

I created the branch with a button on GitHub to connect it to the issue and it set the name like that, but indeed it is a bit overkill :)

PaulPyl avatar Sep 15 '22 18:09 PaulPyl

A curious observation here is also, that when I compare my SE result with the test SE result and the test PE result for the same read, the MAPQs are not identical in the test data either:

(base) paul@Paul-Theodors-MacBook-Pro build % grep "SRR1377138.6\t" test.se.sam                                     
SRR1377138.6	0	NC_001422.1	649	60	1S184=1X115=	*	0	0	[...]

(base) paul@Paul-Theodors-MacBook-Pro build % grep "SRR1377138.6\t" ../tests/phix.pe.sam                            
SRR1377138.6	99	NC_001422.1	649	0	1S184=1X115=	=	719	-371	[...]
SRR1377138.6	147	NC_001422.1	719	0	30=1X83=1X185=1S	=	649	371	[...]

(base) paul@Paul-Theodors-MacBook-Pro build % grep "SRR1377138.6\t" ../tests/phix.se.sam
SRR1377138.6	0	NC_001422.1	649	8	1S184=1X115=	*	0	0	[...]

So now I get a 60, the SE test data wants and 8 and the PE test data a 0, presumably the mapping quality should not differ between SE and PE mode? at least the CIGAR strings are all the same and the locations as well.

Anyway, I will poke around get_MAPQ for a bit more to see if I can find out what is happening here.

PaulPyl avatar Sep 16 '22 06:09 PaulPyl

Without commenting at a detailed level. I do think that MAPQ is different if there is extra confidence for the mated read being aligned consistently nearby. This is how i remember trying to implement it. I think the other aligners would do that too (think I looked at BWA mem for this), although I am not sure.

ksahlin avatar Sep 16 '22 08:09 ksahlin

Nevermind previous answer (was before coffee :). While my comment still holds, your particular set of observations indeed look suspicious - PE having lower confidence than SE. MAPQ is indeed something we should look at in detail. Your observation goes in the same trend as our variant calling results, strobealign has high precision but a bit lower sensitivity. I think more consistent MAPQ would improve these results.

ksahlin avatar Sep 16 '22 08:09 ksahlin

I didn't have a lot of time this week to check it more clearly but I was wondering about the current implementation of MAPQ as // MAPQ = 40(1−s2/s1) ·min{1,|M|/10} · log s1 in the get_MAPQ function.

Where did this definition arise? As I understand it we would want to put our best guess for the probability that the alignment is wrong in the MAPQ score. So here we take 1 minus the ratio of best to second-best alignment and scale that to 40, i.e. if the alignments are equally good then we end up with MAPQ 0. Since 0*.

Now here is a fun question, if I have exactly two alignments and they are equally good and not discarded because of bad alignment score, would I assume the probability of a given alignment being wrong as 1, implying a MAPQ of 0, or would I assume the probability is 1/<#equally good alignments> and in this case 0.5 with a resulting MAPQ of around 3?

It seems like there is a bit of an issue with this, in the sense that there are no 'correct' MAPQ scores we can benchmark against, since that really depends on the algorithm and our choices.

For reference: -10*log10(c(0.0001,0.5,0.9999,1.0)) 40 3.01029 0.00043 0

PaulPyl avatar Sep 22 '22 09:09 PaulPyl

Not at a computer so excuse my brevity and guessing.

First, our formula is from minimap2 I believe (see paper) if my memory is correct. If not, it’s from BWA.

Good example. I think this is where MAPQ scores from BWA/Bowtie2/tophat differs. Tophat uses the p=0.5 -> MAPQ=3 while bowtie or BwA use 0 whenever multiple identical. I’m not sure what makes sense. Following the initial interpretation of MAPQ, 3 would perhaps mimic the idea of the formula better.

see paper in issue #25 for some ideas.

ksahlin avatar Sep 22 '22 09:09 ksahlin

Thanks, I will have a look over the paper as well. Maybe we need to have a discussion at some point to define what we want to go with. It might just be doing the same as minimap2?

PaulPyl avatar Sep 22 '22 09:09 PaulPyl

I think we can do better because our seeding is different to minimap2 and we should probably should take advantage or at least customize for that.

If any mapper is worth replicating, it is BWA-MEM (see the Twitter thread linked to in issue #25 ). Basically, for variant calling, it matters more that the aligner is better predicting whether it is correct rather than being actually correct with its mappings (thread summary). This is where BWAMEM shines.

so a first concrete goal is to identify and fix clear differences to BWA-MEM. For example, the instance you identify above seems not to make sense in strobealign. What is the BWA-MEM mapping equivalent?

ksahlin avatar Sep 22 '22 10:09 ksahlin

I’ll try to move the discussion over to #25 (discussion in this PR should preferably be only about the refactoring itself).

marcelm avatar Sep 23 '22 10:09 marcelm

Paul, your most recent commit made it so that all the changes are reverted (so the PR is "empty"). What are your plans for this PR?

marcelm avatar Oct 10 '22 08:10 marcelm

I commit here the changes I wanted to do to align_SE, making it use get_alignment and get_MAPQ internally. The checks fail because the MAPQ values are not identical, but since there is no real ground truth this is a bit of an arbitrary hurdle, at least until we have defined how our MAPQ calculations should be done (as we discuss in #25). Fun fact, since I merged the main branch I can't compile it on my mac anymore, I wonder if anyone else has a mac to check if this is a general problem or if I just messed up my local environment somehow o.O

PaulPyl avatar Oct 18 '22 07:10 PaulPyl

@PaulPyl , I am also having problems installing on mac as I reported in https://github.com/ksahlin/StrobeAlign/issues/102. I haven't tried to compile strobealign since before summer though, so cannot say whether this is a recently introduced problem.

ksahlin avatar Oct 21 '22 13:10 ksahlin

Ok, this PR now compiles again but the checks fail because the MAPQ values are not matching anymore. This relates to #25 but I am unsure how to proceed. We should rework aln.cpp so that the align_SE function calls get_alignment and get_MAPQ and then make sure those two work as expected. For get_MAPQ we maybe need to define what "as expected" would mean.

PaulPyl avatar Oct 26 '22 10:10 PaulPyl

I wish I had a good solution for this.

I think the MAPQ question is more of an open-ended research question. Do the reads seem to have a more logical MAPQ value after your fixes in this pull request? If so I wouldn't mind a merge.

It is on my plate to implement a smaller benchmark for this (in addition to rerunning the larger benchmarks I used for the paper).

What does Marcel think?

ksahlin avatar Oct 26 '22 13:10 ksahlin

I closed this and deleted the branch, I think that issue needs a new run at it. Maybe after addressing the MAPQ score calculation.

PaulPyl avatar Oct 27 '22 13:10 PaulPyl

Hm, sorry; it looked so nice to get rid of this amount of code ... maybe we should try to duduplicate/refactor this in smaller steps?

Please let me know when/if you start to work on this again. I wanted to do a couple of things that touch the align_SE function and don’t want to cause unnecessary merge conflicts.

marcelm avatar Oct 27 '22 22:10 marcelm

You can go ahead and work on align_SE for now, I will let you know if I come back to it.

PaulPyl avatar Oct 28 '22 08:10 PaulPyl