
Correctly calculate Phred score mean

Open rhpvorderman opened this issue 2 years ago • 14 comments

FOR CONTRIBUTOR:

  • [ ] - I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • [ ] - License permits unrestricted use (educational + commercial)
  • [ ] - This PR adds a new tool or tool collection
  • [x] - This PR updates an existing tool or tool collection
  • [ ] - This PR does something else (explain below)

Hi, this is a pretty nasty bug in the filter. Take this example:

A read with a length of two bases has quality scores 10 and 30 (`+?` in the FASTQ file). What is the mean?

Naive answer: 20.

Correct answer:

  • 10 stands for 10 ** (10 / -10) == 10 ** -1 == 0.1
  • 30 stands for 10 ** (30 / -10) == 10 ** -3 == 0.001
  • Sum: 0.1 + 0.001 = 0.101
  • Average = 0.101 / 2 == 0.0505.
  • Calculate back into phred score: -10 * log10(0.0505) ~= 12.97

Note that the naive answer is 20, which stands for a probability of 0.01. This means that the naive answer overestimates the quality by a factor of five! That is pretty big. I have a test FASTQ file on my system with 5 million reads (HiSeq). None of them get filtered using the naive filter; about 4% get filtered using the correct filter.
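A minimal Python sketch of both calculations for this example (the function names are illustrative, not taken from the tool):

```python
import math

def naive_mean_quality(phred_scores):
    # Arithmetic mean of the Phred scores themselves.
    return sum(phred_scores) / len(phred_scores)

def correct_mean_quality(phred_scores):
    # Convert each score to an error probability, average the
    # probabilities, then convert the result back to a Phred score.
    probabilities = [10 ** (q / -10) for q in phred_scores]
    mean_probability = sum(probabilities) / len(probabilities)
    return -10 * math.log10(mean_probability)

print(naive_mean_quality([10, 30]))    # 20.0
print(correct_mean_quality([10, 30]))  # ~12.97
```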

rhpvorderman avatar Dec 22 '21 14:12 rhpvorderman

Hmm, I'm not sure this should be fixed. You're of course correct with your PHRED score maths, however, the tool lets the user set a mean of scores threshold. What the current version does is in line with this description and easy to understand from a user's perspective. If the average base in a read has a score > the threshold the read will be kept. Your mathematically correct version is much harder to use in practice. In addition, if I remember my maths correctly, then calculating the arithmetic mean on the log-transformed values is equivalent to calculating the geometric mean of the original values? So, if anything, the tool is just ambiguous about what kind of mean it calculates but not outright wrong.

Maybe this whole thing just boils down to improving the tool help? Or if you really need this functionality, it should be added as an additional option?

wm75 avatar Dec 22 '21 18:12 wm75

Your mathematically correct version is much harder to use in practice.

If I put in 20 as the filter, I expect reads that on average have 1 in 100 bases wrong. As I showed in the example, reads with a 5 times worse error rate (about 1 in 20 bases wrong) get passed as well. How is that helpful? How is that easy to use in practice?

Like I said, I filtered a normal FASTQ file with both filters, and the naive, wrong filter did not filter anything, while the correct filter filtered 4%. What is the point of having such an incorrect filter if it lets so much garbage through?

EDIT: To further elaborate on this:

| Quality scores | Arithmetic mean | Correct mean |
|----------------|-----------------|--------------|
| 20, 20         | 20              | 20.00        |
| 19, 21         | 20              | 19.89        |
| 18, 22         | 20              | 19.55        |
| 17, 23         | 20              | 19.04        |
| 15, 25         | 20              | 17.60        |
| 10, 30         | 20              | 12.97        |
| 5, 35          | 20              | 8.01         |
| 0, 40          | 20              | 3.01         |

Note how the same arithmetic mean can represent reads that differ wildly in quality. It is therefore simply not useful as a meaningful filter.
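For reference, the table can be reproduced with a short loop like this (a sketch; the helper name is made up):

```python
import math

def phred_of_mean_probability(scores):
    # Phred score of the mean error probability ("correct mean").
    probs = [10 ** (q / -10) for q in scores]
    return -10 * math.log10(sum(probs) / len(probs))

for pair in [(20, 20), (19, 21), (18, 22), (17, 23), (15, 25), (10, 30), (5, 35), (0, 40)]:
    arithmetic_mean = sum(pair) / len(pair)          # always 20.0
    correct_mean = phred_of_mean_probability(pair)   # matches the "correct mean" column
    print(pair, arithmetic_mean, round(correct_mean, 2))
```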

rhpvorderman avatar Dec 24 '21 07:12 rhpvorderman

Tool version needs to be bumped.

Yes, it should probably be a major version update given the change in behaviour.

Maybe this whole thing just boils down to improving the tool help?

Indeed it should be more specific about how a FASTQ Phred score mean should be calculated.

rhpvorderman avatar Dec 24 '21 07:12 rhpvorderman

Hmm, I'm not sure this should be fixed. (...) Your mathematically correct version is much harder to use in practice.

Galaxy is first and foremost a scientific tool. Correctness is a requirement. Practical use is a consideration of course, but it is meaningless if the tool is not correct.

rhpvorderman avatar Dec 24 '21 08:12 rhpvorderman

Galaxy is first and foremost a scientific tool. Correctness is a requirement. Practical use is a consideration of course, but it is meaningless if the tool is not correct.

Completely true. But I guess both @wm75 and @rhpvorderman have a point here.

IMHO the tool is correct at the moment. The tool form and help state that the mean is calculated on the "quality score", i.e. phred scores. But I think it's certainly worth extending the help text in this respect.

I also think it is worth extending the tool to offer the possibility of calculating the mean on the transformed values.

This would maintain correctness, make the user aware of the differences, and also maintain reproducibility.

bernt-matthias avatar Dec 24 '21 08:12 bernt-matthias

I also think it is worth extending the tool to offer the possibility of calculating the mean on the transformed values.

This would maintain correctness, make the user aware of the differences, and also maintain reproducibility.

But that would present both approaches as viable alternatives for filtering. They are not. One approach is correct, the other is not.

As I showed in my examples, taking the arithmetic mean leads to all sorts of wildly variable and unpredictable results. It has no place in a scientific analysis. As such it should not be offered to users. This PR is a bugfix. It is not a feature enhancement.

I do agree that the wording in the manual should be clearer. FASTQ Phred scores are confusing and unintuitive because they are log scores. They are only log scores because that allows for a broad range of scores to be stored in the ASCII printable range. In any other format they might just have used a 16-bit floating point value to store the probability directly, and we might not have had this conversation.

and also maintain reproducibility.

Since the version number is bumped reproducibility will still be maintained. Reproducibility is only guaranteed when using the same version of a tool.

rhpvorderman avatar Dec 24 '21 10:12 rhpvorderman

@rhpvorderman I do understand your point and appreciate your work here. Please also try to understand my argument: as I said earlier the tool as it is calculates the geometric mean of the probabilities and you cannot call this wrong behavior. For example, with quality scores of 0,30,30 the tool calculates (0+30+30)/3=20, i.e. probability 0.01, which is (1*0.001*0.001)**(1/3). Explaining this to the user and offering both types of stats is the right way forward imo.
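In code, that equivalence is roughly the following (a sketch, not the tool's actual code; `math.prod` needs Python 3.8+):

```python
import math

scores = [0, 30, 30]
probs = [10 ** (q / -10) for q in scores]   # [1.0, 0.001, 0.001]

arithmetic_mean_of_scores = sum(scores) / len(scores)           # 20.0
geometric_mean_of_probs = math.prod(probs) ** (1 / len(probs))  # 0.01

# The arithmetic mean of the scores is the Phred transform of the
# geometric mean of the probabilities:
print(arithmetic_mean_of_scores,
      round(-10 * math.log10(geometric_mean_of_probs), 2))      # 20.0 20.0
```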

wm75 avatar Dec 24 '21 11:12 wm75

But that would present both approaches as viable alternatives for filtering. They are not. One approach is correct, the other is not.

As I showed in my examples, taking the arithmetic mean leads to all sorts of wildly variable and unpredictable results. It has no place in a scientific analysis. As such it should not be offered to users. This PR is a bugfix. It is not a feature enhancement.

I do agree that the wording in the manual should be clearer. FASTQ Phred scores are confusing and unintuitive because they are log scores. They are only log scores because that allows for a broad range of scores to be stored in the ASCII printable range. In any other format they might just have used a 16-bit floating point value to store the probability directly, and we might not have had this conversation.

You certainly have a point here. Still, the tool does what it says in a mathematically correct way (but the help needs to be improved), even if that may be considered wrong, unintuitive, or unpredictable. So maybe we shouldn't forbid users from doing "wrong" things, but we should guide them by extending the help.

With the change, the option "mean of scores" would no longer calculate the mean of the scores, but the mean of the error probabilities converted back to a score.

Furthermore other tools (I know this might be a weak argument) are calculating this average in the same way, e.g.

  • https://github.com/OpenGene/fastp/commit/a540c613f65a7fc7700d9717eb1c8d081047eeef
  • https://github.com/s-andrews/FastQC/blob/bfde9d31aa94f29e9c58b6e91574a08e57335a4a/uk/ac/babraham/FastQC/Modules/PerSequenceQualityScores.java#L59 (not sure if this is the correct line in the code .. I can't/won't read java)

so users might expect the calculation to be done this way.

Since the version number is bumped reproducibility will still be maintained. Reproducibility is only guaranteed when using the same version of a tool.

You got a point here.

So, I'm still arguing in favor of offering both options with extended help (which may even contain your example as a warning).

bernt-matthias avatar Dec 24 '21 11:12 bernt-matthias

Wondering what the meaning of both means is anyway... is it a good idea / mathematically correct to sum (transformed) probabilities?

bernt-matthias avatar Dec 24 '21 11:12 bernt-matthias

Please also try to understand my argument: as I said earlier the tool as it is calculates the geometric mean of the probabilities and you cannot call this wrong behavior.

First: the tool states it calculates the "mean" and not the "geometric mean", so the fact that it calculates the geometric mean is an unintentional effect.

The intent of the user (in my opinion) is to filter out reads with very low quality from those with high quality. As I showed in my examples, the geometric mean cannot properly discern between reads which have wildly differing qualities. To add one extra example: a read with half of the bases at quality 40 and the other half at quality 0. In this read half of the bases are probably wrong. This is a very, very low quality read. Still, it passes a filter of geometric mean 20. Is this the user's intention? It will definitely not pass the new filter.

So I cannot understand your argument. Which users are served by giving them the geometric mean? What scientific workflows benefit from doing it this way? Why should the geometric mean be an option in this tool?

Furthermore other tools (I know this might be a weak argument) are calculating this average in the same way, e.g.

Yes, that is completely understandable. It is the most intuitive way to do it. Everyone falls for this trap unless they have been forewarned or investigate the issue very closely. Case in point: https://gigabaseorgigabyte.wordpress.com/2017/06/26/averaging-basecall-quality-scores-the-right-way/. This tool also used the arithmetic mean until the author noticed some irregularities and fixed the averaging to be correct.

The mean as it is calculated now is just not helpful for users. It does not mirror their intention of getting rid of low quality reads (see the table I posted: reads with a quality as low as 3 get passed through a filter of 20...). I am perfectly willing to adapt this PR to accommodate all the requirements, but I will not work on something that presents the geometric mean as a proper filter option. I feel that that would actively harm users by not doing what they want from a filter (see examples above).

rhpvorderman avatar Dec 24 '21 12:12 rhpvorderman

How do we proceed here? Can we add both and document both?

bgruening avatar Jan 12 '22 18:01 bgruening

How do we proceed here? Can we add both and document both?

Would be my favored solution.

@rhpvorderman sorry for not answering:

First: the tool states it calculates the "mean" and not the "geometric mean", so the fact that it calculates the geometric mean is an unintentional effect.

We have to be careful with the wording here. If you write that it calculates the mean and not the geometric mean, then this statement is incomplete, because it omits which values the mean or geometric mean is calculated over. The tool states that it calculates the mean of the (phred) scores, which is what it does.

In essence there is

  1. the mean of the (phred) scores (which is equivalent to the geometric mean of the probabilities)
  2. the (phred score of the) mean of the probabilities
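In code, roughly (a sketch of the two definitions; the function names are just for illustration):

```python
import math

def mean_of_phred_scores(scores):
    # 1. Arithmetic mean of the Phred scores
    #    (equivalent to the geometric mean of the error probabilities).
    return sum(scores) / len(scores)

def phred_of_mean_probabilities(scores):
    # 2. Phred score of the (arithmetic) mean of the error probabilities.
    probs = [10 ** (q / -10) for q in scores]
    return -10 * math.log10(sum(probs) / len(probs))
```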

The intent of the user (in my opinion) is to filter out reads with very low quality from those with high quality.

Probably correct, but I would not dare to speculate about the intent of users. We always need to make the wording clear such that the user really gets what the tool promises. At the moment this is the case, since the tool states that it calculates the mean of the scores and it does calculate the mean of the scores.

With your change, the tool would change behavior (since it would calculate the phred score of the mean of the probabilities) without changing the text.

As I showed in my examples, the geometric mean cannot properly discern between reads which have wildly differing qualities. To add one extra example: a read with half of the bases at quality 40 and the other half at quality 0. In this read half of the bases are probably wrong. This is a very, very low quality read. Still, it passes a filter of geometric mean 20. Is this the user's intention? It will definitely not pass the new filter.

Your examples in the help text would be of great value in helping users make a good choice, which will likely often be the phred score of the mean of the probabilities.

As a last note: I favor including both options because it appears more transparent to the user. Even if we change the behavior and the text of the option/select, frequent users of the tool will just keep using option number 3 and likely overlook the change. The same holds for workflows.

bernt-matthias avatar Jan 12 '22 19:01 bernt-matthias

@bernt-matthias You have convinced me. I have no time to work on this now, but I will ping this thread when I have updated the code. Thank you for taking your time on this. I know I was really stubborn. Sorry for that.

rhpvorderman avatar Jan 14 '22 06:01 rhpvorderman

I have been thinking about this a bit more lately. I have concluded that using the Phred score in the user interface is bound to lead to bad decisions. It is much better to remove these options. Instead, filters should work on the "average error rate" (with some help text stating that Q=30 -> error rate 0.001). That way the result will feel more intuitive for the user, as well as being correct and meaningful.
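For illustration, a filter on the average error rate could look roughly like this (a sketch; the threshold and names are hypothetical):

```python
def mean_error_rate(phred_scores):
    # Average per-base error probability of a read.
    return sum(10 ** (q / -10) for q in phred_scores) / len(phred_scores)

# Keep a read only if its average error rate is at most 0.001 (the Q30 equivalent).
keep = mean_error_rate([10, 30]) <= 0.001   # False: 0.0505 > 0.001
```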

rhpvorderman avatar Jun 01 '22 06:06 rhpvorderman