
Specifications for Base Calling Accuracies Across Platforms - Suggestion

Open husamia opened this issue 2 years ago • 3 comments

Base calling is the systematic assignment of nucleobases to raw instrument signal: chromatogram peaks in electrophoresis, current changes caused by nucleotides passing through a nanopore in Nanopore sequencing, or per-cycle images in the sequencing by synthesis used by Illumina, with the Phred score as the quality metric. Base callers for Nanopore sequencing use neural networks trained on current signals obtained from real sequencing data.

I suggest adding a disclaimer to the specifications. There is no single accuracy metric that standardizes base calling accuracy across platforms. Continual claims from manufacturers about improved accuracy are misleading.

husamia avatar Jan 17 '22 16:01 husamia

I accept that it can be hard to directly compare figures between different manufacturers, as they may not be calibrated, but the meaning of the Phred score is defined, and poorly calibrated implementations aren't something the spec should be compensating for. Claims of improved accuracy are generally considered in the light of the manufacturer's own previous base-callers on the same technology.

There is, however, some room for nuance in what an error really means. For example, PacBio uses qualities all the way up into the 90s. That's a 1 in 10^9 chance of an error. That's unrealistic, both in terms of real accuracy (i.e. poor calibration) and because library preparation errors are likely to cause a de novo base mutation at a higher rate than 1 in 10^9. So what is the error really describing: the total chance of an error from DNA collection to BAM file, or just the final sequencing component post library creation? If the qualities are reasonably calibrated, though, I doubt that becomes a major issue.
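For reference, the arithmetic behind that figure can be sketched as below. This is a minimal illustration of the standard Phred relation Q = -10 * log10(P); the function names are my own and not taken from any spec or library.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality score implied by an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Compare a few qualities, including one in the 90s as discussed above.
    for q in (20, 30, 40, 90):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.0e}")
```

Note that this only converts between the two scales. It says nothing about whether a given base caller's reported qualities are actually calibrated.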

Edit: also note it's less clear how a quality value for a base incorporates the possibility of overcall and undercall. We could define it to be "given the assumption this base exists, this is the probability of it being correctly called", or preferably it could be some compound probability including the possibility it doesn't exist at all (overcall). Undercall is more debatable. The base may well be correct, but we've omitted reporting an additional neighbouring base. (See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3534400/ for an explicit example where Phred with overcall / undercall has been a problem.)

So I think we do have room for improvement in our definitions, but I'm not in favour of adding disclaimers.

jkbonfield avatar Jan 17 '22 16:01 jkbonfield

How about we suggest in the specifications a reference to the source of the base calls such as the software release date, training model, and instrument?

husamia avatar Jan 17 '22 18:01 husamia

The problem with base calling software is that it typically outputs FASTQ, which has no real concept of metadata and headers. Consequently, the downstream processes that create SAM et al usually lose track of the upstream software that produced the data, unless someone is being very conscious of data provenance and retrofits these fields after aligning.

I absolutely agree, though, that it would be great to track such things, and there is already provision to do this via @PG. Unfortunately, realistically I don't see it happening, irrespective of whether we make specific recommendations. The methods exist already, if people are interested enough. "You can lead a horse to water, but you cannot make it drink".

There is a "Recommended Practice for the SAM Format" section of the SAM spec, which perhaps could be strengthened with more recommendations, such as emphasising the importance of data provenance via PG lines indicating both software names and versions. My own pet peeve would be to recommend UR and/or M5 strings for SQ lines, so the reference sequences used are unambiguous rather than some anonymous "chr1". Generally this section of the spec has been quite weak, though, as we've kept clear of more discussion-oriented things.
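To make that concrete, here is a rough sketch of what such header lines could look like when generated programmatically. It uses only the Python standard library. The helper names, the program name "hypothetical_basecaller", its version and the reference URI are all made up for illustration. The M5 value follows my reading of the SAM spec (strip characters outside the printable range 33 to 126, uppercase, then MD5), so check it against the spec before relying on it.

```python
import hashlib


def sq_line(name, sequence, uri=None):
    """Build an @SQ header line with an M5 checksum and optional UR tag."""
    # Assumed normalisation: keep only printable ASCII (33..126), uppercase.
    seq = "".join(ch for ch in sequence if 33 <= ord(ch) <= 126).upper()
    md5 = hashlib.md5(seq.encode("ascii")).hexdigest()
    fields = ["@SQ", f"SN:{name}", f"LN:{len(seq)}", f"M5:{md5}"]
    if uri is not None:
        fields.append(f"UR:{uri}")
    return "\t".join(fields)


def pg_line(prog_id, name, version, command_line=None):
    """Build a @PG header line recording a program's name and version."""
    fields = ["@PG", f"ID:{prog_id}", f"PN:{name}", f"VN:{version}"]
    if command_line is not None:
        fields.append(f"CL:{command_line}")
    return "\t".join(fields)


if __name__ == "__main__":
    # Toy reference and a hypothetical base caller, for illustration only.
    print(sq_line("chr1", "acgtACGT\n", uri="file:///refs/example.fa"))
    print(pg_line("basecaller", "hypothetical_basecaller", "1.2.3"))
```

With M5 present, two files that both say "chr1" can be checked for whether they really mean the same sequence, and the @PG line keeps the base caller's identity attached to the data.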

jkbonfield avatar Jan 18 '22 09:01 jkbonfield