How are the base quality scores generated?

Open nriddiford opened this issue 2 years ago • 3 comments

Hi,

I am using tracy assemble to assemble between 2 and 4 trace files. I am outputting the consensus as a .fastq file and then aligning it to a reference sequence.

Downstream, I am performing some analysis that filters on per-nucleotide quality scores, and I am not sure I understand how these are translated from the base signal in the chromatogram to the base quality of the consensus calculated within tracy assemble. Typically, I only see 2 different base quality scores on a consensus (e.g. 19 and 24).

Do you have any insight into this?

I'm calling tracy like so:

tracy assemble \
            --format fastq \
            --inccons \
            --trim 3 \
            --outprefix ${colony_id} \
            colony_1_p1.ab1 colony_1_p2.ab1

nriddiford avatar Feb 15 '22 14:02 nriddiford

The quality scoring is indeed a bit of an issue because the input trace qualities are not very useful. The assemble command simply scales a flat quality prior by the fraction of traces supporting the consensus nucleotide. With 2 input traces, only 1 or 2 traces can support the consensus nucleotide, so you see just two quality values. With more input traces, you should see a wider range of quality values.
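To make the idea concrete, here is a minimal Python sketch of "a flat prior scaled by the fraction of supporting traces". It is not tracy's code: the function names, the prior of 40 and the plain linear scaling are all assumptions chosen for illustration, and the numbers are not meant to reproduce the 19 and 24 reported above. It only shows why the number of distinct quality values is bounded by the number of input traces.

```python
# Hypothetical illustration only; not taken from the tracy source.
# FLAT_PRIOR and the linear scaling are assumed values.
FLAT_PRIOR = 40


def consensus_quality(supporting, total, flat_prior=FLAT_PRIOR):
    """Scale a flat quality prior by the fraction of traces that agree."""
    if not 1 <= supporting <= total:
        raise ValueError("supporting must be between 1 and total")
    return round(flat_prior * supporting / total)


def possible_qualities(total, flat_prior=FLAT_PRIOR):
    """All distinct qualities a consensus base can get with `total` traces."""
    return sorted(
        {consensus_quality(s, total, flat_prior) for s in range(1, total + 1)}
    )


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, "traces:", possible_qualities(n))
```

Under these assumptions a 2-trace assembly can only ever produce two values, while a 4-trace assembly can produce up to four, which is also why scores from the two are not directly comparable.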

tobiasrausch avatar Feb 16 '22 14:02 tobiasrausch

OK thanks - that's interesting. I'm using Tracy to detect errors in sequencing data, which can range from 1 trace (where I use basecall) to 4 traces (assemble).

As per your explanation, this sounds like forming a consensus between 2 traces for a given nucleotide doesn't consider the quality of the base call, and rather just looks at the fraction of traces involved in generating the consensus.

Below summarises my understanding for 4 different base quality configurations for the assembly of 2 trace files - is this accurate? To my mind, the 3rd and 4th scenarios should have lower quality values than the 1st.

[Image: Screenshot 2022-03-08 at 15 56 11]

Part of the problem for me is that I want to have some estimate of the per-base quality score, so that I can confidently calculate the per-base error rate. In practice, this is hard using tracy because the quality scores change depending on how many trace files I use, and don't seem too comparable between a 2-trace assembly and a 4-trace assembly.
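For reference, the usual way to go from a quality score to a per-base error rate is the Phred definition, p = 10^(-Q/10). The sketch below is a small Python example of that conversion applied to a FASTQ quality line. It assumes the standard Phred+33 (Sanger) encoding, which is not confirmed anywhere in this thread for tracy's output, and the helper names are made up for illustration. It also says nothing about whether the consensus qualities are well calibrated.

```python
# Assumes Phred+33 (Sanger) FASTQ encoding; helper names are illustrative.

def phred_to_error_prob(q):
    """Phred definition: error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def fastq_qualities(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(c) - offset for c in quality_line]


def mean_error_rate(quals):
    """Average per-base error probability over a list of Phred scores."""
    if not quals:
        raise ValueError("no quality values given")
    return sum(phred_to_error_prob(q) for q in quals) / len(quals)


if __name__ == "__main__":
    quals = fastq_qualities("4499")  # decodes to 19, 19, 24, 24
    print(quals, mean_error_rate(quals))
```

Because the scores depend on how many traces went into the assembly, an error rate computed this way is only comparable between consensus sequences built from the same number of traces.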

Is there a workaround?

nriddiford avatar Mar 08 '22 15:03 nriddiford

@tobiasrausch

"the input trace qualities are not very useful"

This piqued my interest. Would you mind expanding on it a bit? In my department, one of the concerns I come across as a proponent of tracy is the lack of informative quality scoring, along with the fact that Ns appear in our sequences at a much lower rate than with other basecalling algorithms. Combined, these attributes make my colleagues cautious.

blex-max avatar Mar 08 '22 16:03 blex-max