tracy
How are the base quality scores generated?
Hi,
I am using tracy assemble to assemble between 2 and 4 trace files. I am outputting the consensus as a .fastq file, and then aligning this to a reference sequence.
Downstream, I am performing some analysis that filters on per-nucleotide quality scores, and I am not sure I understand how these are translated from the base signal in the chromatogram to the base quality of the consensus calculated within tracy assemble. Typically, I only see 2 different base quality scores in a consensus (e.g. 19 and 24).
Do you have any insight into this?
I'm calling tracy like so:
tracy assemble \
--format fastq \
--inccons \
--trim 3 \
--outprefix ${colony_id} \
colony_1_p1.ab1 colony_1_p2.ab1
The quality scoring is indeed a bit of an issue because the input trace qualities are not very useful. The assemble command simply scales a flat quality prior by the fraction of traces supporting the consensus nucleotide. With 2 input traces, only 1 or 2 traces can support the consensus nucleotide, which is why you see just two values. With more input traces, you should see a wider range of quality values.
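To illustrate why the number of distinct quality values is bounded by the number of traces, here is a minimal Python sketch. It is not tracy's actual implementation: the prior value of 40, the linear scaling, and the function names are all assumptions made for illustration, and the sketch does not reproduce the exact values (e.g. 19 and 24) reported above.

```python
def consensus_quality(supporting: int, total: int, prior_q: int = 40) -> int:
    """Scale a flat quality prior by the fraction of supporting traces.

    prior_q = 40 and the linear scaling are assumptions for illustration,
    not values taken from tracy.
    """
    if total <= 0 or not 0 < supporting <= total:
        raise ValueError("need 0 < supporting <= total")
    return round(prior_q * supporting / total)


def possible_qualities(total: int, prior_q: int = 40) -> list[int]:
    """All quality values that can occur at a column covered by `total` traces."""
    return sorted({consensus_quality(k, total, prior_q) for k in range(1, total + 1)})


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, "traces:", possible_qualities(n))
```

Under these assumptions, a column covered by n traces can take at most n distinct quality values, regardless of the per-base qualities stored in the input traces.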
OK thanks - that's interesting. I'm using Tracy to detect errors in sequencing data, which can range from 1 trace (where I use basecall) to 4 traces (assemble).
From your explanation, it sounds like forming a consensus between 2 traces for a given nucleotide doesn't consider the quality of the base call, but rather just looks at the fraction of traces supporting the consensus.
Below summarises my understanding for 4 different base quality configurations for the assembly of 2 trace files - is this accurate? To my mind, the 3rd and 4th scenarios should have lower quality values than the 1st.
Part of the problem for me is that I want to have some estimate of the per-base quality score, so that I can confidently calculate the per-base error rate. In practice, this is hard using tracy because the quality scores change depending on how many trace files I use, and don't seem comparable between a 2-trace assembly and a 4-trace assembly.
Is there a workaround?
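For context, the per-base error rate mentioned above follows from the standard Phred definition, P = 10^(-Q/10). A minimal Python sketch of that conversion is below; it assumes the FASTQ quality string uses the common Phred+33 encoding, and the function names are hypothetical. It only converts whatever scores are present, so it does not by itself make 2-trace and 4-trace assemblies comparable.

```python
def phred_to_error_prob(q: int) -> float:
    """Standard Phred relationship: P(error) = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def expected_errors(quality_string: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities for one FASTQ quality line.

    offset = 33 assumes Phred+33 encoding; adjust if the file differs.
    """
    return sum(phred_to_error_prob(ord(c) - offset) for c in quality_string)


if __name__ == "__main__":
    # '5' encodes Q20 under Phred+33, i.e. a 1% error probability per base.
    print(phred_to_error_prob(20))
    print(expected_errors("55555"))
```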
@tobiasrausch, you wrote: "the input trace qualities are not very useful"
This piqued my interest; would you mind expanding on it a bit? In my department, one of the concerns I come across as a proponent of tracy is the lack of informative quality scoring and the fact that Ns appear in our sequences at a very low rate compared to other basecalling algorithms. Combined, these attributes make my colleagues cautious.