fastp
Quality scores encoding
Hello,
I am cleaning some Illumina data, some of which is pretty old, and I get a quality of 70 in the fastp report. When I analyze the same file with FastQC, the maximum quality score is 40 and it detects the encoding as Illumina 1.5.
Is there a way to know what encoding was found by fastp? Do you know what is happening here?
Thanks
@aureliendejode
Is there a way to know what encoding was found by fastp?
If you know the true encoding, the worst quality score under that encoding, and the worst quality reported by fastp, you should be able to infer the encoding assumed by fastp from the offset. In your case, a 40 turns into a 70, so you have a shift of 30. If FastQC determined the encoding to be Illumina 1.5, that implies a 'correct' offset of 64 (see Wikipedia for an overview of PHRED encodings). So to get to the 70, fastp must have used an offset of 64 - 30 = 34. However, I find it likely that there is an off-by-one error here and the actual offset used by fastp was 33, which would correspond to the Sanger/Illumina 1.8/PacBio encoding.
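The offset arithmetic above can be checked in a few lines of Python. This is only a sketch: the helper name `decode` is made up for illustration, and it simply subtracts an ASCII offset from a quality character.

```python
def decode(qual_char: str, offset: int) -> int:
    """Return the score one FASTQ quality character encodes under an ASCII offset."""
    return ord(qual_char) - offset


# The character that encodes Q40 when the offset is 64.
top_char = chr(40 + 64)

as_phred64 = decode(top_char, 64)  # read with the 'correct' offset
as_phred33 = decode(top_char, 33)  # read as if the data were offset-33

print(top_char, as_phred64, as_phred33)  # h 40 71
```

Reading an offset-64 character with an offset of 33 shifts every score by 31, not 30, which fits the suspicion that the offset actually used was 33.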
If I have an old dataset using an unknown encoding, I typically check for reads that are all 'N's (unfiltered raw sequencing data usually contains at least a few failed clusters that get reported as such) and inspect their quality scores. In my experience, these always correspond to the worst quality score possible under the encoding used. So if the all-'N' reads have an all-'I' quality, I interpret this as Sanger encoding being used. If they are all-'h', Solexa or Illumina 1.3 remain, in which case I usually know the platform that was used. Analogously, all-'j' quality all-'N' reads indicate Illumina 1.5 and all-'J' ones Illumina 1.8 encoding. With PacBio I don't know if one can expect such indicative reads, but the read length alone should make it obvious in that case.
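The all-'N' check described above could be scripted roughly as follows. This is a minimal sketch, not a tested tool: the function name `all_n_quality_chars` is hypothetical, it assumes plain four-line FASTQ records (no wrapped sequences), and the example records are made up with an arbitrary placeholder quality character.

```python
from collections import Counter
from typing import Iterable


def all_n_quality_chars(lines: Iterable[str]) -> Counter:
    """Count the quality characters found in reads whose sequence is all 'N'."""
    counts = Counter()
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:  # header, sequence, '+', quality
            _header, seq, _plus, qual = record
            if seq and set(seq.upper()) == {"N"}:
                counts.update(qual)
            record = []
    return counts


# Made-up records, only to show the mechanics.
example = [
    "@read1", "ACGT", "+", "hhhh",
    "@read2", "NNNN", "+", "????",
]
print(all_n_quality_chars(example))  # Counter({'?': 4})
```

For a real file, the same function can be fed an open file handle, since iterating over it yields one line at a time. The characters it reports are then the ones to compare against the encodings listed above.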
I hope this helps, Marcel
Specify the --phred64 option if the data is Illumina 1.5; the quality scores will be converted to phred33 in the output.
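To illustrate what a phred64-to-phred33 conversion means for the quality string itself, here is a small Python sketch. It is not fastp's implementation, just the underlying shift of 31 ASCII positions; the function name is hypothetical, and no validation is done for characters that would fall below '!' after the shift.

```python
def phred64_to_phred33(qual: str) -> str:
    """Re-encode a quality string from ASCII offset 64 to ASCII offset 33."""
    return "".join(chr(ord(c) - 64 + 33) for c in qual)


# 'h' (Q40) becomes 'I' and 'B' (Q2) becomes '#'; the scores are unchanged.
print(phred64_to_phred33("hhBB"))  # II##
```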