vsearch
Why is a FASTQ quality value above qmax treated as a fatal error?
When I used --fastx_filter to trim FASTQ files (seemingly a very easy task),
$ vsearch --fastx_filter in.fastq --fastq_stripleft 20 --fastq_stripright 20 --fastaout out.fa
I encountered a fatal error:
Fatal error: FASTQ quality value (75) above qmax (41)
By default, quality values range from 0 to 41.
To allow higher quality values, please use the option --fastq_qmax 75
So I added --fastq_qmax 75. However, another fatal error occurred:
Fatal error: FASTQ quality value (76) above qmax (75)
By default, quality values range from 0 to 41.
To allow higher quality values, please use the option --fastq_qmax 76
An example is https://www.ncbi.nlm.nih.gov/sra/SRR19317902, whose reads look like:
@SRR19317902.1 M05585:4:000000000-BPG86:1:1101:8340:1726_CONS_SUB_SUB_CMP length=313
TTTATCTTCATCTATTGGAGCTCCAGGTCCTGCAGTTGACTTGGCTATTTTTTCACTTCACATAGCTGGTGCATCTTCAATTTTAGGTGCTATTAACTTTATCACTACTATATATAATATGCGTGCCCCTGGAATGACACTTCACAAAATGCCTTTGTTTGTTTGGTCCATTTTAGTAACGGCTTTTTTATTACTACTATCCTTACCAGTATTAGCTGGTGCAATCACTATGCTTTTAACCGATAGAAACTTTGGTACAGCTTTTTTTGCAGCAGAAGCTGGAGGTGATCCAGTACTTTTCCAACATCTATTT
+SRR19317902.1 M05585:4:000000000-BPG86:1:1101:8340:1726_CONS_SUB_SUB_CMP length=313
HGIHIHHIIIIIIIHIIHIIIFHIHI:GHFFGHIIHE.EEHEHHBHHHIHHIII>BHHIEEEGIHIHIIIIIIIHHI;HIIIIIHIIIHHIH;GHCGHIlef\abmllj_hQ`ddgX\h]ghigddkkalkab]lfgkcg`lmllmlhi`bYi]imlljblilg^Yll^cklkki^lkmliifhhjmmmmmlmli^^_hmd]hlmlllmlllgIIIHHFIIGHHDCHHGIIIEGCIEHEIIIIIIIIIIIIHIHEIIIIIIIIIHIHHIIIIIHIIIHIIIIIIIIIHIHEIIHHHB>HHHIIHIIHHHIIIH
@SRR19317902.2 M05585:4:000000000-BPG86:1:1101:19814:1776_CONS_SUB_SUB_CMP length=313
ATTATCATCTAAAACCGGACAACCTGATCCAGCAATGCATTTAGCTATTCTGGCGCTACACGTAGCTGGAGCTTCTTCAATTTTAGGACCGATAAATTTTATAACTACAATTCTAAACATGAGAACTCCTGGTATGACACTACATAAGATGCCACTATTTTCATGGTCAATTTTAGTAACTGCGTTTTTACTTCTATTATCTTTACCAGTTTTAGCGGGAGCAATTACAATGCTGCTCACAGACAGAAATTTTGGAACAGCTTTTTTTGAAGCTCAAACAGGAGGAGACCCAGTTTTGTTTCAACATTTATTT
+SRR19317902.2 M05585:4:000000000-BPG86:1:1101:19814:1776_CONS_SUB_SUB_CMP length=313
:G;HEH.=8E=8E>8.9BHEB8..E...EB.8.GEH..E.8BH.8.;E>G..;.-:?-9G><.C6HH>.;D>.D.C;H...E>H...E.;6-B-6.7.9b]Naha^RGfRPb^j]GhgllDKCQUIe\kmcZlll]EKj_`QDedli[E[KHXN,UQ^ajklQbVPcdeGfQXQMUQc:[@4S`EH^k\OPcg[OUjgEcFagjFXEVRQ?MS-@9HHAH;IIIHHGHAHE.HH?E.EE.E.;.>>H?.H.AH>A>IHIIHEH>HCH;E<8..IHEG8EIIIGGE8E..:IEHG8EIIIIIGHIHE>HHEGEH
@SRR19317902.3 M05585:4:000000000-BPG86:1:1101:10609:1789_CONS_SUB_SUB_CMP length=313
TCTAAGTCATATTACGAGCCACTCTGGCGGTGAAGATGATTTAGCTATTTTCAGTTTACATTTATCTGGAGCGAGCAGTATTTTAGGGGCGATTAACTTTATTACCACTATCTTTAACATGCGTGGGCCTGGTCTGGGCTTCCATCGCTTACCTTTATTTGTGTGGTCTGTTCTGATTACGGCCTTCTTACTTCTTCTCTCTTTACCGGTTTTAGCGGGAGCGATTACGATGCTGTTAACGGATCGTAACTTCAATACCTCGTTCTTTGATCCGGCGGGCGGAGGTGACCCTATTTTGTGCCAGCACCTTTTT
+SRR19317902.3 M05585:4:000000000-BPG86:1:1101:10609:1789_CONS_SUB_SUB_CMP length=313
-BGHECE>;;GH;.GB.B--..8>.E>.----.EH.E....E8>E88;.EG;E>EE.G>HG;8EAACA..;.:BH9-D..7>>7EH:FE<<-B>8.<G.f[kkillh[lRR`NN]g[PgfljmbjijZbQcMQZ;>K`PLLaP\I_jcNOS]<]G7OORRZ[]VROMQbkgia]eik]\IaZhVjcdaQ`jhiY^ciaklibdkRehkXYXL^IEEH9IHIF>E>HIGHH:HIHHFGIH.-HEH<HHGHCHHIH;D6??EIIEBH.IHBH<E9<EGGH<E>GH8H>HHCGCEGB8>B>E.HE8:>EB9EGIHH
Above all, what I wondered is: why is a FASTQ quality value above qmax treated as a fatal error? I had thought that any quality value larger than 41 would be capped at 41.
In addition, I tried to use --fastq_chars and it gave the following results:
Reading FASTQ file 100%
Read 54687 sequences.
Qmin 42, QMax 109, Range 68
Guess: -fastq_qmin 9 -fastq_qmax 76 -fastq_ascii 33
Guess: Illumina 1.8+ format (phred+33)
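(For reference, the guessed quality range follows directly from the reported ASCII range and the phred+33 offset, a quick sanity check:)

```shell
# In phred+33 encoding, quality = ASCII code - 33.
# The observed ASCII range 42..109 therefore corresponds to:
echo $(( 42 - 33 ))    # lowest quality value: 9
echo $(( 109 - 33 ))   # highest quality value: 76
```

These match the "-fastq_qmin 9 -fastq_qmax 76" guess above.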
Therefore, is it best practice to run --fastq_chars as a first step to determine --fastq_qmin, --fastq_qmax and --fastq_ascii, and then continue with the downstream analysis?
Above all, what I wondered is: why is a FASTQ quality value above qmax treated as a fatal error?
Reads with quality values encoded on more than 41 levels are relatively recent. Observing values outside the 0-41 range used to be a sign that something had gone wrong. Nowadays, maybe 0-93 should be the new default range for quality values?
Therefore, is it best practice to run --fastq_chars as a first step to determine --fastq_qmin, --fastq_qmax and --fastq_ascii, and then continue with the downstream analysis?
Yes, finding the qmax value beforehand could avoid a fatal error downstream. One possibility is to capture the value in a variable:
QMAX=$(printf "@s1\nA\n+\nI\n" | \
vsearch --fastq_chars - 2>&1 | \
grep -o -P "fastq_qmax *\K[[:digit:]]+")
Here QMAX would contain the value 40.
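(To make the parsing step above easy to see on its own, here is the same grep run against a canned "Guess" line rather than live vsearch output; in practice you would pipe `vsearch --fastq_chars your.fastq 2>&1` instead, where your.fastq is a placeholder name:)

```shell
# Extract the suggested qmax from a fastq_chars "Guess" line.
# The vsearch output is mocked with printf here so that only the
# grep parsing is demonstrated; \K discards the matched prefix,
# leaving just the digits (requires GNU grep for -P).
GUESS_LINE="Guess: -fastq_qmin 9 -fastq_qmax 76 -fastq_ascii 33"
QMAX=$(printf '%s\n' "$GUESS_LINE" | \
       grep -o -P "fastq_qmax *\K[[:digit:]]+")
echo "$QMAX"   # -> 76
```

The captured value can then be fed back into the filtering step, e.g. `vsearch --fastx_filter your.fastq --fastq_qmax "$QMAX" ...`.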
maybe 0-93 should be the new default range for quality values?
@torognes what do you think of that?
I agree that the world has moved forward and that some FASTQ files now contain q values above 41.
I also agree that the intention of this option was to help users detect old FASTQ files (phred 64), which is hardly a problem anymore.
Setting default qmax to 93 would essentially remove its effect, as it is the highest possible quality value obtainable with any printable character.
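(The 93 ceiling comes straight from the ASCII table, as a quick check shows:)

```shell
# Phred+33 stores quality q as the character with code q + 33.
# The highest printable ASCII code is 126 ('~'), so the highest
# encodable quality value is:
echo $(( 126 - 33 ))   # -> 93
```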
I am a bit sceptical to change default values and behaviour like this without a major version change, since it could potentially change results. But the change would only be to allow some files to be analysed which would previously fail.
Compatibility with usearch is another issue. I haven't checked what it does.
Another possibility is to issue a warning instead of a fatal error. Perhaps a warning at the first encounter and then another one after reading the whole file with the maximum q value.
@torognes I agree with issuing a warning instead of a fatal error. A fatal error is too drastic here.
Hello,
I also encountered this problem when I wanted to convert my Q20 FASTQ files to FASTA files. With this input
vsearch --fastq_filter BAR_trimmed_max950_q20.fastq --fastq_qmax 99 -fastaout BAR_max920_q20_qmax99.fasta
I got this output:
Sum of arguments to --fastq_ascii and --fastq_qmax must be no more than 126
I then changed --fastq_qmax to 90.
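(The constraint behind that message can be sketched as follows, assuming the default phred+33 offset; the encoded character fastq_ascii + fastq_qmax must still be printable ASCII, i.e. at most 126:)

```shell
# Check which qmax values are acceptable for a given offset.
ascii=33
for qmax in 99 93 90 ; do
    if [ $(( ascii + qmax )) -gt 126 ] ; then
        echo "qmax $qmax: rejected ($(( ascii + qmax )) > 126)"
    else
        echo "qmax $qmax: accepted"
    fi
done
```

So with offset 33, any qmax up to 93 is accepted, which is why 99 failed and 90 worked.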
What actually happens to the reads when I filter for a quality score of Q10 after previously filtering them for a quality score of Q20?