Fastp 1.0.1 throwing reads away when encountering empty read
Hello, this issue is related to #534 and #560, but I thought it was worth raising separately as it is not quite the same. I have several FASTQ files with empty sequences, as described in #560, e.g.
@K00371:221:H2NKWBBXY:6:1104:25824:8260 2:N:0:AGTACAAG
AGGCCAACAGGTAGGTCTCTGAAAAATGAAGAACAGATATTCATAAGCTATAATGAAATAATTCAAACTTATTTCATTACCTCCCTTGAATACAGACTA
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJ
@K00371:221:H2NKWBBXY:6:1104:25986:8260 2:N:0:AGTACAAG
+
@K00371:221:H2NKWBBXY:6:1104:26250:8260 2:N:0:AGTACAAG
ATTTAGTATAATAAACATTACCAAATCTTTCTTTCCTAAGGCACCATTCTGATTTATAGGTCAGGCTGCCTGACTCTAAGGAAATAACTGGTAAGGATAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJ
@K00371:221:H2NKWBBXY:6:1104:26880:8260 2:N:0:AGTACAAG
Running different versions of fastp on this file gives different outcomes. With version 0.24.1 I get an error:
@K00371:221:H2NKWBBXY:6:1104:25986:8260 2:N:0:AGTACAAG
Expected '+', got
ERROR: '+' expected
With the latest version 1.0.1 I get a warning instead, but the run finishes:
@K00371:221:H2NKWBBXY:6:1104:25986:8260 2:N:0:AGTACAAG
Expected '+', got
Your FASTQ may be invalid, please check the tail of your FASTQ file
WARNNIG: different read numbers of the 3435 pack
Read1 pack size: 256
Read2 pack size: 230
Ignore the unmatched reads
Read1 before filtering:
total reads: 879590
total bases: 87538237
Q20 bases: 86810309(99.1684%)
Q30 bases: 85308802(97.4532%)
Q40 bases: 75170632(85.8718%)
Read2 before filtering:
total reads: 879590
total bases: 87526817
Q20 bases: 86546608(98.8801%)
Q30 bases: 84989839(97.1015%)
Q40 bases: 74988693(85.6751%)
Read1 after filtering:
total reads: 876494
total bases: 87081659
Q20 bases: 86404863(99.2228%)
Q30 bases: 84958415(97.5618%)
Q40 bases: 74928827(86.0443%)
Read2 after filtering:
total reads: 876494
total bases: 87054021
Q20 bases: 86214331(99.0354%)
Q30 bases: 84730416(97.3308%)
Q40 bases: 74811097(85.9364%)
Filtering result:
reads passed filter: 1752988
reads failed due to low quality: 4938
reads failed due to too many N: 820
reads failed due to too short: 434
reads with adapter trimmed: 1854
bases trimmed due to adapters: 73802
Duplication rate: 7.83626%
Insert size peak (evaluated by paired-end reads): 163
But with the older version 0.20.0 it runs through with no issues.
Read1 before filtering:
total reads: 32508342
total bases: 3235377165
Q20 bases: 3201955685(98.967%)
Q30 bases: 3136726311(96.9509%)
Read2 before filtering:
total reads: 32508342
total bases: 3235090844
Q20 bases: 3175582248(98.1605%)
Q30 bases: 3085760868(95.3841%)
Read1 after filtering:
total reads: 32326557
total bases: 3213262362
Q20 bases: 3182580204(99.0451%)
Q30 bases: 3120584889(97.1158%)
Read2 aftering filtering:
total reads: 32326557
total bases: 3212818050
Q20 bases: 3162074455(98.4206%)
Q30 bases: 3076691989(95.763%)
Filtering result:
reads passed filter: 64653114
reads failed due to low quality: 343946
reads failed due to too many N: 5236
reads failed due to too short: 14388
reads with adapter trimmed: 826473
bases trimmed due to adapters: 9257428
Duplication rate: 11.3393%
Insert size peak (evaluated by paired-end reads): 161
Now, here's the worrying part: with version 0.20.0, the oldest, I get ~32M reads after filtering, but with the latest version 1.0.1 I only get ~877K. This suggests that fastp 1.0.1 simply stops at the empty read, rather than skipping it, and throws away every read that follows, whereas the older version was seemingly able to deal with this appropriately.
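The numbers in the 1.0.1 log appear consistent with that suspicion. As a rough sanity check (assuming pack indices are zero-based and that every pack before the mismatched one held a full 256 reads, neither of which is confirmed here), the reads processed up to the warning can be reconstructed from the log values:

```python
# Figures copied from the fastp 1.0.1 and 0.20.0 logs quoted above.
PACK_INDEX = 3435            # "different read numbers of the 3435 pack"
READ1_PACK_SIZE = 256        # "Read1 pack size: 256"
READ2_PACK_SIZE = 230        # "Read2 pack size: 230"
REPORTED_TOTAL = 879590      # 1.0.1 "total reads" before filtering (per mate)
FULL_RUN_TOTAL = 32508342    # 0.20.0 "total reads" before filtering (per mate)


def reads_up_to_mismatch(pack_index, full_pack_size, short_pack_size):
    """Reads seen if processing stops inside the mismatched pack.

    Assumption (not confirmed by the fastp source): pack indices start at 0
    and all packs before `pack_index` contain `full_pack_size` reads.
    """
    return pack_index * full_pack_size + short_pack_size


processed = reads_up_to_mismatch(PACK_INDEX, READ1_PACK_SIZE, READ2_PACK_SIZE)
print(processed, processed == REPORTED_TOTAL)
print(f"fraction of the file processed: {processed / FULL_RUN_TOTAL:.1%}")
```

Under those assumptions the reconstructed count equals the "total reads" that 1.0.1 reports before filtering, i.e. under 3% of what 0.20.0 reads from the same file.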
Here's the fastp command I ran in all cases, on the same pair of FASTQ files, changing only the fastp version:
fastp --in1 fq1.fq.gz --in2 fq2.fq.gz \
--out1 r1.fq.gz --out2 r2.fq.gz \
--length_required 36 \
--adapter_fasta "illumina.fa" \
--cut_mean_quality 10 \
--cut_window_size 4 \
-5 \
-3 \
--thread 1 \
--average_qual 20 \
--report_title "sample_name-flowcell-lane" \
--json sample_name-flowcell-lane.fastp.json
I will take a look soon
Your FASTQ seems unusual. Can you please upload it here so that I can try to reproduce this?
Thank you for looking into this. For more context, this isn't the only pair of fastqs where I had the issue. Other fastqs failed in the same way, always for reads that were empty. And again, all these worked fine with older versions of fastp. I also checked and the older version seems to be doing it right, just removing empty reads. Some of these fastqs are pretty old (about a decade for some of them), so I wonder if there have been changes in format? The fastqs for the example I posted are in this link. You will need to request access as these are sensitive samples. Let me know if more information is needed. Thanks!
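In the meantime, one possible workaround is to drop the empty reads before running fastp. The following is only a minimal sketch, not something fastp provides: the function names are hypothetical, and it assumes each record still occupies exactly four lines (with blank sequence and quality lines for empty reads) and that both files list mates in the same order.

```python
import gzip
from itertools import zip_longest


def read_records(lines):
    """Yield (header, seq, plus, qual) tuples from an iterable of FASTQ lines.

    Assumes strict four-line records; an empty read is a record whose
    sequence and quality lines are blank.
    """
    it = iter(lines)
    while True:
        record = [next(it, None) for _ in range(4)]
        if record[0] is None:
            return  # clean end of input
        if None in record:
            raise ValueError(f"truncated FASTQ record: {record[0]!r}")
        yield tuple(field.rstrip("\r\n") for field in record)


def drop_empty_pairs(lines1, lines2):
    """Yield (rec1, rec2) pairs in which both mates have a non-empty sequence."""
    for rec1, rec2 in zip_longest(read_records(lines1), read_records(lines2)):
        if rec1 is None or rec2 is None:
            raise ValueError("R1 and R2 contain different numbers of records")
        if rec1[1] and rec2[1]:
            yield rec1, rec2


def filter_files(in1, in2, out1, out2):
    """Write pairs without empty mates to out1/out2; return the number kept."""
    kept = 0
    with gzip.open(in1, "rt") as f1, gzip.open(in2, "rt") as f2, \
            gzip.open(out1, "wt") as o1, gzip.open(out2, "wt") as o2:
        for rec1, rec2 in drop_empty_pairs(f1, f2):
            o1.write("\n".join(rec1) + "\n")
            o2.write("\n".join(rec2) + "\n")
            kept += 1
    return kept
```

For example, `filter_files("fq1.fq.gz", "fq2.fq.gz", "fq1.noempty.fq.gz", "fq2.noempty.fq.gz")` would write cleaned copies (the output names are placeholders) that can then be passed to `--in1` and `--in2`. Whole pairs are dropped so that the two files stay in sync.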
I have requested access.
Or you can upload a piece of data here, so that I can download it and reproduce this issue.
You should have access by now, let me know if there are any issues
Hello, any updates? If this issue is becoming tricky, is there anything known to be wrong with using version 0.20.0? That is the last version that seems to work as expected for us.
Many thanks!
Hello, sorry to come back to this one, is it possible to get an answer as to whether it is "safe" to use older versions? Many thanks!
Just checking back in, were there any updates? Or is it otherwise safe to use the older version? Please let me know if anything else is needed from my side. Many thanks!