Fastp 1.0.1 throwing reads away when encountering empty read

Open MartinezRuiz-Carlos opened this issue 9 months ago • 8 comments

Hello, this issue is related to #534 and #560, but it is not quite the same, so I thought it was worth raising separately. I have several FASTQ files that contain empty sequences, as described in #560, e.g.

@K00371:221:H2NKWBBXY:6:1104:25824:8260 2:N:0:AGTACAAG
AGGCCAACAGGTAGGTCTCTGAAAAATGAAGAACAGATATTCATAAGCTATAATGAAATAATTCAAACTTATTTCATTACCTCCCTTGAATACAGACTA
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJ
@K00371:221:H2NKWBBXY:6:1104:25986:8260 2:N:0:AGTACAAG

+

@K00371:221:H2NKWBBXY:6:1104:26250:8260 2:N:0:AGTACAAG
ATTTAGTATAATAAACATTACCAAATCTTTCTTTCCTAAGGCACCATTCTGATTTATAGGTCAGGCTGCCTGACTCTAAGGAAATAACTGGTAAGGATAC
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJ
@K00371:221:H2NKWBBXY:6:1104:26880:8260 2:N:0:AGTACAAG

Running different versions of fastp on this file gives different outcomes. With version 0.24.1 I get an error:

@K00371:221:H2NKWBBXY:6:1104:25986:8260 2:N:0:AGTACAAG
Expected '+', got 
ERROR: '+' expected

With the latest version 1.0.1 I get a warning instead, but the run finishes:

@K00371:221:H2NKWBBXY:6:1104:25986:8260 2:N:0:AGTACAAG
Expected '+', got 
Your FASTQ may be invalid, please check the tail of your FASTQ file

WARNNIG: different read numbers of the 3435 pack
Read1 pack size: 256
Read2 pack size: 230
Ignore the unmatched reads

Read1 before filtering:
total reads: 879590
total bases: 87538237
Q20 bases: 86810309(99.1684%)
Q30 bases: 85308802(97.4532%)
Q40 bases: 75170632(85.8718%)

Read2 before filtering:
total reads: 879590
total bases: 87526817
Q20 bases: 86546608(98.8801%)
Q30 bases: 84989839(97.1015%)
Q40 bases: 74988693(85.6751%)

Read1 after filtering:
total reads: 876494
total bases: 87081659
Q20 bases: 86404863(99.2228%)
Q30 bases: 84958415(97.5618%)
Q40 bases: 74928827(86.0443%)

Read2 after filtering:
total reads: 876494
total bases: 87054021
Q20 bases: 86214331(99.0354%)
Q30 bases: 84730416(97.3308%)
Q40 bases: 74811097(85.9364%)

Filtering result:
reads passed filter: 1752988
reads failed due to low quality: 4938
reads failed due to too many N: 820
reads failed due to too short: 434
reads with adapter trimmed: 1854
bases trimmed due to adapters: 73802

Duplication rate: 7.83626%

Insert size peak (evaluated by paired-end reads): 163

But with the older version 0.20.0 it runs through with no issues:

Read1 before filtering:
total reads: 32508342
total bases: 3235377165
Q20 bases: 3201955685(98.967%)
Q30 bases: 3136726311(96.9509%)

Read2 before filtering:
total reads: 32508342
total bases: 3235090844
Q20 bases: 3175582248(98.1605%)
Q30 bases: 3085760868(95.3841%)

Read1 after filtering:
total reads: 32326557
total bases: 3213262362
Q20 bases: 3182580204(99.0451%)
Q30 bases: 3120584889(97.1158%)

Read2 aftering filtering:
total reads: 32326557
total bases: 3212818050
Q20 bases: 3162074455(98.4206%)
Q30 bases: 3076691989(95.763%)

Filtering result:
reads passed filter: 64653114
reads failed due to low quality: 343946
reads failed due to too many N: 5236
reads failed due to too short: 14388
reads with adapter trimmed: 826473
bases trimmed due to adapters: 9257428

Duplication rate: 11.3393%

Insert size peak (evaluated by paired-end reads): 161

Now, here's the worrying part: with version 0.20.0, the oldest, I get ~32M reads per file after filtering, whereas with the latest version 1.0.1 I get only ~876K. The "before filtering" totals differ in the same way (32,508,342 reads with 0.20.0 versus 879,590 with 1.0.1), which suggests that fastp 1.0.1 simply stops at the empty read rather than ignoring it, and throws away most of the reads. Previous versions were seemingly able to deal with this appropriately.
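One way to check this independently of fastp is to count the records in each input file and locate the first empty one. The numbers in the 1.0.1 log are at least consistent with an early stop: 3435 packs of 256 reads plus the 230 reads of the short pack give 879,590, the reported "before filtering" total, assuming packs are numbered from zero and hold 256 reads each. Below is a minimal Python sketch of such a count. It assumes strict 4-line FASTQ records like the ones shown above; `open_text`, `read_fastq` and `summarize_fastq` are hypothetical helper names, not part of fastp.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed file for reading as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples.

    Assumption: every record is exactly 4 lines, including records whose
    sequence and quality lines are empty.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header.strip()!r}")
        yield header.rstrip("\r\n"), seq.rstrip("\r\n"), qual.rstrip("\r\n")


def summarize_fastq(path):
    """Return (total_records, empty_records, first_empty).

    first_empty is (1-based record index, header) or None.
    """
    total = 0
    empty = 0
    first_empty = None
    with open_text(path) as handle:
        for header, seq, _qual in read_fastq(handle):
            total += 1
            if not seq:
                empty += 1
                if first_empty is None:
                    first_empty = (total, header)
    return total, empty, first_empty
```

Calling `summarize_fastq("fq2.fq.gz")` would return the total record count, the number of empty records, and the position of the first empty one. If that position falls close to the 879,590 reads that 1.0.1 reports before filtering, it would support the idea that parsing stops there.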

Here is the fastp command I ran in every case, on the same FASTQ pair, changing only the fastp version:

fastp --in1 fq1.fq.gz --in2 fq2.fq.gz \
      --out1 r1.fq.gz --out2 r2.fq.gz \
      --length_required 36 \
      --adapter_fasta "illumina.fa" \
      --cut_mean_quality 10 \
      --cut_window_size 4 \
      -5 \
      -3 \
      --thread 1 \
      --average_qual 20 \
      --report_title "sample_name-flowcell-lane" \
      --json sample_name-flowcell-lane.fastp.json

MartinezRuiz-Carlos avatar Jun 24 '25 12:06 MartinezRuiz-Carlos

I will take a look soon

sfchen avatar Jun 26 '25 22:06 sfchen

It seems that your FASTQ is unusual. Can you please upload it here so that I can try it?

sfchen avatar Jun 26 '25 22:06 sfchen

Thank you for looking into this. For more context, this isn't the only pair of fastqs where I had the issue. Other fastqs failed in the same way, always for reads that were empty. And again, all these worked fine with older versions of fastp. I also checked and the older version seems to be doing it right, just removing empty reads. Some of these fastqs are pretty old (about a decade for some of them), so I wonder if there have been changes in format? The fastqs for the example I posted are in this link. You will need to request access as these are sensitive samples. Let me know if more information is needed. Thanks!
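As a possible stopgap, the empty reads could be removed before the files reach fastp. The sketch below drops every pair in which either mate has an empty sequence, which keeps R1 and R2 in sync. It assumes strict 4-line records and identical record order in both files. Whether this reproduces what fastp 0.20.0 does internally is not confirmed here, and `open_text`, `read_record` and `drop_empty_pairs` are hypothetical helper names.

```python
import gzip


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed file in text mode."""
    return gzip.open(path, mode) if path.endswith(".gz") else open(path, mode)


def read_record(handle):
    """Read one 4-line FASTQ record; return None at end of file."""
    lines = [handle.readline() for _ in range(4)]
    if not lines[0]:
        return None
    if not lines[0].startswith("@") or not lines[2].startswith("+"):
        raise ValueError(f"malformed record near {lines[0].strip()!r}")
    return lines


def drop_empty_pairs(in1, in2, out1, out2):
    """Copy read pairs to out1/out2, skipping pairs with an empty mate.

    Returns (pairs_kept, pairs_dropped).
    """
    kept = dropped = 0
    with open_text(in1) as r1, open_text(in2) as r2, \
            open_text(out1, "wt") as w1, open_text(out2, "wt") as w2:
        while True:
            rec1, rec2 = read_record(r1), read_record(r2)
            if rec1 is None and rec2 is None:
                break
            if rec1 is None or rec2 is None:
                raise ValueError("input files hold different numbers of records")
            # Line 2 of each record is the sequence; empty means no bases.
            if not rec1[1].strip() or not rec2[1].strip():
                dropped += 1
                continue
            w1.writelines(rec1)
            w2.writelines(rec2)
            kept += 1
    return kept, dropped
```

For example, `drop_empty_pairs("fq1.fq.gz", "fq2.fq.gz", "fq1.clean.fq.gz", "fq2.clean.fq.gz")` would write cleaned copies (the output names are placeholders) that can then be passed to `--in1` and `--in2`. Dropping whole pairs rather than single reads is a deliberate choice: paired-end tools expect both files to list the same reads in the same order.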

MartinezRuiz-Carlos avatar Jun 30 '25 09:06 MartinezRuiz-Carlos

I have requested access.

Alternatively, you can upload a small piece of the data here so that I can download it and reproduce the issue.

sfchen avatar Jun 30 '25 23:06 sfchen

You should have access by now, let me know if there are any issues

MartinezRuiz-Carlos avatar Jul 03 '25 11:07 MartinezRuiz-Carlos

Hello, any updates? If this issue is becoming tricky, is there anything known to be wrong with using version 0.20.0? That is the last version that seems to work as expected for us. Many thanks!

MartinezRuiz-Carlos avatar Jul 15 '25 08:07 MartinezRuiz-Carlos

Hello, sorry to come back to this one, is it possible to get an answer as to whether it is "safe" to use older versions? Many thanks!

MartinezRuiz-Carlos avatar Aug 12 '25 20:08 MartinezRuiz-Carlos

Just checking back in, were there any updates? Or is it otherwise safe to use the older version? Please let me know if anything else is needed from my side. Many thanks!

MartinezRuiz-Carlos avatar Oct 21 '25 15:10 MartinezRuiz-Carlos