
Reads go missing after quality trimming

Open dnieuw opened this issue 3 years ago • 8 comments

I use the following command on my paired reads:

fastp --reads_to_process 1000000 -i data/1_S1_L001_R1_001.fastq.gz -I data/1_S1_L001_R2_001.fastq.gz -o 1_S1_L001_R1_001.fastq.gz -O 1_S1_L001_R2_001.fastq.gz --adapter_sequence GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTAACATCATCTCGTAT --adapter_sequence_r2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAATCGCTGGTGTAGATCT --unpaired1 1_S1_L001_singlets.fastq.gz --unpaired2 1_S1_L001_singlets.fastq.gz --failed_out 1_S1_L001_failed.fastq.gz -l 30 -Q -y --cut_right --cut_right_window_size 5 --cut_right_mean_quality 25 -w 16

This gives a report like this:

Read1 before filtering:
total reads: 1000000
total bases: 300000000
Q20 bases: 265210120(88.4034%)
Q30 bases: 237863991(79.288%)

Read2 before filtering:
total reads: 1000000
total bases: 300000000
Q20 bases: 217867400(72.6225%)
Q30 bases: 186389929(62.13%)

Read1 after filtering:
total reads: 811011
total bases: 166740824
Q20 bases: 166036662(99.5777%)
Q30 bases: 161429716(96.8148%)

Read2 aftering filtering:
total reads: 811011
total bases: 136920093
Q20 bases: 136066202(99.3764%)
Q30 bases: 130934145(95.6281%)

Filtering result:
reads passed filter: 1622022
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 377924
reads failed due to low complexity: 54
reads with adapter trimmed: 763879
bases trimmed due to adapters: 35194909

Duplication rate: 0.12957%

Insert size peak (evaluated by paired-end reads): 300

JSON report: fastp.json
HTML report: fastp.html

However, when I count the reads in the resulting files, they do not sum up to 2x 1 million:

seqkit stats *.gz
file                         format  type  num_seqs      sum_len  min_len  avg_len  max_len
1_S1_L001_failed.fastq.gz    FASTQ   DNA    136,616    8,095,004        0     59.3      300
1_S1_L001_R1_001.fastq.gz    FASTQ   DNA    811,011  166,740,824       30    205.6      300
1_S1_L001_R2_001.fastq.gz    FASTQ   DNA    811,011  136,920,093       30    168.8      281
1_S1_L001_singlets.fastq.gz  FASTQ   DNA    136,616   16,948,652       30    124.1      300

num_seqs sums to 947,627 per read direction (811,011 + 136,616), so 52,373 reads per file have an unknown fate.

Did I miss something in the manual? I'm a bit puzzled by where this 5% of the reads go.
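For what it's worth, here is a minimal sketch (Python, with the output filenames assumed to match the command above) of how one could count the records in each gzipped FASTQ and reconcile the totals against the 2,000,000 input reads:

import gzip

# Output files written by the fastp command above (filenames assumed from the post).
files = [
    "1_S1_L001_R1_001.fastq.gz",
    "1_S1_L001_R2_001.fastq.gz",
    "1_S1_L001_singlets.fastq.gz",
    "1_S1_L001_failed.fastq.gz",
]

total = 0
for path in files:
    with gzip.open(path, "rt") as handle:
        # A FASTQ record is 4 lines, so the read count is the line count // 4.
        n = sum(1 for _ in handle) // 4
    total += n
    print(f"{path}: {n:,} reads")

print(f"total across outputs: {total:,} (input was 2,000,000 reads)")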

dnieuw avatar Sep 29 '20 14:09 dnieuw

I have observed fastp outputting reordered reads, and have also seen singletons and pairs disappear. When I have a chance, I'm going to see if I can track down where things are ending up.

iamh2o avatar Oct 26 '21 16:10 iamh2o

Any progress on this?

sentausa avatar Mar 04 '22 09:03 sentausa

I have also been having this same problem.

amyhouseman avatar Aug 04 '22 15:08 amyhouseman

I thought I had the same issue, but realized that fastp applies a few default filters whose effect on the final read count isn't fully reported to stdout. The total number of filtered reads is written to the fastp.json file, which should contain a section that looks something like this:

"filtering_result": { "passed_filter_reads": 23510872, "low_quality_reads": 403262, "too_many_N_reads": 660, "low_complexity_reads": 11284, "too_short_reads": 93178, "too_long_reads": 0 }

I specifically ran fastp with the --disable_length_filtering flag but noticed that some reads were still counted as too_short. I believe fastp removes reads whose length drops to 0 after trimming is applied. Hope this helps!
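As a quick way to inspect those numbers, here is a rough sketch (assuming a fastp.json report in the working directory with the standard top-level filtering_result section) that prints each category and checks that passed plus failed reads account for the input:

import json

# Load the fastp JSON report (path assumed; adjust to your run).
with open("fastp.json") as handle:
    report = json.load(handle)

result = report["filtering_result"]
for category, count in result.items():
    print(f"{category}: {count:,}")

# Everything that is not passed_filter_reads was removed by one of the filters,
# so passed + all failure categories should equal the number of input reads.
failed = sum(count for key, count in result.items() if key != "passed_filter_reads")
print(f"passed: {result['passed_filter_reads']:,}  failed: {failed:,}  "
      f"total: {result['passed_filter_reads'] + failed:,}")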

WyattEng avatar Dec 09 '22 19:12 WyattEng

I am also getting this problem, but it is very difficult to track down why. Some read files lose paired read headers after fastp, e.g.:

zgrep -c "@A00548" filt-test/134512csiro_R1.fastq.gz 3668110 zgrep -c "@A00548" filt-test/134512csiro_R2.fastq.gz 3668110 zgrep "@A00548:491:H5TT7DRX3:2:2154:7708:19883" filt-test/134512csiro_R1.fastq.gz zgrep "@A00548:491:H5TT7DRX3:2:2154:7708:19883" filt-test/134512csiro_R2.fastq.gz @A00548:491:H5TT7DRX3:2:2154:7708:19883 2:N:0:GGACTTGG+CGTCTGCG

So both files contain the same number of reads, but a header present in R2 has no match in R1. This problem surfaced as an error in downstream analysis.

By the way, that header is present in both input files.

If I run a test with just those two reads, it works fine.

I have just re-run this on an HPC node with much more RAM and the problem has gone away. It was also running multi-threaded across many samples, so I suspect this was a memory issue. I cannot say exactly how it was fixed.

Ignore the above. I have since found that the original read files had mismatched pairs, and, as mentioned, fastp does not check this, so the problem persisted downstream. I used bbtools repair.sh to fix it.
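For anyone who lands here with the same symptom, a small sketch (gzipped FASTQ inputs assumed, Illumina-style headers where the read ID is everything before the first space) that reports the first position at which the R1 and R2 IDs stop matching; this is the kind of mismatch repair.sh then fixes:

import gzip
import itertools

def read_ids(path):
    # Yield the read ID (header up to the first space) of every FASTQ record.
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:  # header lines of the 4-line records
                yield line.rstrip("\n").split(" ")[0]

# Filenames taken from the post; adjust to your data.
r1_ids = read_ids("filt-test/134512csiro_R1.fastq.gz")
r2_ids = read_ids("filt-test/134512csiro_R2.fastq.gz")

for position, (a, b) in enumerate(itertools.zip_longest(r1_ids, r2_ids), start=1):
    if a != b:
        print(f"first mismatch at record {position}: {a} vs {b}")
        break
else:
    print("all read IDs match pairwise")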