fastp
Reads go missing after quality trimming
I use the following command on my paired reads:
fastp --reads_to_process 1000000 \
  -i data/1_S1_L001_R1_001.fastq.gz -I data/1_S1_L001_R2_001.fastq.gz \
  -o 1_S1_L001_R1_001.fastq.gz -O 1_S1_L001_R2_001.fastq.gz \
  --adapter_sequence GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTAACATCATCTCGTAT \
  --adapter_sequence_r2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAATCGCTGGTGTAGATCT \
  --unpaired1 1_S1_L001_singlets.fastq.gz --unpaired2 1_S1_L001_singlets.fastq.gz \
  --failed_out 1_S1_L001_failed.fastq.gz \
  -l 30 -Q -y --cut_right --cut_right_window_size 5 --cut_right_mean_quality 25 -w 16
This gives a report like this:
Read1 before filtering:
total reads: 1000000
total bases: 300000000
Q20 bases: 265210120(88.4034%)
Q30 bases: 237863991(79.288%)

Read2 before filtering:
total reads: 1000000
total bases: 300000000
Q20 bases: 217867400(72.6225%)
Q30 bases: 186389929(62.13%)

Read1 after filtering:
total reads: 811011
total bases: 166740824
Q20 bases: 166036662(99.5777%)
Q30 bases: 161429716(96.8148%)

Read2 after filtering:
total reads: 811011
total bases: 136920093
Q20 bases: 136066202(99.3764%)
Q30 bases: 130934145(95.6281%)

Filtering result:
reads passed filter: 1622022
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 377924
reads failed due to low complexity: 54
reads with adapter trimmed: 763879
bases trimmed due to adapters: 35194909

Duplication rate: 0.12957%

Insert size peak (evaluated by paired-end reads): 300

JSON report: fastp.json
HTML report: fastp.html
However, when I count the reads in the resulting files, they do not sum up to 2x 1 million:
seqkit stats *.gz
file                         format  type  num_seqs      sum_len  min_len  avg_len  max_len
1_S1_L001_failed.fastq.gz    FASTQ   DNA    136,616    8,095,004        0     59.3      300
1_S1_L001_R1_001.fastq.gz    FASTQ   DNA    811,011  166,740,824       30    205.6      300
1_S1_L001_R2_001.fastq.gz    FASTQ   DNA    811,011  136,920,093       30    168.8      281
1_S1_L001_singlets.fastq.gz  FASTQ   DNA    136,616   16,948,652       30    124.1      300
num_seqs sums up to 947627 per sample, so 52373 reads per file have an unknown fate.
Did I miss something in the manual? I'm a bit puzzled by where this 5% of the reads go.
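To make the shortfall concrete, here is the bookkeeping as plain arithmetic, using only the counts reported above (note that --unpaired1 and --unpaired2 point at the same singlets file, and --failed_out collects failed reads from both mates):

```python
# Read-fate bookkeeping for the fastp run above (all counts from the post).
total_in = 2 * 1_000_000   # R1 + R2 input reads (--reads_to_process 1000000)
passed   = 2 * 811_011     # surviving pairs, counting both mates
singlets = 136_616         # 1_S1_L001_singlets.fastq.gz (unpaired1 + unpaired2)
failed   = 136_616         # 1_S1_L001_failed.fastq.gz

accounted = passed + singlets + failed
missing = total_in - accounted
print(f"accounted for: {accounted:,}")   # 1,895,254
print(f"missing:       {missing:,}")     # 104,746 = 2 x 52,373
```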
I have observed fastp outputting reordered reads, and also singletons and pairs disappearing. When I have a chance, I'm going to see if I can track down where things are ending up.
Any progress on this?
I have also been having this same problem
I thought I had the same issue, but realized that fastp has a few default filters that can change the final read count in ways that aren't reported to stdout. The total number of filtered reads is written to the fastp.json file, and there should be a section in that file that looks something like this:
"filtering_result": { "passed_filter_reads": 23510872, "low_quality_reads": 403262, "too_many_N_reads": 660, "low_complexity_reads": 11284, "too_short_reads": 93178, "too_long_reads": 0 }
I specifically ran fastp with the --disable_length_filtering flag but noticed that some reads were still filtered as too_short. I believe fastp removes reads that have a read length of 0 after the trimming filters are applied, even with length filtering disabled. Hope this helps!
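Those per-category counts can be summed directly from the JSON report to reconcile against the input read count. A minimal sketch, hard-coding the filtering_result values shown above (with real data you would instead do json.load(open("fastp.json"))["filtering_result"]):

```python
# "filtering_result" values copied from the fastp.json snippet above;
# every input read should land in exactly one of these categories.
filtering_result = {
    "passed_filter_reads": 23510872,
    "low_quality_reads": 403262,
    "too_many_N_reads": 660,
    "low_complexity_reads": 11284,
    "too_short_reads": 93178,
    "too_long_reads": 0,
}

total = sum(filtering_result.values())
print(f"reads accounted for in filtering_result: {total:,}")
```

Comparing that total against the number of input reads shows whether any reads fell through without being counted.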
I am also getting this problem - but it is very difficult to track down why. Some read files lose paired read headers after fastp, e.g.
zgrep -c "@A00548" filt-test/134512csiro_R1.fastq.gz
3668110
zgrep -c "@A00548" filt-test/134512csiro_R2.fastq.gz
3668110
zgrep "@A00548:491:H5TT7DRX3:2:2154:7708:19883" filt-test/134512csiro_R1.fastq.gz
So both files are the same length, but a header does not match between R1 and R2. This problem came up through an error in downstream analysis.
btw both headers are present in the input files.
If I try a test with just those two reads, it works fine.
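A desync like this can be located without downstream tools by walking both files in parallel and reporting the first record whose IDs differ. A sketch of that check (the demo filenames and reads below are made up so the example runs end to end; point it at your own R1/R2 files instead):

```python
import gzip
from itertools import zip_longest

def fastq_ids(path):
    """Yield the read ID (first header token, minus any trailing /1 or /2)."""
    with gzip.open(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:  # FASTQ records are 4 lines; line 0 is the header
                rid = line.split()[0]
                yield rid[:-2] if rid.endswith(("/1", "/2")) else rid

def first_mismatch(r1_path, r2_path):
    """Return (record_number, id1, id2) of the first desynced pair, or None."""
    pairs = zip_longest(fastq_ids(r1_path), fastq_ids(r2_path))
    for n, (a, b) in enumerate(pairs, start=1):
        if a != b:
            return n, a, b
    return None

# Tiny hypothetical demo files with a deliberate mismatch at record 2.
for name, ids in [("demo_R1.fastq.gz", ["@r1", "@r2"]),
                  ("demo_R2.fastq.gz", ["@r1", "@r3"])]:
    with gzip.open(name, "wt") as fh:
        for rid in ids:
            fh.write(f"{rid}\nACGT\n+\nIIII\n")

print(first_mismatch("demo_R1.fastq.gz", "demo_R2.fastq.gz"))
# -> (2, '@r2', '@r3')
```

This streams both files, so it works on full-size gzipped FASTQ without loading them into memory.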
I have just re-run this on an HPC with much larger RAM and the problem has gone away. It was also running multi-threaded on many samples, so I suspect this was a memory issue. I cannot say exactly how it was fixed.
Ignore the above. I have since found that the original read files had mismatched pairs, and, as mentioned, fastp does not check for this, so the problem persisted downstream. I used BBTools' repair.sh to fix it.