fastp icon indicating copy to clipboard operation
fastp copied to clipboard

global trimming trims too much when combined with quality pruning for SE data

Open eaton-lab opened this issue 1 year ago • 1 comments

Here is a simple example to demonstrate. The sequences are high quality and so will not be pruned for quality, but I wish to trim the first 6 bases. The trimming works correctly when only global trimming is on (-f 6; six bases are removed), and it works correctly when only quality pruning is on (-5 -3; no bases are removed), but it is incorrect when both are turned on (-f 6 -5 -3; nine bases are removed).


cat <<EOT > /tmp/test.fastq
@NB502016:186:H2HNLBGXN:1:11101:4872:1081 1:N:0:0
AACTGGTGCAGTATAATCTATTTTAAATACCTCCATCCTAAATAGAGGAACCCTATGGTCTAACCCCCATTTAATTGCATTACGAATGTCAA
+
AAAAAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEAE/EEAEEEEEEEEE<EAEEEEEEEEEA<E
@NB502016:186:H2HNLBGXN:1:11101:20414:1089 1:N:0:0
AACTGGTGCAGGAAATTGTTTGGCCGCAGGAAGAAGAATAAGAATCGTCCGAATTTGTCTTCCAATAATTCTGACTTGCTTTGTTCTGGTTC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB502016:186:H2HNLBGXN:1:11101:21961:1096 1:N:0:0
AACTGGTGCAGCTGAGTAACACCCGCACAAACCAACACCATAGGTATTAGGTAGGACCAACTGCTAGCACATTGGCCGAACTAATCCAAATC
+
AAAAAEEAEEEEEEEEEEAEAEEE<EEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEAEEEEEEE<<EEEEEEEAEEEEEE/
@NB502016:186:H2HNLBGXN:1:11101:4227:1163 1:N:0:0
AACTGGTGCAGCCCGTAAGCCGACACCACCGCGTCTAAAAATCCGTAACCATTAGACAGCCCCGCGGACTTAACGACTTTTCCACAAACGGC
+
AAAAAEEE//EEEEEEEAEAEEEAEEEEEEEEEEEEEEEE/EEEEEA/AE/EE6E6EAEEEE6E/EE/EEE//6EEEAEEE/E/EE66E<EA
EOT

# THIS WORKS, 6 bases are trimmed from front so that all new reads start with 'TGCAG'
$fastp -i /tmp/test.fastq -o /tmp/test.trimmed.fastq -h /tmp/test.html -j /tmp/test.json -f 6
cat /tmp/test.fastq
echo "================================================================================================="
cat /tmp/test.trimmed.fastq

# THIS DOESN'T WORK, for some reason now 9 bases are trimmed!
$fastp -i /tmp/test.fastq -o /tmp/test.trimmed.fastq -h /tmp/test.html -j /tmp/test.json -f 6 -5 -3
cat /tmp/test.fastq
echo "================================================================================================="
cat /tmp/test.trimmed.fastq

# this shows that quality pruning on its own is not the problem.
$fastp -i /tmp/test.fastq -o /tmp/test.trimmed.fastq -h /tmp/test.html -j /tmp/test.json -5 -3
cat /tmp/test.fastq
echo "================================================================================================="
cat /tmp/test.trimmed.fastq

RESULT

Detecting adapter sequence for read1...
No adapter detected for read1

Read1 before filtering:
total reads: 4
total bases: 368
Q20 bases: 355(96.4674%)
Q30 bases: 342(92.9348%)

Read1 after filtering:
total reads: 4
total bases: 344
Q20 bases: 331(96.2209%)
Q30 bases: 318(92.4419%)

Filtering result:
reads passed filter: 4
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

Duplication rate (may be overestimated since this is SE data): 0%

JSON report: /tmp/test.json
HTML report: /tmp/test.html

/home/deren/mambaforge-pypy3/envs/ipyrad1/bin/fastp -i /tmp/test.fastq -o /tmp/test.trimmed.fastq -h /tmp/test.html -j /tmp/test.json -f 6 
fastp v0.23.2, time used: 0 seconds
@NB502016:186:H2HNLBGXN:1:11101:4872:1081 1:N:0:0
AACTGGTGCAGTATAATCTATTTTAAATACCTCCATCCTAAATAGAGGAACCCTATGGTCTAACCCCCATTTAATTGCATTACGAATGTCAA
+
AAAAAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEAE/EEAEEEEEEEEE<EAEEEEEEEEEA<E
@NB502016:186:H2HNLBGXN:1:11101:20414:1089 1:N:0:0
AACTGGTGCAGGAAATTGTTTGGCCGCAGGAAGAAGAATAAGAATCGTCCGAATTTGTCTTCCAATAATTCTGACTTGCTTTGTTCTGGTTC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB502016:186:H2HNLBGXN:1:11101:21961:1096 1:N:0:0
AACTGGTGCAGCTGAGTAACACCCGCACAAACCAACACCATAGGTATTAGGTAGGACCAACTGCTAGCACATTGGCCGAACTAATCCAAATC
+
AAAAAEEAEEEEEEEEEEAEAEEE<EEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEAEEEEEEE<<EEEEEEEAEEEEEE/
@NB502016:186:H2HNLBGXN:1:11101:4227:1163 1:N:0:0
AACTGGTGCAGCCCGTAAGCCGACACCACCGCGTCTAAAAATCCGTAACCATTAGACAGCCCCGCGGACTTAACGACTTTTCCACAAACGGC
+
AAAAAEEE//EEEEEEEAEAEEEAEEEEEEEEEEEEEEEE/EEEEEA/AE/EE6E6EAEEEE6E/EE/EEE//6EEEAEEE/E/EE66E<EA
=================================================================================================
@NB502016:186:H2HNLBGXN:1:11101:4872:1081 1:N:0:0
TGCAGTATAATCTATTTTAAATACCTCCATCCTAAATAGAGGAACCCTATGGTCTAACCCCCATTTAATTGCATTACGAATGTCAA
+
AEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEAE/EEAEEEEEEEEE<EAEEEEEEEEEA<E
@NB502016:186:H2HNLBGXN:1:11101:20414:1089 1:N:0:0
TGCAGGAAATTGTTTGGCCGCAGGAAGAAGAATAAGAATCGTCCGAATTTGTCTTCCAATAATTCTGACTTGCTTTGTTCTGGTTC
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB502016:186:H2HNLBGXN:1:11101:21961:1096 1:N:0:0
TGCAGCTGAGTAACACCCGCACAAACCAACACCATAGGTATTAGGTAGGACCAACTGCTAGCACATTGGCCGAACTAATCCAAATC
+
EAEEEEEEEEEEAEAEEE<EEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEAEEEEEEE<<EEEEEEEAEEEEEE/
@NB502016:186:H2HNLBGXN:1:11101:4227:1163 1:N:0:0
TGCAGCCCGTAAGCCGACACCACCGCGTCTAAAAATCCGTAACCATTAGACAGCCCCGCGGACTTAACGACTTTTCCACAAACGGC
+
EE//EEEEEEEAEAEEEAEEEEEEEEEEEEEEEE/EEEEEA/AE/EE6E6EAEEEE6E/EE/EEE//6EEEAEEE/E/EE66E<EA
Detecting adapter sequence for read1...
No adapter detected for read1

Read1 before filtering:
total reads: 4
total bases: 368
Q20 bases: 355(96.4674%)
Q30 bases: 342(92.9348%)

Read1 after filtering:
total reads: 4
total bases: 332
Q20 bases: 320(96.3855%)
Q30 bases: 307(92.4699%)

Filtering result:
reads passed filter: 4
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

Duplication rate (may be overestimated since this is SE data): 0%

JSON report: /tmp/test.json
HTML report: /tmp/test.html

/home/deren/mambaforge-pypy3/envs/ipyrad1/bin/fastp -i /tmp/test.fastq -o /tmp/test.trimmed.fastq -h /tmp/test.html -j /tmp/test.json -f 6 -5 -3 
fastp v0.23.2, time used: 1 seconds
@NB502016:186:H2HNLBGXN:1:11101:4872:1081 1:N:0:0
AACTGGTGCAGTATAATCTATTTTAAATACCTCCATCCTAAATAGAGGAACCCTATGGTCTAACCCCCATTTAATTGCATTACGAATGTCAA
+
AAAAAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEAE/EEAEEEEEEEEE<EAEEEEEEEEEA<E
@NB502016:186:H2HNLBGXN:1:11101:20414:1089 1:N:0:0
AACTGGTGCAGGAAATTGTTTGGCCGCAGGAAGAAGAATAAGAATCGTCCGAATTTGTCTTCCAATAATTCTGACTTGCTTTGTTCTGGTTC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB502016:186:H2HNLBGXN:1:11101:21961:1096 1:N:0:0
AACTGGTGCAGCTGAGTAACACCCGCACAAACCAACACCATAGGTATTAGGTAGGACCAACTGCTAGCACATTGGCCGAACTAATCCAAATC
+
AAAAAEEAEEEEEEEEEEAEAEEE<EEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEAEEEEEEE<<EEEEEEEAEEEEEE/
@NB502016:186:H2HNLBGXN:1:11101:4227:1163 1:N:0:0
AACTGGTGCAGCCCGTAAGCCGACACCACCGCGTCTAAAAATCCGTAACCATTAGACAGCCCCGCGGACTTAACGACTTTTCCACAAACGGC
+
AAAAAEEE//EEEEEEEAEAEEEAEEEEEEEEEEEEEEEE/EEEEEA/AE/EE6E6EAEEEE6E/EE/EEE//6EEEAEEE/E/EE66E<EA
=================================================================================================
@NB502016:186:H2HNLBGXN:1:11101:4872:1081 1:N:0:0
AGTATAATCTATTTTAAATACCTCCATCCTAAATAGAGGAACCCTATGGTCTAACCCCCATTTAATTGCATTACGAATGTCAA
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEAE/EEAEEEEEEEEE<EAEEEEEEEEEA<E
@NB502016:186:H2HNLBGXN:1:11101:20414:1089 1:N:0:0
AGGAAATTGTTTGGCCGCAGGAAGAAGAATAAGAATCGTCCGAATTTGTCTTCCAATAATTCTGACTTGCTTTGTTCTGGTTC
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB502016:186:H2HNLBGXN:1:11101:21961:1096 1:N:0:0
AGCTGAGTAACACCCGCACAAACCAACACCATAGGTATTAGGTAGGACCAACTGCTAGCACATTGGCCGAACTAATCCAAATC
+
EEEEEEEEEAEAEEE<EEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEAEEEEEEE<<EEEEEEEAEEEEEE/
@NB502016:186:H2HNLBGXN:1:11101:4227:1163 1:N:0:0
AGCCCGTAAGCCGACACCACCGCGTCTAAAAATCCGTAACCATTAGACAGCCCCGCGGACTTAACGACTTTTCCACAAACGGC
+
/EEEEEEEAEAEEEAEEEEEEEEEEEEEEEE/EEEEEA/AE/EE6E6EAEEEE6E/EE/EEE//6EEEAEEE/E/EE66E<EA
Detecting adapter sequence for read1...
No adapter detected for read1

Read1 before filtering:
total reads: 4
total bases: 368
Q20 bases: 355(96.4674%)
Q30 bases: 342(92.9348%)

Read1 after filtering:
total reads: 4
total bases: 368
Q20 bases: 355(96.4674%)
Q30 bases: 342(92.9348%)

Filtering result:
reads passed filter: 4
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

Duplication rate (may be overestimated since this is SE data): 0%

JSON report: /tmp/test.json
HTML report: /tmp/test.html

/home/deren/mambaforge-pypy3/envs/ipyrad1/bin/fastp -i /tmp/test.fastq -o /tmp/test.trimmed.fastq -h /tmp/test.html -j /tmp/test.json -5 -3 
fastp v0.23.2, time used: 0 seconds
@NB502016:186:H2HNLBGXN:1:11101:4872:1081 1:N:0:0
AACTGGTGCAGTATAATCTATTTTAAATACCTCCATCCTAAATAGAGGAACCCTATGGTCTAACCCCCATTTAATTGCATTACGAATGTCAA
+
AAAAAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEAE/EEAEEEEEEEEE<EAEEEEEEEEEA<E
@NB502016:186:H2HNLBGXN:1:11101:20414:1089 1:N:0:0
AACTGGTGCAGGAAATTGTTTGGCCGCAGGAAGAAGAATAAGAATCGTCCGAATTTGTCTTCCAATAATTCTGACTTGCTTTGTTCTGGTTC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB502016:186:H2HNLBGXN:1:11101:21961:1096 1:N:0:0
AACTGGTGCAGCTGAGTAACACCCGCACAAACCAACACCATAGGTATTAGGTAGGACCAACTGCTAGCACATTGGCCGAACTAATCCAAATC
+
AAAAAEEAEEEEEEEEEEAEAEEE<EEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEAEEEEEEE<<EEEEEEEAEEEEEE/
@NB502016:186:H2HNLBGXN:1:11101:4227:1163 1:N:0:0
AACTGGTGCAGCCCGTAAGCCGACACCACCGCGTCTAAAAATCCGTAACCATTAGACAGCCCCGCGGACTTAACGACTTTTCCACAAACGGC
+
AAAAAEEE//EEEEEEEAEAEEEAEEEEEEEEEEEEEEEE/EEEEEA/AE/EE6E6EAEEEE6E/EE/EEE//6EEEAEEE/E/EE66E<EA
=================================================================================================
@NB502016:186:H2HNLBGXN:1:11101:4872:1081 1:N:0:0
AACTGGTGCAGTATAATCTATTTTAAATACCTCCATCCTAAATAGAGGAACCCTATGGTCTAACCCCCATTTAATTGCATTACGAATGTCAA
+
AAAAAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEAE/EEAEEEEEEEEE<EAEEEEEEEEEA<E
@NB502016:186:H2HNLBGXN:1:11101:20414:1089 1:N:0:0
AACTGGTGCAGGAAATTGTTTGGCCGCAGGAAGAAGAATAAGAATCGTCCGAATTTGTCTTCCAATAATTCTGACTTGCTTTGTTCTGGTTC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@NB502016:186:H2HNLBGXN:1:11101:21961:1096 1:N:0:0
AACTGGTGCAGCTGAGTAACACCCGCACAAACCAACACCATAGGTATTAGGTAGGACCAACTGCTAGCACATTGGCCGAACTAATCCAAATC
+
AAAAAEEAEEEEEEEEEEAEAEEE<EEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEAEEEEEEE<<EEEEEEEAEEEEEE/
@NB502016:186:H2HNLBGXN:1:11101:4227:1163 1:N:0:0
AACTGGTGCAGCCCGTAAGCCGACACCACCGCGTCTAAAAATCCGTAACCATTAGACAGCCCCGCGGACTTAACGACTTTTCCACAAACGGC
+
AAAAAEEE//EEEEEEEAEAEEEAEEEEEEEEEEEEEEEE/EEEEEA/AE/EE6E6EAEEEE6E/EE/EEE//6EEEAEEE/E/EE66E<EA

eaton-lab avatar Mar 21 '23 17:03 eaton-lab

bump.

dereneaton avatar Jul 28 '23 16:07 dereneaton