fastp
Incorrect adapter detected?
Sometimes fastp detects what could be a real genomic repetitive sequence as an adapter, for example:
Detected read1 adapter: | AACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACC
Has anyone else seen this behaviour?
My command was:
fastp -w 16 --dont_overwrite -Q -z 1 --in1 SB12_R1.fq.gz --in2 SB12_R2.fq.gz --out1 ../01_adapters_removed/tmp_fastp/SB12_R1.fq.gz --out2 ../01_adapters_removed/tmp_fastp/SB12_R2.fq.gz --detect_adapter_for_pe -l 21 --json ../01_adapters_removed/SB12.fastp.adapters.json --html ../01_adapters_removed/SB12.fastp.adapters.html
Searching the literature, AAACCCT seems to be a common motif in telomeric repeats in plants:
https://www.nature.com/articles/nature15714/tables/1
How can I avoid this false detection?
Edgardo
Thanks for your info.
Which version did you use? Can you upload a file with the first 100K reads here, so I can reproduce the problem?
You can remove --detect_adapter_for_pe as a workaround. Most adapters will still be trimmed by overlap detection.
I am using v0.20.0. The telomeric sequence is not detected as an adapter when I remove --detect_adapter_for_pe from the command. I only included that option because I read in your manual that it makes detection more sensitive.
I have also attached the first 100k reads. Thanks for the help.
100k_R1.fq.gz
100k_R2.fq.gz
Thanks, I will try to reproduce this issue when I get a chance.
Hi, @sfchen
I have the same issue as @edgardomortiz. Has there been any progress or a solution?
Here is the command I use:
fastp -w 6 -6 -i s1.R1.fastq.gz -I s1.R2.fastq.gz --detect_adapter_for_pe --length_required 45 -o s1.Clean.R1.fastq.gz -O s1.Clean.R2.fastq.gz --json s1.fastp.json --html s1.fastp.html
fastp v0.20.0, time used: 397 seconds
Detecting adapter sequence for read1...
CTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA
Detecting adapter sequence for read2...
CTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA
Hi, @sfchen
I have the same issue as @edgardomortiz and @baozg.
Here's the command that I use:
fastp --detect_adapter_for_pe \
--unqualified_percent_limit 50 \
--cut_right --cut_right_window_size 4 --cut_right_mean_quality 20 \
--correction \
--in1 SRR10260015_1.fastq.gz \
--in2 SRR10260015_2.fastq.gz \
--out1 SRR10260015_1_trimmed.fastq.gz \
--out2 SRR10260015_2_trimmed.fastq.gz \
--unpaired1 SRR10260015_1_passed.fastq.gz \
--unpaired2 SRR10260015_2_passed.fastq.gz \
--failed_out fail_out.fastq.gz \
--thread 4 \
2> fastp_log
And here is the stderr output from adapter detection:
Detecting adapter sequence for read1...
CTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGT
Detecting adapter sequence for read2...
CAGACAGACAGACAGACAGACAGACAGACAGACAGACAGACAGACAGACAGACAGACAGA
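For anyone checking many samples, the detected sequences can be pulled out of the saved stderr log rather than read by eye. The following is only a sketch: it assumes the log has the two-line shape shown above (a "Detecting adapter sequence for readN..." line followed by a line of bases), and the function name `detected_adapters` is made up for this example, not part of fastp.

```python
import re

# Assumed log shape, taken from the stderr excerpts in this thread.
DETECT_LINE = re.compile(r"^Detecting adapter sequence for (read[12])\.\.\.$")
BASES = re.compile(r"^[ACGTN]+$")


def detected_adapters(log_text: str) -> dict:
    """Map 'read1'/'read2' to the sequence printed on the following line."""
    found = {}
    pending = None  # read label whose sequence we expect on the next line
    for raw in log_text.splitlines():
        line = raw.strip()
        match = DETECT_LINE.match(line)
        if match:
            pending = match.group(1)
            continue
        # Only accept a pure base string directly after a "Detecting" line.
        if pending and BASES.match(line):
            found[pending] = line
        pending = None
    return found


if __name__ == "__main__":
    sample_log = "\n".join([
        "Detecting adapter sequence for read1...",
        "CTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGT",
        "Detecting adapter sequence for read2...",
        "CAGACAGACAGACAGACAGACAGACAGACAGACAGACAGACAGACAGACAGACAGACAGA",
    ])
    for read, seq in detected_adapters(sample_log).items():
        print(read, seq)
```

Lines that follow a "Detecting" line but are not a plain base string are ignored, so a read with no reported sequence is simply absent from the result.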
As mentioned above, removing --detect_adapter_for_pe still allows adapter-contaminated reads to be trimmed by overlap analysis, while avoiding the mis-detection of repeated sequence as an adapter.
I had the same issue in version 0.23.2: in 4 of my 32 datasets (bird WGS), fastp detects the telomeric repeat (CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC) as an adapter when I use --detect_adapter_for_pe.
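All of the sequences reported in this thread are a short unit repeated end to end, which suggests a quick sanity check before trusting an auto-detected adapter. The sketch below is not how fastp works internally; it is a simple periodicity heuristic, and the names `smallest_period` and `looks_like_tandem_repeat` as well as the thresholds (`max_unit=10`, `min_copies=4`) are arbitrary choices for illustration.

```python
def smallest_period(seq: str) -> int:
    """Return the smallest p such that seq[i] == seq[i + p] for every valid i."""
    n = len(seq)
    for p in range(1, n + 1):
        if all(seq[i] == seq[i + p] for i in range(n - p)):
            return p
    return n


def looks_like_tandem_repeat(seq: str, max_unit: int = 10, min_copies: int = 4) -> bool:
    """Heuristic: a unit of at most max_unit bases spans the sequence min_copies times or more."""
    if not seq:
        return False
    period = smallest_period(seq)
    return period <= max_unit and len(seq) >= period * min_copies


if __name__ == "__main__":
    # Sequences copied from the reports in this thread.
    reported = [
        "AACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACC",
        "CTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA",
        "CTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGT",
        "CAGACAGACAGACAGACAGACAGACAGACAGACAGACAGACAGACAGACAGACAGACAGA",
        "CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC",
    ]
    for seq in reported:
        unit = seq[:smallest_period(seq)]
        print(unit, looks_like_tandem_repeat(seq))
```

If a detected adapter is flagged this way, the workaround discussed above (dropping --detect_adapter_for_pe and relying on overlap analysis) is probably the safer choice for that sample.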