add adapter sequences from BGI/MGI sequencing data to built-in adapters

Open guidohooiveld opened this issue 5 years ago • 13 comments

Hi. I noticed that on the SEQanswers forum a document from BGI has been posted that lists all sequences for the oligos and primers used for BGISEQ/DNBSEQ/MGISEQ library preparation. See here for the thread (2nd post).

On page 7:

The following sequences are used to filter the adapter contamination in raw data.
Forward filter:  AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA
Reverse filter:  AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG

Could these 2 (or maybe all listed) sequences be added to the set of built-in adapters fastp uses?

Thanks, Guido

guidohooiveld avatar Jun 03 '20 19:06 guidohooiveld

Ok, I will add them.

sfchen avatar Jun 04 '20 07:06 sfchen

After a search, I cannot confirm that these two sequences are BGI-Seq adapters.

I will contact the BGI-Seq team to get their official adapter sequences, and update fastp accordingly.

sfchen avatar Jun 04 '20 08:06 sfchen

Great, thanks for your willingness to do this! BTW, out of curiosity, how did you check this, and why were you not able to confirm it?

guidohooiveld avatar Jun 04 '20 08:06 guidohooiveld

I have got a response from the BGI team; they will send me the adapter list in a couple of days.

I will then update the adapters and release a new fastp version.

sfchen avatar Jun 04 '20 13:06 sfchen

Being curious: was the BGI team able to provide the adapter sequences?

guidohooiveld avatar Jun 30 '20 07:06 guidohooiveld

Any update on this? I also just received my first BGISeq data.

Shellfishgene avatar Sep 02 '20 15:09 Shellfishgene

Kind reminder; I am about to receive another BGISeq data set. Thanks!

guidohooiveld avatar Oct 06 '20 11:10 guidohooiveld

Hi, I just got the sequences from MGI. I will update the built-in adapter sequences.

sfchen avatar Oct 08 '20 12:10 sfchen

I just added the MGI/BGI adapter sequences to the known adapters:

knownAdapters["AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"] = ">MGI/BGI adapter (forward)";
knownAdapters["AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG"] = ">MGI/BGI adapter (reverse)";

Could you please try the latest build, or use the latest prebuilt binary?

If you can upload a small MGI/BGI dataset, I can also give it a try.

sfchen avatar Oct 13 '20 06:10 sfchen
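For readers who want to see how a table like this could be used, here is a minimal C++ sketch. It is not fastp's actual implementation: `buildKnownAdapters`, `findKnownAdapter`, and the `seedLen` parameter are hypothetical names introduced for illustration, and the matching is exact, whereas a real trimmer would normally tolerate mismatches. Only the two sequences and their labels come from the comment above.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper (not fastp's code): a small table of known adapters
// keyed by sequence, holding the two MGI/BGI entries quoted in the thread.
std::map<std::string, std::string> buildKnownAdapters() {
    std::map<std::string, std::string> knownAdapters;
    knownAdapters["AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"] = ">MGI/BGI adapter (forward)";
    knownAdapters["AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG"] = ">MGI/BGI adapter (reverse)";
    return knownAdapters;
}

// Return the label of the first known adapter whose leading `seedLen` bases
// occur anywhere in the read, or an empty string if none is found.
// Assumption: exact matching only, no mismatches or quality handling.
std::string findKnownAdapter(const std::map<std::string, std::string>& adapters,
                             const std::string& read,
                             std::size_t seedLen = 12) {
    for (const auto& entry : adapters) {
        const std::string seed = entry.first.substr(0, seedLen);
        if (read.find(seed) != std::string::npos) {
            return entry.second;
        }
    }
    return "";
}
```

The two MGI/BGI sequences share their first 8 bases (`AAGTCGGA`), so a seed shorter than 9 bases cannot tell the forward adapter from the reverse one; the default of 12 in this sketch was chosen for that reason.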

Sorry for my delayed reply. I used the latest version on GitHub (0.21) and compared the results with those obtained with the previous version (0.20.1). To my surprise, both results were exactly the same. Is this expected, even though adapter trimming was likely already done by BGI? I would still have expected some BGI adapters to be found and trimmed, especially now that they are specifically searched for. The results of the two versions should therefore differ slightly (at least in the number of bases trimmed due to adapters) rather than be identical.

Filtering result:
reads passed filter: 43562268
reads failed due to low quality: 0
reads failed due to too many N: 0
reads failed due to too short: 0
reads failed due to low complexity: 2182
reads with adapter trimmed: 2837340
bases trimmed due to adapters: 14182202

Adapter or bad ligation of read1
The input has little adapter percentage (~0.217030%), probably it's trimmed before.
Adapter or bad ligation of read2
The input has little adapter percentage (~0.217030%), probably it's trimmed before.

fastp run command: fastp --in1 ./TEST_IN/RNA-1/RNA-1_1.fq.gz --in2 ./TEST_IN/RNA-1/RNA-1_2.fq.gz --out1=./TEST_OUT/RNA-1/RNA-1_1.fq.gz --out2=./TEST_OUT/RNA-1/RNA-1_2.fq.gz --low_complexity_filter --thread=16 --json ./TEST_OUT/RNA-1/RNA-1.fastp.json --html ./TEST_OUT/RNA-1/RNA-1.fastp.html

guidohooiveld avatar Oct 28 '20 22:10 guidohooiveld

Since your data is paired-end, fastp can trim the adapters without an adapter sequence being provided. So it already worked before.

sfchen avatar Oct 29 '20 00:10 sfchen

Aha, I got it. I was a little confused; I assumed that since the adapter sequence auto-detection is disabled by default for PE data, adapter detection overlap analysis would also be disabled. However, I now understand that these are 2 separate processes, and that for PE data the latter (= adapter detection by per-read overlap analysis) is always occurring (and apparently cannot be disabled). Hence, results between versions are identical...

guidohooiveld avatar Oct 30 '20 14:10 guidohooiveld
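The overlap analysis mentioned above can be illustrated with a short C++ sketch. This is a simplified illustration under stated assumptions, not fastp's algorithm: `reverseComplement`, `inferInsertLength`, `trimByOverlap`, and the `minOverlap` default are hypothetical, matching is exact (no mismatches or quality scores), and only the read-through case is handled, where the insert is no longer than the reads. The idea is that when the insert is shorter than the read, read 1 and the reverse complement of read 2 agree over the insert, and everything beyond it must be adapter, whatever its sequence.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Reverse-complement a DNA sequence; any base other than A/C/G/T becomes N.
std::string reverseComplement(const std::string& seq) {
    std::string rc(seq.rbegin(), seq.rend());
    for (char& base : rc) {
        switch (base) {
            case 'A': base = 'T'; break;
            case 'T': base = 'A'; break;
            case 'C': base = 'G'; break;
            case 'G': base = 'C'; break;
            default:  base = 'N'; break;
        }
    }
    return rc;
}

// Hypothetical helper: infer the insert length n such that the first n bases
// of read 1 equal the last n bases of reverse-complemented read 2.
// Returns 0 if no overlap of at least `minOverlap` bases is found.
std::size_t inferInsertLength(const std::string& read1,
                              const std::string& read2,
                              std::size_t minOverlap = 10) {
    const std::string rc2 = reverseComplement(read2);
    const std::size_t maxLen = std::min(read1.size(), rc2.size());
    // Try the longest candidate first to avoid short chance matches.
    for (std::size_t n = maxLen; n >= minOverlap && n > 0; --n) {
        if (read1.compare(0, n, rc2, rc2.size() - n, n) == 0) {
            return n;
        }
    }
    return 0;
}

// Trim both mates to the inferred insert length, removing read-through
// adapter bases without knowing the adapter sequence.
// Returns true if at least one read was shortened.
bool trimByOverlap(std::string& read1, std::string& read2,
                   std::size_t minOverlap = 10) {
    const std::size_t n = inferInsertLength(read1, read2, minOverlap);
    if (n == 0 || (n == read1.size() && n == read2.size())) {
        return false;
    }
    read1.resize(n);
    read2.resize(n);
    return true;
}
```

Because this kind of trimming never consults an adapter table, adding new entries to the built-in adapters would not be expected to change its output, which is consistent with the identical results reported for 0.20.1 and 0.21.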

> Aha, I got it. I was a little confused; I assumed that since the adapter sequence auto-detection is disabled by default for PE data, adapter detection overlap analysis would also be disabled. However, I now understand that these are 2 separate processes, and that for PE data the latter (= adapter detection by per-read overlap analysis) is always occurring (and apparently cannot be disabled). Hence, results between versions are identical...

Your fastp run command above did not include --detect_adapter_for_pe, which is turned off by default for PE data. Isn't that why you got the same result for the two runs (0.21 and 0.20.1)? For PE data, the adapter sequence auto-detection is disabled by default since the adapters can be trimmed by overlap analysis. However, you can specify --detect_adapter_for_pe to enable it.

orangeSi avatar Oct 16 '24 02:10 orangeSi