Allowing "justConcatenate = TRUE" to minimize read loss during the merging stage

Open biazen123 opened this issue 1 year ago • 7 comments

Hello, I am processing raw FASTQ data for 100 soil DNA samples (2 x 301 PE reads, Illumina MiSeq) for 16S (515F and 806R primers). However, I lost more than 50% of the reads during the preprocessing steps of the DADA2 workflow, a significant portion of them at the merging stage because of the low quality of the reverse reads.

I have reviewed most discussions on the DADA2 website and related articles, which suggest continuing with forward reads only (SE reads) if heavy read loss during merging is a concern. However, SE reads provide information from only one end of the DNA fragment, which may limit the accuracy of taxonomic assignment and reduce the ability to resolve certain features, such as distinguishing closely related species. Rather than using only SE reads, using both merged (overlapping) and concatenated (joined) paired-end reads together can provide a more comprehensive view of the microbial community by leveraging both the accuracy of merged reads and the paired-end information (Dacey and Chain. BMC Bioinformatics 22, 493 (2021). https://doi.org/10.1186/s12859-021-04410-2).

So my question is: can I use both merged and concatenated paired reads as a better alternative for the V4 region of 16S rRNA sequenced on an Illumina instrument with a 300-bp read length? Can I use "justConcatenate = TRUE" to minimize the loss of reads during the merging stage?
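For reference, here is a minimal sketch of how the two modes in question would be invoked. It assumes the dada2 package is installed and that `dadaFs`, `filtFs`, `dadaRs` and `filtRs` are the denoised objects and filtered file paths from the DADA2 tutorial workflow. The wrapper name `merge_or_concatenate` is hypothetical. With `justConcatenate = TRUE`, `mergePairs` joins the forward read and the reverse-complemented reverse read with a spacer of 10 Ns instead of overlapping them.

```r
# Hypothetical wrapper around dada2::mergePairs (requires the dada2 package
# when called; defining the function does not).
merge_or_concatenate <- function(dadaFs, filtFs, dadaRs, filtRs,
                                 concatenate = FALSE) {
  dada2::mergePairs(
    dadaFs, filtFs, dadaRs, filtRs,
    justConcatenate = concatenate,  # TRUE: join reads with a 10-N spacer
    verbose = TRUE
  )
}

# Usage (not run here; needs real DADA2 objects):
# mergers      <- merge_or_concatenate(dadaFs, filtFs, dadaRs, filtRs)
# concatenated <- merge_or_concatenate(dadaFs, filtFs, dadaRs, filtRs,
#                                      concatenate = TRUE)
```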

biazen123 avatar Dec 15 '23 14:12 biazen123

The sequenced amplicon you are working with is only 251-256 nts, so there is not really any meaningful gain in length of merged amplicon from joining forward/reverse reads here, if you are keeping ~250 nts of forward read.

One thing to be aware of is that you need to truncate these reads at 250nts (or less), otherwise you read into the other primer and adapter, which will cause problems with an ASV method like DADA2. If your truncation lengths were greater than 250, it is highly likely this is at least contributing to what you are observing.
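The arithmetic behind this advice can be sketched in base R. The helper name `overlap_after_truncation` is hypothetical. The 12-nt threshold is the default `minOverlap` of `mergePairs`. The truncation lengths in the usage lines are placeholders, not recommendations for this dataset.

```r
# Illustrative helper: how much the truncated forward and reverse reads
# overlap for a given amplicon length, and whether either read is long
# enough to run past the end of the amplicon.
overlap_after_truncation <- function(trunc_f, trunc_r, amplicon_len,
                                     min_overlap = 12) {
  overlap <- trunc_f + trunc_r - amplicon_len
  data.frame(
    amplicon_len = amplicon_len,
    overlap      = overlap,
    can_merge    = overlap >= min_overlap,
    read_through = trunc_f > amplicon_len | trunc_r > amplicon_len
  )
}

# Amplicon lengths quoted in this thread: 251-256 nt.
overlap_after_truncation(240, 160, c(251, 256))  # placeholder truncLen values
overlap_after_truncation(300, 300, c(251, 256))  # near-full-length 2 x 301 reads
```

In the second call both reads are longer than the amplicon, so `read_through` is TRUE. This is the case described above, where the reads extend into the opposite primer and adapter.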

benjjneb avatar Dec 15 '23 23:12 benjjneb

Hello @benjjneb Thank you very much for your technical services and suggestions.

biazen123 avatar Dec 17 '23 10:12 biazen123

Dear Benjjneb, I tried many times, changing the conditions for both F and R reads, but the results did not improve because of the very low quality of the reverse reads. Therefore, I decided to continue using only the forward reads (515F for V4). As a beginner in bioinformatics, I want to be clear about processing the forward reads via DADA2 in R. What is the expected length for V4 using only the F primer, and where should the reads be truncated? Is that based on the Q score or on minLen? Is it possible to use minLen = 50 or more? Thank you very much for your technical support. Regards

biazen123 avatar Dec 22 '23 13:12 biazen123

The length of the V4 amplicon using the most common primers for this region is 251-256 nts.

The most common V4 primer sets do not sequence the primers, so no need to use trimLeft or do any primer removal.

You can take a look at the DADA2 tutorial for an example of processing a V4 dataset. Typically the truncLen is chosen by looking at the quality scores via plotQualityProfile.

minLen won't matter, use truncLen instead.
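A minimal sketch of the quality-inspection step mentioned above, assuming the dada2 package is installed and `fnFs` is a character vector of forward FASTQ paths as in the DADA2 tutorial. The wrapper name `inspect_forward_quality` and its `n_files` argument are hypothetical.

```r
# Hypothetical wrapper: plot quality profiles for the first few forward
# FASTQ files so that a truncLen can be chosen by eye.
inspect_forward_quality <- function(fnFs, n_files = 2) {
  keep <- seq_len(min(n_files, length(fnFs)))
  dada2::plotQualityProfile(fnFs[keep])
}

# Usage (not run here; needs real FASTQ files):
# inspect_forward_quality(fnFs)
```

The position where the quality scores start to drop off is a reasonable candidate for truncLen.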

benjjneb avatar Dec 22 '23 14:12 benjjneb

Dear Benjjneb, thank you very much for your quick response and support. I am using the DADA2 tutorial script, i.e., the one for paired-end reads. I already removed the primers with cutadapt because the sequencing company informed me that the primers are present in the reads. Now I am trying to process only the forward reads through DADA2 with the following settings (the Illumina read length is 301):

filterAndTrim(fnFs, filtFs, maxN=0, maxEE=2, truncQ=2, minLen = 240, rm.phix=TRUE, compress=TRUE, multithread=TRUE) # On Windows set multithread=FALSE

So is it necessary to use truncLen for the forward reads instead of minLen = 240?

biazen123 avatar Dec 22 '23 14:12 biazen123

Yes, use truncLen instead of minLen.
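As a sketch of what that change could look like for the forward-only call quoted above: the wrapper name `filter_forward_only` is hypothetical, and the default of 240 is a placeholder to be replaced by a value read off the quality profile. The 250-nt ceiling in the guard comes from the advice earlier in this thread. Note that `filterAndTrim` discards reads shorter than truncLen.

```r
# Hypothetical wrapper: forward-only filtering with truncLen instead of minLen.
# Requires the dada2 package when called with valid arguments.
filter_forward_only <- function(fnFs, filtFs, trunc_len = 240,
                                multithread = TRUE) {
  # Ceiling taken from this thread: truncate V4 reads at 250 nt or less.
  stopifnot(trunc_len > 0, trunc_len <= 250)
  dada2::filterAndTrim(
    fnFs, filtFs,
    truncLen = trunc_len,        # reads shorter than this are discarded
    maxN = 0, maxEE = 2, truncQ = 2,
    rm.phix = TRUE, compress = TRUE,
    multithread = multithread    # on Windows set multithread = FALSE
  )
}

# Usage (not run here; needs real FASTQ files):
# out <- filter_forward_only(fnFs, filtFs, trunc_len = 240)
# head(out)  # matrix with reads.in and reads.out per sample
```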

benjjneb avatar Dec 22 '23 14:12 benjjneb

Thank you very much for your technical support. I will proceed based on the suggestion.

biazen123 avatar Dec 22 '23 15:12 biazen123