dada2 icon indicating copy to clipboard operation
dada2 copied to clipboard

Different sequence runs with different sequence lengths

Open larajan67 opened this issue 2 years ago • 8 comments

Hello, I am relatively new to DNA sequencing analyses so I appreciate the detailed tutorials and feedback on previous issues.

I have 16s rRNA data from 3 different sequence runs using the same primers. I have run all runs through the dada2 workflow using the same parameters for the 'FilterandTrim' step. 2 of the runs have reads mostly with a sequence length of ~214, while the 3rd run has reads mostly with a sequence length of ~239. I want to merge these runs together into a single dataset for downstream analysis, yet this difference of sequence lengths is an issue (see graph). readsbylength

What can I do to resolve this?

larajan67 avatar Jun 14 '22 20:06 larajan67

Do these different sequencing runs use the same library preparation approach?

Some amplicon library preps will include the primers on the sequenced part of the reads, while others will not. Even if the primers used are the same, these different library prep approaches will lead to difference in the sequenced amplicon lengths of the total length of the two primers, which is usually ~30 nts.

That's my guess for what is going on here. It's fixable, but first it would be good to confirm or disprove that possibility. One can do this by simply inspecting the raw sequences from each run... is the primers sequence present at the start of the reads or not?

benjjneb avatar Jun 15 '22 16:06 benjjneb

I have confirmed with my collaborators, who led the sequencing, that same library preparation approach was used.

I did find that the specific cutadapt code did not remove the primers in this 3rd run as it had for the first 2 runs. I slightly edited the cutadapt code and the primers were mostly removed (less than 1% remaining).

However, there remains still a difference in sequence length between the 3 runs: 1 & 2-214; 3-231 (see graph). Do you have any other ideas? Thank you for your help. lengthbyreads

larajan67 avatar Jun 17 '22 20:06 larajan67

I'd recommend removing the cutadapt step that seems to be introducing this spurious length variation. Just use trimLeft in filterAndTrim to remove the primers instead.

benjjneb avatar Jun 17 '22 23:06 benjjneb

I used the trimLeft in filterAndTrim instead of cutadapt. However, the sequence length difference remains with a difference of 14 between runs 1/2 and run 3. Any other ideas? Thank you, Lara

larajan67 avatar Jun 20 '22 18:06 larajan67

I would try aligning the most abundant ASVs in run 1 vs. run 3, and trying to identify where the additional sequence length observed in run 3 might be. I'm still betting it is some technical issue, but maybe not?

benjjneb avatar Jun 20 '22 22:06 benjjneb

I apologize for the delayed response. I had another project with higher priority.

As I am relatively new to dada2, I would like to double check my approach for alignment of specific sequences. Would this be the best approach after filtering for the most abundant ASVs? (see below) sequences<-getSequences(seqtab.nochim_most) alignment <- AlignSeqs(DNAStringSet(sequences), anchor=NA) phang.align <- phyDat(as(alignment, "matrix"), type="DNA")

Then I would examine the phang.align output?

larajan67 avatar Jul 05 '22 15:07 larajan67

That works. The signal we are looking for here is pretty crude (where are the extra bases in run 3 vs run 1?), so any approach that lets you view the alignemnt of a few top ASVs from each run should do the trick.

benjjneb avatar Jul 12 '22 20:07 benjjneb

Here are snapshots from the output of 'BrowseSeqs' for Run 1 (top) and Run 3 (bottom) for the most abundant ASVs. What do you recommend? Screen Shot 2022-08-06 at 3 43 14 PM Screen Shot 2022-08-06 at 3 43 28 PM

larajan67 avatar Jul 14 '22 22:07 larajan67