dada2 icon indicating copy to clipboard operation
dada2 copied to clipboard

Shortening 16S seqs using "within" primer sets

Open EdoardoUibk opened this issue 1 year ago • 4 comments

Hello, first of all thank you for all your very useful replies from different posts.

I work with 16S datasets from different sequencing experiments using different primer sets and would like to compare the V4 region among them. I know that taxonomic assignment is sensible when using ASV because of the different-length-issue, but if the same region is retained in different datasets, then the taxonomic should be comparable.

For example: I have 16S Illumina sequences obtained with primers 341f and 1378r that I would like to shorten only retaining the sequences between the 515f and the 806r primers (V4 region). I changed all un-ambiguous nucleotides in the primers to N. First, I trim the sequences with input FWD and REV with the 515f and 806r, respectively, and nicely find that most (95%) of the seqs contain these sequences. they are successfully removed. However, when I do a sanity check re-using the 341f and 1378r as FWD and REV input, I still find some seqs appearing in the primer count table.

  1. why is this? Shouldn't all 16S seqs contain the "515f" and "806r" sequences and thus, when applying cutadapt, remove all the extremeties outside?
  2. If I wanted to continue, I could remove the 341f and 1378r primers, but then the sequences would be of different length compared to the one where 515f and 806r primers were already removed... How could I solve this? Maybe "trimLeft-f=x"/"trimLeft-r=y" in the "filterAndTrim" command instead? where X and Y are the number of nucleotide between the -3' of the 515f/806r primers and the -5' of the sequences...? Or, maybe even better, just before assigning taxonomy, removing the extra parts outside the "515f"/"806r" primers? (If so how?)

Thank you in advance.

EdoardoUibk avatar Jul 26 '22 13:07 EdoardoUibk

It sounds like your issue here is with the behavior of cutadapt? If so, this might not be the best place to look for assistance. The one thing I can offer is that you may want to take a look at using the --discard-untrimmed flag to cutadapt so that reads that weren't trimmed to 505F-806R are discarded.

ps: You'll want to solve this with cutadapt, trying to fix things afterwards with trimLeft will run into the issues of introducin glength variation that you noted in your comment.

benjjneb avatar Jul 28 '22 17:07 benjjneb

Thank you for the advice.

I gave 515F/806 as FWD/REV input and run cutadapt: for(i in seq_along(fnFs)) { system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, "-m", 1, "--discard-untrimmed", "-o", fnFs.cut[i], "-p", fnRs.cut[i], fnFs.filtN[i], fnRs.filtN[i], ))}

All primers were removed. However, when I gave 341F (CTANGGGNNGCNNCAG) for FWD as "sanity-check", some sequences were still found in the output seqs...?

rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]), FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]), REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))

Forward Complement Reverse RevComp FWD.ForwardReads 2218 0 0 0 FWD.ReverseReads 0 0 0 2039 REV.ForwardReads 0 0 0 0 REV.ReverseReads 0 0 0 0

Maybe I'm missing something.. Thank you again.

EdoardoUibk avatar Aug 01 '22 10:08 EdoardoUibk

Is the number of detected primer sequences dramatically lower than there were present before cutadapt?

Some residual primer hits might just be down to differences in pattern matching methods and parameters between what cutadapt is using and what Biostrings is using. If it is the case that the vast majority of primers are being removed between before and after cutadapt, it may be OK to just move forward.

benjjneb avatar Aug 01 '22 16:08 benjjneb

Fantastic. In fact, the number of detected primers before cutadapt was much higher:

Forward Complement Reverse RevComp FWD.ForwardReads 43108 0 0 0 FWD.ReverseReads 0 0 0 36464 REV.ForwardReads 0 0 0 1996 REV.ReverseReads 48768 0 0 0

Thanks!

However, when I "truncLen=c(120,200)" instead of c(240,200) (because FWD seqs are shorter), and then I merge them, very very few seqs merge (e.g. 16 merged out of 67809). Shouldn't the overlapping part always be the same...?

table(nchar(getSequences(seqtab)))

200 239 257 295 296 298 299 300 11 1 1 1 2 12 385 21 I assume there is a problem on the overlapping part of the seqs, something about the mismatches allowed or the lengths of the overlapping... maybe I should check manually merging a couple of seqs to see the state? how could I do that? Maybe having the same problem than in #831

Thanks

EdoardoUibk avatar Aug 02 '22 08:08 EdoardoUibk