dada2

intuition for appropriate filtering parameters

Open wdwvt1 opened this issue 7 years ago • 33 comments

Working with several sequencing runs of varying quality, I am trying to develop intuition for what reasonable quality-filtering parameters are for the function filterAndTrim. The relevant parameters are truncLen, maxEE, and maxN (based on my reading).

Looking through closed issues, I would summarize the guidance as follows:

Thread: Maintain at least 10% of raw reads after all filtration steps (filterAndTrim as well as downstream merging). Use maxEE as the primary filtering parameter; values from 2 to 6 appear in various issues. Run a subset of samples through the entire pipeline to determine whether the full pipeline produces too few features.

Thread: 60% of reads passing filterAndTrim is good.

Thread: Finding the best trimming location may just mean doing a grid search.

Thread: For merging reads, don't go below an 8-nucleotide overlap.

Thread: Chimeras should generally be less than 30% of sequences.

My situation is the following: 300 bp paired-end reads (full overlap of forward and reverse reads), with forward reads generally of much higher quality than reverse reads. [Forward read image] [Reverse read image]

Forward and reverse reads seem like they ought to successfully span the full amplicon at fairly high quality using truncLen=c(225, 125). However, using truncLen=c(225, 125), maxEE=c(2,2), truncQ=11, I am losing 55-65% of sequences on average.
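To make the maxEE threshold concrete: the expected number of errors in a read is the sum of the per-base error probabilities implied by its Phred scores, EE = sum(10^(-Q/10)). The Python sketch below only illustrates that formula and is not dada2's code; the function name `expected_errors` and the example quality values are hypothetical.

```python
def expected_errors(quals, trunc_len=None):
    """Sum of per-base error probabilities, 10^(-Q/10), over a read.

    quals: list of integer Phred scores, one per base.
    trunc_len: if given, only the first trunc_len bases are counted.
    """
    if trunc_len is not None:
        quals = quals[:trunc_len]
    return sum(10 ** (-q / 10) for q in quals)


# Hypothetical read: 100 bases at Q30 followed by 25 bases at Q10.
quals = [30] * 100 + [10] * 25
print(round(expected_errors(quals), 2))       # whole read, including the low-quality tail
print(round(expected_errors(quals, 100), 2))  # truncated before the tail
```

In this toy case the low-quality tail alone pushes the read over a cutoff of 2 expected errors, while the truncated read passes easily, which is why the truncation length and maxEE interact so strongly.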

Two questions:

  1. Is this a reasonable number of sequences to lose at the filter step? Do I need to do a grid search through length, maxEE, and truncQ with some sort of function that maximizes sequences without dropping quality below a certain value? What maxEE/min truncQ would you use during this search?

  2. In your opinion, when you have fully overlapping amplicons, is paired-end merging worthwhile? On one hand, given that forward reads will have low quality towards the end and reverse reads high quality at the beginning, it seems that merging ought to give me full amplicons at reasonable confidence. On the other hand, if I just used the entire forward read (or even the first 250 nt), would I capture about 90% of what I'd get with a 300 bp amplicon, and significantly reduce my computational headaches?
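The grid search raised in question 1 could be sketched roughly as follows. This is a toy Python illustration, not dada2: `pass_fraction` and `grid_search` are hypothetical helpers, reads are represented as lists of Phred scores, and the only filter modeled is truncation to a fixed length followed by an expected-error cutoff, with reads shorter than the truncation length dropped. truncQ and maxN are not modeled.

```python
from itertools import product


def expected_errors(quals):
    """Sum of per-base error probabilities implied by Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def pass_fraction(reads, trunc_len, max_ee):
    """Fraction of reads that are at least trunc_len long and whose
    first trunc_len bases have expected errors <= max_ee."""
    if not reads:
        return 0.0
    kept = sum(
        1
        for q in reads
        if len(q) >= trunc_len and expected_errors(q[:trunc_len]) <= max_ee
    )
    return kept / len(reads)


def grid_search(reads, trunc_lens, max_ees):
    """Map every (trunc_len, max_ee) combination to the fraction retained."""
    return {
        (t, e): pass_fraction(reads, t, e)
        for t, e in product(trunc_lens, max_ees)
    }


# Hypothetical quality profiles, made up purely for illustration.
reads = [
    [35] * 150,              # high quality throughout
    [35] * 100 + [12] * 50,  # quality drops after base 100
    [35] * 120,              # good quality but short
]
for (t, e), frac in sorted(grid_search(reads, [100, 150], [2, 5]).items()):
    print(t, e, round(frac, 2))
```

A real search would need a second criterion alongside the retained fraction, for example the minimum overlap left for merging, since retention alone always favors the shortest truncation and the loosest maxEE.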

As usual, thanks for the great tool and the helpful feedback. The issues really are nice to have around to get a handle on the thinking behind the parameters etc.

wdwvt1 • Apr 28 '17 01:04