decontam

Optimal data for the frequency method?

Open mniku opened this issue 5 years ago • 4 comments

I’m slightly uncertain which DNA measures are optimal for the frequency-based method (especially in the case of samples containing animal tissues, where the proportion of microbial to animal DNA is often small and/or highly variable):

  • 16S qPCR measurements of the original DNA samples

  • DNA concentrations of the PCR products generated for Illumina sequencing (1st round / 2nd round PCR)

Obviously, only the qPCR data reflect the actual amount of starting material. On the other hand, there are many steps between that measurement and the final sequence data, so the final DNA amounts used in sequencing are something completely different.

How should we evaluate the applicability of the frequency-based method in specific cases? For example, how high can read counts in negative controls be, relative to actual samples, and still be acceptable for the statistics?

mniku avatar Nov 06 '18 13:11 mniku

We believe that both types of DNA quantitation data will work. We have tested more extensively with DNA concentrations measured post-PCR and prior to sequencing, simply because those data are more often available: they are generated "for free" as part of the usual sequencing workflow. In more limited testing on qPCR data the method still seems to work, and other publications report strong patterns of inverse frequency of contaminants using qPCR data, which is the pattern the frequency method relies on.
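The inverse-frequency pattern mentioned above can be illustrated with a small sketch. It compares two fixed-slope models on log-transformed data: a contaminant-like model in which frequency falls in proportion to DNA concentration (slope -1), and a non-contaminant-like model in which frequency does not depend on concentration (slope 0). This is only an illustration of the pattern, not decontam's code; the function names and the example numbers are hypothetical.

```python
import math


def fit_rss(log_conc, log_freq, slope):
    """Residual sum of squares for log_freq = intercept + slope * log_conc.

    The slope is fixed; only the intercept is fitted (least squares).
    """
    residuals = [y - slope * x for x, y in zip(log_conc, log_freq)]
    intercept = sum(residuals) / len(residuals)
    return sum((r - intercept) ** 2 for r in residuals)


def rss_ratio(conc, freq):
    """Contaminant-model RSS divided by non-contaminant-model RSS.

    Values well below 1 mean the inverse-concentration model fits better.
    Assumes all concentrations and frequencies are strictly positive.
    """
    log_conc = [math.log(c) for c in conc]
    log_freq = [math.log(f) for f in freq]
    rss_contaminant = fit_rss(log_conc, log_freq, slope=-1.0)
    rss_noncontaminant = fit_rss(log_conc, log_freq, slope=0.0)
    if rss_noncontaminant == 0.0:
        return math.inf  # frequency is perfectly constant across samples
    return rss_contaminant / rss_noncontaminant


# Hypothetical data: DNA concentration per sample (arbitrary units).
conc = [1, 2, 4, 8, 16]
# Frequency halves each time the concentration doubles: contaminant-like.
contaminant_like = [0.4, 0.2, 0.1, 0.05, 0.025]
# Frequency roughly constant regardless of concentration: sample-derived.
sample_like = [0.10, 0.12, 0.09, 0.11, 0.10]

print(rss_ratio(conc, contaminant_like))
print(rss_ratio(conc, sample_like))
```

The same comparison applies whether `conc` holds qPCR measurements or post-PCR concentrations, as long as the values come from samples processed together.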

> How should we evaluate the applicability of the frequency-based method in specific cases? For example, how high can read counts in negative controls be, relative to actual samples, and still be acceptable for the statistics?

The simplest and most useful evaluation is to inspect the distribution of scores assigned by the method. The expectation is that there will be a strong mode at low scores. In the cleanest cases the distribution will be clearly bimodal, while in other datasets the high-score mode is wider and more diffuse. However, the low-score mode should still be present, and it should be used to set the P* score threshold for identifying contaminants.
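As a rough sketch of that inspection, the snippet below bins a set of scores into a text histogram and counts how many fall below a candidate threshold. The scores are made-up values standing in for the per-taxon scores the method would produce, and `histogram` and `flag_contaminants` are hypothetical helper names, not part of decontam.

```python
def histogram(scores, n_bins=10):
    """Count scores in n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for s in scores:
        # A score of exactly 1.0 is placed in the last bin.
        idx = min(int(s * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


def flag_contaminants(scores, threshold):
    """Return True for each score strictly below the chosen threshold."""
    return [s < threshold for s in scores]


# Hypothetical scores: a tight low-score mode and a diffuse high-score mode.
example_scores = [
    0.01, 0.02, 0.03, 0.05, 0.08, 0.12,
    0.45, 0.55, 0.61, 0.63, 0.72, 0.74,
    0.78, 0.83, 0.86, 0.91, 0.96, 0.99,
]

counts = histogram(example_scores)
for i, count in enumerate(counts):
    lower, upper = i / len(counts), (i + 1) / len(counts)
    print(f"{lower:.1f}-{upper:.1f} | {'#' * count}")

# Threshold chosen by eye, in the gap just above the low-score mode.
flags = flag_contaminants(example_scores, threshold=0.2)
print(sum(flags), "of", len(flags), "scores fall below the threshold")
```

In this toy example the gap between the two modes makes the choice easy; with a more diffuse high-score mode, the threshold is a judgment call best checked against a few individual taxa.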

Another approach is to inspect a few of the identified contaminants using the plot_frequency function.

benjjneb avatar Nov 07 '18 15:11 benjjneb

Thanks, this is now clear!

mniku avatar Nov 07 '18 15:11 mniku

How would the pre-pooling DNA concentration be valuable if samples are pooled to be equimolar? Would using that data still make sense? I ran my samples together with other people whose samples have different equimolar concentrations. In that case, would that be more valuable data to use? Thank you!

rturba avatar May 14 '20 00:05 rturba

@rturba

> How would the pre-pooling DNA concentration be valuable if samples are pooled to be equimolar? Would using that data still make sense?

Yes. Pre-pooling DNA concentrations still track the fraction of the sample reads that derive from the sample vs. from contaminants.

> I ran my samples together with other people whose samples have different equimolar concentrations. In that case, would that be more valuable data to use?

Probably not. For identifying contaminants, you want to use samples that share the same sample preparation history. Samples that were prepared differently will typically just add noise.

benjjneb avatar May 14 '20 00:05 benjjneb