Positive controls and decontam (reposting issue #83)

Open tiagojp opened this issue 3 years ago • 4 comments

Dear Benjamin,

I am reposting issue #83.

I have the following set of samples (16S rRNA reads), which were sequenced together on the Illumina (MiSeq 2 x 300 bp) platform:

Soil: extracted/PCR/sequenced samples containing soil DNA
Soil Blank: extracted/PCR/sequenced samples containing the Zymo extraction kit only (no DNA)
PCR Soil Negative Control: PCR/sequenced samples containing PCR kit and water only (no DNA)
PCR Soil Positive Control: PCR/sequenced samples containing PCR kit and a known ZymoMock community

Singleworm: extracted/PCR/sequenced samples containing microbiome DNA from single worms
Singleworm blank: extracted/PCR/sequenced samples containing extraction buffer only. This buffer is the same used for the Singleworm samples, but different from the Zymo kit used for the soil samples.
PCR Worm Negative Control: PCR/sequenced samples containing PCR kit and water only (no DNA)
PCR Worm Positive Control: PCR/sequenced samples containing PCR kit and a known ZymoMock community (same as for PCR Soil Positive Control)

My questions are:

1- Since these samples were sequenced together (i.e., same Illumina run), should I consider all samples when using Decontam? Basically, should I consider all the negative controls (i.e., soil blank, PCR soil negative, Singleworm blank, PCR Worm Negative Control) when separating samples into "True" and "Control" samples?

Blank samples (soil and worm) involved three steps (1- DNA extraction, 2- PCR, 3- Sequencing), whereas PCR controls (negative and positive) involved only steps 2-3. I guess this difference is not an issue; however, the kits/buffers differ between soil and single worms. So, do you think a breakdown of the dataset between soil and single worm samples (with their respective controls) is more appropriate?

2- What about the positive controls? Should I also consider them as "control samples" when using decontam? I think this could affect the "frequency" method since we would have a sample with high DNA concentration (at least compared to blank and negative controls) and potentially a high number of reads/ASVs assigned to the "Control group" (see figure attached, only soil dataset). However, for the "prevalence" method it could be useful as it does not rely on the DNA concentration of samples, but instead on the presence/absence of a feature across samples. What are your thoughts on that?

[Figure: 16S_Soil_library-size-sample-control-annotated]

Final comment on the difference between methods: all three methods (frequency, prevalence, and combined) are working properly on my dataset. The frequency method finds more contaminants (146 true) than the "prevalence" method (23 and 82 true at thresholds 0.1 and 0.5) or the combined method (45 true at threshold 0.1). With the combined method and a higher threshold (e.g. 0.5), Decontam finds more contaminants (540), though still on the same order of magnitude as the frequency method. Regardless of the filtering method (and threshold values), the patterns across samples are still quite strong (i.e. significant differences among habitats/soil types).

Thanks a lot for your help!

Sincerely,

Tiago

tiagojp, Jan 06 '21 18:01

@tiagojp Quick comment: I apologize for missing your earlier issue #83. I'm not sure how that happened! More to come.

benjjneb, Jan 06 '21 18:01

First, I think it is good you are taking the idea of contamination very seriously given that the library sizes of your negative controls and true samples are intermixed -- this suggests that contamination could be a real issue.

(1) Contamination comes from a variety of sources, but in my experience the most important sources are from the reagents used in PCR and library preparation.

> So, do you think a breakdown of the dataset between soil and single worm samples (with their respective controls) is more appropriate?

In theory, yes. However, how many negative controls do you have when you split your data that way? It is also important to have multiple negative controls in each batch to effectively identify contaminants.
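A minimal sketch of that check, assuming a hypothetical sample sheet: it counts how many negative controls each batch would have after splitting into soil and worm. The sample IDs and type labels are made up for illustration, and it is written in plain Python although decontam itself is an R package.

```python
from collections import Counter

# Hypothetical sample sheet: (sample_id, batch, sample_type).
# The types mirror those described in the issue; the IDs are invented.
SAMPLES = [
    ("S1", "soil", "sample"),
    ("S2", "soil", "sample"),
    ("SB1", "soil", "extraction_blank"),
    ("SN1", "soil", "pcr_negative"),
    ("SP1", "soil", "pcr_positive"),
    ("W1", "worm", "sample"),
    ("WB1", "worm", "extraction_blank"),
    ("WB2", "worm", "extraction_blank"),
    ("WN1", "worm", "pcr_negative"),
    ("WP1", "worm", "pcr_positive"),
]

# Only true negatives count; positive (mock community) controls do not.
NEGATIVE_TYPES = {"extraction_blank", "pcr_negative"}


def negatives_per_batch(rows):
    """Return {batch: number of negative controls} for every batch seen."""
    counts = Counter()
    for _sample_id, batch, sample_type in rows:
        # Adding 0 still registers the batch, so batches with no negatives show up.
        counts[batch] += 1 if sample_type in NEGATIVE_TYPES else 0
    return dict(counts)


print(negatives_per_batch(SAMPLES))
```

A batch that ends up with only one or zero negative controls would be a reason to reconsider the split.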

> 2- What about the positive controls? Should I also consider them as "control samples" when using decontam? I think this could affect the "frequency" method since we would have a sample with high DNA concentration (at least compared to blank and negative controls) and potentially a high number of reads/ASVs assigned to the "Control group" (see figure attached, only soil dataset). However, for the "prevalence" method it could be useful as it does not rely on the DNA concentration of samples, but instead on the presence/absence of a feature across samples. What are your thoughts on that?

I don't disagree with your reasoning. They should not be used in the frequency method. They perhaps aren't ideal for the prevalence method either (which was developed for negative controls) but would still kind of work.
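One way to act on this is to drop the positive controls before building the inputs, then flag only the true negatives. The sketch below shows that labeling step under assumed names: the sample sheet, its field names, and the helper `decontam_inputs` are hypothetical. The resulting `is_neg` and `batch` lists are meant to line up with the `neg` and `batch` arguments of decontam's `isContaminant` in R; the Python here only prepares the labels and does not run decontam.

```python
# Hypothetical sample sheet; field names and IDs are assumptions for illustration.
SAMPLE_SHEET = [
    {"id": "S1", "batch": "soil", "type": "sample"},
    {"id": "SB1", "batch": "soil", "type": "extraction_blank"},
    {"id": "SN1", "batch": "soil", "type": "pcr_negative"},
    {"id": "SP1", "batch": "soil", "type": "pcr_positive"},
    {"id": "W1", "batch": "worm", "type": "sample"},
    {"id": "WB1", "batch": "worm", "type": "extraction_blank"},
    {"id": "WP1", "batch": "worm", "type": "pcr_positive"},
]

NEGATIVE_TYPES = {"extraction_blank", "pcr_negative"}
POSITIVE_TYPES = {"pcr_positive"}


def decontam_inputs(sheet):
    """Drop positive controls, then return parallel lists (ids, is_neg, batch)."""
    kept = [row for row in sheet if row["type"] not in POSITIVE_TYPES]
    ids = [row["id"] for row in kept]
    is_neg = [row["type"] in NEGATIVE_TYPES for row in kept]
    batch = [row["batch"] for row in kept]
    return ids, is_neg, batch


ids, is_neg, batch = decontam_inputs(SAMPLE_SHEET)
for sample_id, neg, b in zip(ids, is_neg, batch):
    print(sample_id, b, "negative control" if neg else "true sample")
```

Keeping the three lists parallel means the same row order can be used to subset the feature table before it is passed on.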

benjjneb, Jan 06 '21 18:01

Dear Benjamin,

Thanks a lot for your feedback. See my answers below.

> (1) Contamination comes from a variety of sources, but in my experience the most important sources are from the reagents used in PCR and library preparation.

I totally agree. In my dataset it is quite clear that both PCR and Extraction controls have potential contaminants due to the kits/reagents used. In fact, it seems I have an additive effect since Extraction controls have more true contaminants than PCR controls (something expected). That is true for both soil and single worm samples.

> In theory, yes. However, how many negative controls do you have when you split your data that way? It is also important to have multiple negative controls in each batch to effectively identify contaminants.

I still have plenty of controls (both PCR and Extraction) when separating the dataset into soil and worm datasets. Also, by inspecting the library size for the entire dataset (i.e. soil and worm combined) it is clear that the number of potential contaminants in soil and worm Extraction controls also differs (higher in the latter), thus suggesting an effect of the kits/reagents used for these two extraction methods.

> I don't disagree with your reasoning. They should not be used in the frequency method. They perhaps aren't ideal for the prevalence method either (which was developed for negative controls) but would still kind of work.

This is very helpful. I can certainly redo these analyses without the positive controls and see how that impacts the results!

Best,

Tiago

tiagojp, Jan 06 '21 19:01

Hope that helped, please comment again if you have more questions! (that will ping my email which is basically my to-do list).

> Also, by inspecting the library size for the entire dataset (i.e. soil and worm combined) it is clear that the number of potential contaminants in soil and worm Extraction controls also differs (higher in the latter), thus suggesting an effect of the kits/reagents used for these two extraction methods.

Again, you've identified a potential contaminant issue that relates to different batches (processing-wise) and to different experimental conditions.

You have identified something very important! I would suggest that the contaminant portion of that (unfortunately) won't be sufficient, but that you should probably also include a "study" term in future linear models of the outcome you are considering.

benjjneb, Jan 06 '21 19:01