
Three different types of controls

Open Jasmine84 opened this issue 4 years ago • 9 comments

Hello Benjamin,

I have two questions for you regarding the prevalence method in decontam at the ASV level.

QUESTION 1: I assume that the control samples should be the same (resemble each other in type of control). So, I have divided my decontam filtration into three steps - one step of contaminant identification/filtration per control type - as I have three different types of controls.

First, I check for contaminants against the first control (Control1), then use the object after contaminant filtration to run it again against the second control (Control2), and so on - three times in total. However, I don't know if it would be better to just make one run with only "one" control category.

I have done it both ways, and I identify more contaminants with the three-step method than with the one-run method. I think it is best to choose the three-step method; however, I am uncertain about how decontam uses the other control samples.

For example, look at the metadata and R syntax below. Will decontam be confused by Control2 and Control3? As I read the R code - Step 1 defines the control, Step 2 makes a presence/absence table for the different ASVs - are Control2 and Control3 regarded as samples here?

QUESTION 2: To decide which threshold is appropriate to use, I have plotted the histogram of p-scores (see screenshot). Any threshold between 0.1 and 0.5 is appropriate, but 0.4 or 0.5 is best - is that correct?

The metadata (a `Type` column, or alternatively a lumped `Use Type` column):

| Type     | Use Type |
|----------|----------|
| Control1 | Control  |
| Control2 | Control  |
| Control3 | Control  |
| Sample   | Sample   |
| Sample   | Sample   |

Step 1 - Define your control in the sample data (first Control1):

```r
sample_data(data)$is.neg <- sample_data(data)$Type == "Control1"
```

Step 2a - threshold 0.1:

```r
contamdf.prev01 <- isContaminant(data, method="prevalence", neg="is.neg", threshold=0.1, detailed=TRUE)
table(contamdf.prev01$contaminant)
```

Step 2b - threshold 0.5:

```r
contamdf.prev05 <- isContaminant(data, method="prevalence", neg="is.neg", threshold=0.5)
table(contamdf.prev05$contaminant)
```

Step 3 - Look at the histogram to choose the threshold cutoff:

```r
# Per https://github.com/benjjneb/decontam/issues/41, this should be a plot
# with roughly two clear arms: true taxa and contaminant taxa
hist(contamdf.prev05$p, 100, main="P-values for different thresholds in decontam - Control1")
```

[screenshot: histogram of decontam p-scores for Control1]

You can choose 0.1 to 0.5 due to the low scores - probably choose 0.1.

Step 4:

```r
ps.pa <- transform_sample_counts(data, function(abund) 1*(abund>0))
ps.pa.neg <- prune_samples(sample_data(ps.pa)$Type == "Control1", ps.pa)
ps.pa.pos <- prune_samples(sample_data(ps.pa)$Type == "Sample", ps.pa)
```
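For completeness, the decontam introduction vignette continues from here with a prevalence plot along these lines (a sketch, assuming `contamdf.prev01` from Step 2a):

```r
library(ggplot2)
# Prevalence of each ASV in negative controls vs. true samples, colored by
# whether decontam flagged it as a contaminant at threshold 0.1
df.pa <- data.frame(pa.pos = taxa_sums(ps.pa.pos),
                    pa.neg = taxa_sums(ps.pa.neg),
                    contaminant = contamdf.prev01$contaminant)
ggplot(df.pa, aes(x = pa.neg, y = pa.pos, color = contaminant)) +
  geom_point() +
  xlab("Prevalence (Negative Controls)") + ylab("Prevalence (True Samples)")
```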

Thanks for a great R package and, in advance, for the help with the above! Best regards, Jasmine

Jasmine84 avatar Nov 20 '19 18:11 Jasmine84

> Will decontam be confused by Control2 and Control3? As I read the R code - Step 1 defines the control, Step 2 makes a presence/absence table for the different ASVs - are Control2 and Control3 regarded as samples here?

If you provide NA values, those samples will be excluded from the contaminant model comparisons. So you could put NA for the Control2 and Control3 sample types, and it should then work as expected: using the Control1-type samples alone and ignoring the Control2/3 samples entirely.
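In code, that could look something like this (a minimal sketch, reusing the `Type` column and phyloseq object `data` from above):

```r
# Mark Control1 samples as negative controls; set the other control types to
# NA so decontam excludes them from the model comparisons entirely
is.neg <- sample_data(data)$Type == "Control1"
is.neg[sample_data(data)$Type %in% c("Control2", "Control3")] <- NA
sample_data(data)$is.neg <- is.neg
contamdf <- isContaminant(data, method="prevalence", neg="is.neg", threshold=0.1)
```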

benjjneb avatar Nov 20 '19 21:11 benjjneb

> To decide which threshold is appropriate to use, I have plotted the histogram of p-scores (see screenshot). Any threshold between 0.1 and 0.5 is appropriate, but 0.4 or 0.5 is best - is that correct?

Your procedure looks correct to me, and thresholds across the range 0.1 to 0.5 are all justifiable. To pick the "best" one, you can inspect the additional contaminants identified by raising the threshold and see whether they are consistent with being contaminants, e.g. by belonging to unexpected or commonly-seen-as-contaminant genera. But pinning down the single "best" threshold is hard.
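A minimal sketch of that inspection (assuming the `contamdf.prev01`/`contamdf.prev05` objects from above, and a taxonomy table in the phyloseq object):

```r
# ASVs flagged at threshold 0.5 but not at 0.1 - do their genus assignments
# look like typical reagent contaminants?
extra <- contamdf.prev05$contaminant & !contamdf.prev01$contaminant
tax_table(data)[extra, "Genus"]
```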

I would suggest that this data indicates you would benefit from having more controls. Prevalence-based contaminant identification is limited by the number of negative controls, and we strongly recommend having more such controls given the commonly observed non-uniformity of contaminants across replicate negative controls.

benjjneb avatar Nov 20 '19 21:11 benjjneb

Looking at this histogram, I would lean toward higher thresholds for my cutoffs. If my distribution looked like that, my understanding would be to try 0.5 and 0.6 as cutoffs. A cutoff of 0.1 seems like it would include nearly all ASVs.

Am I misunderstanding this?

Also, it sounds like you are suggesting lumping the controls ("I would suggest that this data indicates you would benefit from having more controls"); however, my understanding from the OP is that their controls are different - for example, 3 different control types: blank controls, blank controls with sterile swabs added, and blank controls with some open exposure to the workbench. In such a case, I would think treating them as separate and decontam'ing step-wise would be more appropriate, as there may not be much agreement between the 3 control types.
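Something like the following is what I mean by step-wise (a rough sketch, reusing the `Type` column from the OP and the NA trick from above):

```r
# Screen against one control type at a time, NA-ing out the other control
# types so they are ignored, and prune flagged ASVs before the next pass
controls <- c("Control1", "Control2", "Control3")
ps.clean <- data
for (ctrl in controls) {
  is.neg <- sample_data(ps.clean)$Type == ctrl
  is.neg[sample_data(ps.clean)$Type %in% setdiff(controls, ctrl)] <- NA
  sample_data(ps.clean)$is.neg <- is.neg
  contam <- isContaminant(ps.clean, method="prevalence", neg="is.neg", threshold=0.1)
  ps.clean <- prune_taxa(!contam$contaminant, ps.clean)
}
```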

What do you think?

Ziggsblitz avatar Nov 21 '19 17:11 Ziggsblitz

> Looking at this histogram, I would lean toward higher thresholds for my cutoffs. If my distribution looked like that, my understanding would be to try 0.5 and 0.6 as cutoffs. A cutoff of 0.1 seems like it would include nearly all ASVs.
>
> Am I misunderstanding this?

I shouldn't have included 0.1 in my previous comment. The range 0.2 to 0.5 is all justifiable, and I would also tend towards a threshold in the 0.4 to 0.5 range. In general, I would avoid thresholds above 0.5 unless I had a strong prior that a large fraction of the sample is contamination, because any score > 0.5 indicates that the non-contaminant model was a better fit than the contaminant model.
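One quick way to see what each cutoff in that range would do (a sketch using the `contamdf.prev05` scores from above):

```r
# Number of ASVs that would be called contaminants at each candidate
# threshold; decontam flags ASVs whose score falls below the chosen cutoff
sapply(c(0.1, 0.2, 0.3, 0.4, 0.5),
       function(t) sum(contamdf.prev05$p < t, na.rm = TRUE))
```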

benjjneb avatar Nov 22 '19 22:11 benjjneb

> Also, it sounds like you are suggesting lumping the controls ("I would suggest that this data indicates you would benefit from having more controls"); however, my understanding from the OP is that their controls are different - for example, 3 different control types: blank controls, blank controls with sterile swabs added, and blank controls with some open exposure to the workbench. In such a case, I would think treating them as separate and decontam'ing step-wise would be more appropriate, as there may not be much agreement between the 3 control types.

I just meant more controls here, not necessarily grouping them. With more controls you will probably get a stronger separation into a bimodal distribution, rather than the flat-ish tail at low scores here.

benjjneb avatar Nov 22 '19 22:11 benjjneb

@benjjneb Thanks! I have tried it with more controls and it looks better - a clearer cut-off at 0.3 to 0.4. However, I have looked at the number of ASVs in my three control samples, and I find 867 ASVs (out of 41,000 ASVs across samples and controls). With the prevalence method isContaminant, I identify 294 ASVs using the three-step method. The input is an OTU table of raw counts, not relative abundances, as there is R code for that in decontam:

```r
ps.pa <- transform_sample_counts(ps, function(abund) 1*(abund>0))
```

Have I understood that correctly? And another question: should I use the isNotContaminant method instead of isContaminant?

Jasmine84 avatar Nov 25 '19 21:11 Jasmine84

> The input is an OTU table of raw counts, not relative abundances, as there is R code for that in decontam.

Yep, that's fine - decontam will convert raw counts or relative abundances to presence/absence just the same.
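A tiny illustration of why either input works (plain R, no decontam required):

```r
# Presence/absence is identical whether computed from raw counts or from
# relative abundances, since dividing by the sample total preserves zeros
counts <- c(0, 3, 120, 0, 7)
rel <- counts / sum(counts)
identical(1 * (counts > 0), 1 * (rel > 0))  # TRUE
```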

> And another question: should I use the isNotContaminant method instead of isContaminant?

Nope. isNotContaminant is intended for situations in which most of the DNA in the samples is contamination, i.e. extremely low microbial biomass or potentially zero microbial biomass samples. That's not your situation, so you should use the normal isContaminant function.
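For contrast, a minimal sketch of that call (only appropriate for very-low-biomass data, which is not the case here):

```r
# isNotContaminant scores each ASV for evidence that it is NOT a contaminant,
# treating the rest of a mostly-contaminant sample as presumed contamination
notcontamdf <- isNotContaminant(data, neg="is.neg", detailed=TRUE)
```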

benjjneb avatar Nov 25 '19 23:11 benjjneb

Super, thanks! I have another question about understanding the p-threshold:

I previously used the threshold 0.5 at all three steps above, even though the histogram of p-scores did not suggest it - I wanted to be sure to exclude the contaminants. This was before I discovered the histogram plot for evaluating which threshold to use. However, I am not sure I understand the threshold. You write in the article that contaminants occur at a higher frequency in the controls, which I would interpret as using a threshold of 0.5. Can you elaborate on this?
Davis et al. Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome. 2018.

Jasmine84 avatar Nov 26 '19 18:11 Jasmine84

> You write in the article that contaminants occur at a higher frequency in the controls, which I would interpret as using a threshold of 0.5.

Yes, the threshold of 0.5 for the prevalence method has that simple interpretation.

> Can you elaborate on this?

More stringent thresholds (e.g. 0.1) require a higher level of evidence that the underlying prevalence in the controls is higher than in the true samples. Remember, the observed prevalence is just an estimate of the true underlying prevalence, which is what we are really interested in. So if, say, an ASV appears in 1 out of 2 controls and 49 out of 100 samples, it is observed to have a higher prevalence in the controls (50% vs. 49%), but our confidence that the true prevalence in the controls is higher is low, since it is based on so few observations.
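To make that intuition concrete, here is a rough sketch (decontam's actual score is computed differently; this just shows how weak the evidence is):

```r
# 2x2 presence/absence table: present/absent (rows) x controls/samples (cols)
# for an ASV seen in 1 of 2 controls and 49 of 100 true samples
tab <- matrix(c(1, 1, 49, 51), nrow = 2)
fisher.test(tab)$p.value  # close to 1: little evidence of higher prevalence in controls
```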

benjjneb avatar Nov 27 '19 14:11 benjjneb