decontam
Low bacterial biomass and several batches
Hi, I'm working with data containing loads of human DNA and little, if any, bacterial DNA. To avoid cross-sample contamination, we performed DNA extraction and library prep alternating each tissue sample with a water sample, so we have about as many negative controls as real samples. On the other hand, that has spread our samples over 6 batches.
What's the best way to process this? isContaminant, allowing for batch effects? isNotContaminant on each run individually? Or maybe there's a good reason for not including batch effects in the isNotContaminant routine?
A caveat: I have some suggestions, but they are based on still-limited experience with extremely low-biomass samples (i.e. contaminants >> sample), so don't take them as gospel.
The first thing I would suggest is to remove as much human DNA as possible by mapping against the human reference genome, and then to consider a second pass where you assign taxonomy to the remaining reads and remove those that get classified as Eukaryotes. When you know the contaminating source, this sort of approach is more powerful than any de novo approach (e.g. decontam) can really hope to be.
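The host-mapping step itself is typically done with a standalone aligner outside R; the second pass, though, can be sketched roughly as below. This is only a sketch under assumptions: `seqtab` stands in for a DADA2-style samples-by-ASV count table, and the Silva file name is a placeholder for whatever training set you actually use.

```r
library(dada2)

# Second-pass filter: assign taxonomy to the ASVs that survive host-read
# removal, then drop anything not classified as Bacteria or Archaea.
# `seqtab` and the reference file name are placeholders, not from this thread.
tax <- assignTaxonomy(seqtab, "silva_nr99_v138_train_set.fa.gz", multithread = TRUE)

# NA kingdoms (often residual host or other off-target reads) are dropped
# along with anything classified as Eukaryota.
keep <- tax[colnames(seqtab), "Kingdom"] %in% c("Bacteria", "Archaea")
seqtab.bact <- seqtab[, keep, drop = FALSE]
```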
Now, that probably won't solve all your problems, as some human DNA that looks like bacterial DNA can slip through, and you'll still have bacterial contaminants. If it is still the case that contamination >> sample, then I would use the isNotContaminant
approach to identify the best candidate non-contaminants. I would consider it to be a ranking rather than a simple classifier, at least at first, and would evaluate the top (lowest) scores by that method versus random draws from the sequence pool to get a sense of whether it is working effectively.
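To make the "ranking, not classifier" idea concrete, here is a rough R sketch. The object names (`seqtab.bact`, `is_neg`) are assumptions, and the random-draw comparison is only a crude sanity check, not a formal test.

```r
library(decontam)

# seqtab.bact: samples-by-feature count matrix (samples in rows);
# is_neg: logical vector, TRUE for the water controls, FALSE for tissue samples.
nc <- isNotContaminant(seqtab.bact, neg = is_neg, detailed = TRUE)

# Treat the score as a ranking: the lowest scores are the strongest
# non-contaminant candidates, rather than making a hard in/out call.
ranked <- nc[order(nc$p), ]
top20 <- rownames(ranked)[1:20]

# Rough sanity check: prevalence in the true (tissue) samples of the
# top-ranked features versus a random draw of the same size.
prev_in_tissue <- function(feats) {
  colSums(seqtab.bact[!is_neg, feats, drop = FALSE] > 0)
}
summary(prev_in_tissue(top20))
summary(prev_in_tissue(sample(colnames(seqtab.bact), 20)))
```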
I think, given your design, that I would do this on a per-batch basis based on my previous observations that contaminants can differ quite a bit between sequencing runs. However, I would probably also try just pooling everything together as well, and checking if there are major discrepancies or not.
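A sketch of how that per-batch versus pooled comparison could be set up, assuming a `batch` factor aligned with the sample rows of the same hypothetical `seqtab.bact`/`is_neg` objects as above:

```r
library(decontam)

# `batch`: assumed factor with one level per extraction/sequencing run,
# aligned with the rows (samples) of seqtab.bact.
per_batch <- lapply(split(seq_len(nrow(seqtab.bact)), batch), function(idx) {
  isNotContaminant(seqtab.bact[idx, , drop = FALSE], neg = is_neg[idx],
                   detailed = TRUE)
})

pooled <- isNotContaminant(seqtab.bact, neg = is_neg, detailed = TRUE)

# Quick check for discrepancies: rank-correlate each batch's scores with the
# pooled scores over the features scored in both.
sapply(per_batch, function(d)
  cor(d$p, pooled$p, method = "spearman", use = "pairwise.complete.obs"))
```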
Hope that helps, and again, consider these just my best suggestions based on limited real experience with your situation. We'd be happy to hear back on what you find, good or bad, as well!
Thanks for the reply!
I had already done most of the steps above, including removing all 18S and mitochondrial sequences. The insight I had today was taking it to genus level, rather than trying to work with ASVs. That, combined with running isNotContaminant per run (roughly as sketched below), gave interesting results. The key really was to look at the p-values, rather than the binary classification. Which leads me to my next question.
Could you motivate using different p-cutoffs for different plates? The reason I'm asking is because at p<=0.05 2 of my 6 plates have no signal at all, while at p<=0.1 one of the plates gets Pseudomonas everywhere. Since different plates have different backgrounds, maybe it makes sense that the cutoffs are different as well. But it would be nice to have some motivation other than "We picked the p-values that looked pretty"!
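For context on the mechanics (not the statistics), the genus-level, per-plate setup being described might look roughly like this. A sketch only: the phyloseq object `ps` and its sample-data columns `run` and `is.neg` are assumed names, not anything from this thread.

```r
library(phyloseq)
library(decontam)

# `ps`: assumed phyloseq object with sample-data columns `run` (plate /
# sequencing run) and `is.neg` (TRUE for the water controls).
ps.gen <- tax_glom(ps, taxrank = "Genus")   # collapse ASVs to genus level

mat <- as(otu_table(ps.gen), "matrix")
if (taxa_are_rows(ps.gen)) mat <- t(mat)    # decontam expects samples in rows

runs   <- sample_data(ps.gen)$run
is_neg <- sample_data(ps.gen)$is.neg

# Score genera within each plate separately, then inspect the score
# distributions before committing to any single p-cutoff.
scores_by_run <- lapply(split(seq_len(nrow(mat)), runs), function(idx) {
  isNotContaminant(mat[idx, , drop = FALSE], neg = is_neg[idx], detailed = TRUE)
})
lapply(scores_by_run, function(d) summary(d$p))
```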
I think it's important to remember that contamination is a heterogeneous issue; that is, there are multiple possible sources of contaminants, some of which can have quite different statistical signatures. decontam works quite well to remove contaminants in some studies, but there are types of contaminants, such as cross-contamination or contaminants so prevalent that they appear in every sample, that it will not be as effective at identifying. Thus, I think using additional information based on your expert knowledge of the system is completely appropriate, as long as it is appropriately described and justified.
And on a related note: although the prevalence method score can be interpreted as a p-value, the frequency score cannot, and because deviations from the assumptions of our contaminant model can exist, we recommend that the scores from either method generally be interpreted as scores useful for classification, not as p-values (i.e. not as guarantees on a Type 1 error rate).