Understanding NA scores and filtering requirements

Open HSapers opened this issue 4 years ago • 3 comments

Thank you for this package - I've been enjoying going through both the package documentation and the paper. I have 130 samples in three batches (three different extraction sets with three different sequencing runs). In each batch the ratio of controls to samples (C:S) is as follows: 4:43, 4:33, 1:23. The controls are both negative kit extractions and negative PCRs, with the exception of the last batch (filter extraction with no kit, only a PCR negative). I know there are likely not enough controls for robust identification of contaminant ASVs. I have two questions about interpreting the results:

  1. In the Analysis of Oral Mucosal 16S dataset supplement, there is a step for filtering out ASVs that were not observed in at least two samples. Won't this miss ASVs that are present in only one control and not in any samples? I don't want to filter the dataset like this since I have a lot of low abundance (<0.01%) taxa. I was filtering to identify contaminants and then subtracting those ASVs from the unfiltered samples. This predictably missed a number of obvious contaminants that were present in only one control (e.g. we had one kit control obviously contaminated with Methylobacterium and Legionella, but since these were present in only one control, they were missed). I guess this is symptomatic of not having a robust number of controls.

  2. When viewing the contaminant df, I noticed a number of ASVs did not have a p.prev value or a p value and were assigned FALSE. Is this where isContaminant and isNotContaminant function differently, the assumption being that these ASVs would also be assigned FALSE by isNotContaminant and therefore not treated as 'true ASVs' when using isNotContaminant in low biomass samples? Where a score was NA, prev was 2 in unfiltered data (though several ASVs with prev=2 did have scores); also in unfiltered data, ASVs with prev=1 had p.prev=NA. I think this is because with an ASV only observed in a single sample (S or C), there simply isn't enough information to generate a score? This gets back to my first question: what about ASVs that are observed with a high frequency but low prevalence, e.g. a contaminant with a high frequency in a single control?

Thank you

HSapers avatar Jun 28 '20 18:06 HSapers

In the Analysis of Oral Mucosal 16S dataset supplement, there is a step for filtering out ASVs that were not observed in at least two samples. Won't this miss ASVs that are present in only one control and not in any samples? I don't want to filter the dataset like this since I have a lot of low abundance (<0.01%) taxa. I was filtering to identify contaminants and then subtracting those ASVs from the unfiltered samples. This predictably missed a number of obvious contaminants that were present in only one control (e.g. we had one kit control obviously contaminated with Methylobacterium and Legionella, but since these were present in only one control, they were missed). I guess this is symptomatic of not having a robust number of controls.

Depending on your goals you may not want to perform the min-2-sample filtering. However, failing to remove taxa that are present only in controls is not obviously a problem to me, since any subsequent analysis of the real samples won't include those controls anyway.
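To make the two cases concrete, here is a minimal sketch in plain Python (not decontam itself, and not its API). The count table, sample names, and the `prevalence` helper are all invented for illustration. It shows which ASVs a min-2-sample filter would drop and which ASVs occur only in controls, and so disappear on their own once the control samples are removed.

```python
# Toy count table: samples map to ASV read counts (all values are invented).
counts = {
    "sample1": {"asv_a": 120, "asv_b": 0, "asv_c": 3},
    "sample2": {"asv_a": 95, "asv_b": 0, "asv_c": 0},
    "kit_neg": {"asv_a": 2, "asv_b": 40, "asv_c": 0},
}
is_control = {"sample1": False, "sample2": False, "kit_neg": True}


def prevalence(counts, names):
    """Hypothetical helper: number of the named samples in which each ASV is present."""
    asvs = sorted({asv for row in counts.values() for asv in row})
    return {asv: sum(1 for n in names if counts[n].get(asv, 0) > 0) for asv in asvs}


controls = [name for name, flag in is_control.items() if flag]
samples = [name for name, flag in is_control.items() if not flag]

prev_controls = prevalence(counts, controls)
prev_samples = prevalence(counts, samples)
prev_total = {asv: prev_controls[asv] + prev_samples[asv] for asv in prev_controls}

# ASVs that a "present in at least two samples" filter would remove.
dropped_by_filter = sorted(asv for asv, p in prev_total.items() if p < 2)

# ASVs seen in at least one control and in no true sample.
control_only = sorted(
    asv for asv in prev_total if prev_controls[asv] > 0 and prev_samples[asv] == 0
)

print(dropped_by_filter)
print(control_only)
```

In this toy table the control-only ASV has zero reads in every true sample, so it is absent from any analysis restricted to true samples whether or not it was ever flagged.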

When viewing the contaminant df, I noticed a number of ASVs did not have a p.prev value or a p value and were assigned FALSE. Is this where isContaminant and isNotContaminant function differently, the assumption being that these ASVs would also be assigned FALSE by isNotContaminant and therefore not treated as 'true ASVs' when using isNotContaminant in low biomass samples? Where a score was NA, prev was 2 in unfiltered data (though several ASVs with prev=2 did have scores); also in unfiltered data, ASVs with prev=1 had p.prev=NA. I think this is because with an ASV only observed in a single sample (S or C), there simply isn't enough information to generate a score? This gets back to my first question: what about ASVs that are observed with a high frequency but low prevalence, e.g. a contaminant with a high frequency in a single control?

Yes, isContaminant and isNotContaminant start with different assumptions (everything is real until proven otherwise for isContaminant, and vice versa for isNotContaminant), and thus taxa present in too few samples to build evidence one way or the other will default to the corresponding assumption. That's exactly the case for 1-sample ASVs, as you said: too little information to generate a score.
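The defaulting behaviour described above can be sketched in plain Python. This is an illustration of the idea only, not decontam's code: the function name `call_from_score`, the threshold value, and the "score below threshold" comparison are assumptions made for the example.

```python
def call_from_score(score, threshold):
    """Hypothetical helper: a positive call needs a score, and one below the threshold.

    A missing score (None here, NA in R) never yields a positive call.
    """
    return score is not None and score < threshold


# Invented scores: one ASV with enough data to be scored, one seen in a single sample.
scores = {"asv_many_samples": 0.02, "asv_single_sample": None}

# The threshold of 0.1 is an arbitrary value chosen for this example.
calls = {asv: call_from_score(score, threshold=0.1) for asv, score in scores.items()}

print(calls)
```

The unscored ASV gets FALSE either way, but the meaning differs: from an isContaminant-style call, FALSE leaves it among the real taxa, while from an isNotContaminant-style call, FALSE leaves it among the presumed contaminants.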

benjjneb avatar Jun 29 '20 18:06 benjjneb

Thank you very much. I try to track contaminants and keep a running df of contaminant ASVs between all extractions and sequencing runs - so I wasn't thinking that I would effectively remove control-only ASVs by removing those samples from the experiment set and then all ASV reads == 0. It didn't occur to me right away that the ASVs identified by Decontam wouldn't include these and I would have to add them in to get a complete list of potential contaminants - my fault for not thinking critically about what was being classified - of course it's only ASVs present in 'true samples'.

If I first filter my input data to remove all ASVs only present in controls, the relative abundance would then be recalculated. Would it be better to remove these after running decontam? I'm not sure of the implications of changing the relative abundance of ASVs in the controls - I guess the remaining ASVs would end up with consistently larger fractional abundances in the control samples (if ASVs only in controls are removed). This would only be consistent between ASVs in the same control sample and not between samples. I guess this really doesn't matter since the decontam score is based on presence/absence and not fractional abundance?

Thank you

HSapers avatar Jun 29 '20 23:06 HSapers

For decontam prevalence purposes it won't matter if you remove them before or after, for the reason you mentioned.

In principle, I would remove them all after.
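A small sketch of why the order does not affect a presence/absence based score, again in plain Python with invented counts and hypothetical helper names. Removing a control-only ASV from a control sample rescales the remaining fractions but leaves the presence/absence pattern of the retained ASVs untouched.

```python
def relative_abundance(row):
    """Hypothetical helper: convert raw counts to fractions of the sample total."""
    total = sum(row.values())
    return {asv: (count / total if total else 0.0) for asv, count in row.items()}


def presence(row):
    """Hypothetical helper: presence/absence of each ASV in one sample."""
    return {asv: count > 0 for asv, count in row.items()}


# One invented control sample; "asv_ctrl_only" stands for an ASV seen only in controls.
control = {"asv_a": 10, "asv_b": 30, "asv_ctrl_only": 60}
trimmed = {asv: count for asv, count in control.items() if asv != "asv_ctrl_only"}

before = relative_abundance(control)
after = relative_abundance(trimmed)

# Fractions of the retained ASVs grow after trimming (10/100 versus 10/40 for asv_a).
print(before["asv_a"], after["asv_a"])

# Presence/absence of the retained ASVs is identical before and after trimming.
same_presence = all(presence(control)[asv] == presence(trimmed)[asv] for asv in trimmed)
print(same_presence)
```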

benjjneb avatar Jun 29 '20 23:06 benjjneb