
Using quantile-quantile plots to identify threshold values (question and suggestion for documentation enhancement)

Open JacobRPrice opened this issue 3 years ago • 0 comments

Awesome job on decontam!

In the discussion section, under *Choice of classification threshold*, it's suggested that:

> Another useful visualization is a quantile-quantile plot of scores versus the uniform distribution.

Looking at the documentation, I don't think an example of this is provided to assist with interpretation. It might be useful to provide an example, as well as the rationale behind threshold selection, especially for those with "weird" looking P* histograms.
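To make the request concrete, here is a minimal sketch of how the points of such a plot could be computed. It is not taken from the decontam documentation: it assumes the scores have already been exported from the `isContaminant()` output as a plain list of numbers, and the function name `uniform_qq_points` and the example scores are hypothetical.

```python
import math


def uniform_qq_points(scores):
    """Pair each sorted score with its expected Uniform(0, 1) quantile.

    `scores` is assumed to be a plain list of decontam scores (hypothetical
    export); entries that are None or NaN are dropped before ranking.
    """
    observed = sorted(
        s for s in scores if s is not None and not math.isnan(s)
    )
    n = len(observed)
    # Plotting positions (i - 0.5) / n for i = 1..n, a common convention
    # for the theoretical quantiles of a uniform distribution.
    theoretical = [(i - 0.5) / n for i in range(1, n + 1)]
    return list(zip(theoretical, observed))


# Hypothetical scores, for illustration only.
example_scores = [0.9, 0.1, float("nan"), 0.5]
for expected, seen in uniform_qq_points(example_scores):
    print(f"expected={expected:.3f}  observed={seen:.3f}")
```

The resulting pairs can be handed to any plotting tool. One common way to read the plot: if the scores were uniformly distributed, the points would fall on the diagonal, so a run of points below the diagonal at the low end indicates more low scores than uniformity would predict. Whether the place where the curve rejoins the diagonal makes a good threshold is exactly the interpretation guidance I'm hoping the documentation could give.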

For example, running isContaminant() on a dataset I'm working with yields the following histogram:

[Image: decontam_score_histogram]

I believe the odd shape is an artifact of having only 3 blank samples (versus 96 real samples), together with quite a few ASVs that have prevalence = 2; those ASVs make up the largest bin, located just above 0.50.

Zooming in on the lower half of the P* range gives:

[Image: decontam_score_histogram_zoom]

No clear break point stands out as an optimal threshold. Thankfully, these sequences make up only a tiny fraction of the dataset in terms of total read abundance.

All data: [Image: decontam_score_histogram_Abund]

Just the lower range: [Image: decontam_score_histogram_Abund_zoom]
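The same comparison can be put in numbers. The sketch below tallies how many features, and what share of total reads, would be flagged at a few candidate thresholds. It assumes a feature is flagged when its score is strictly below the threshold, and that scores and per-feature read totals are available as parallel lists; the function name `fraction_reads_flagged` and the example values are hypothetical, not my real data.

```python
def fraction_reads_flagged(scores, abundances, thresholds):
    """Map each threshold to (features flagged, share of total reads flagged).

    Assumes a feature is flagged when its score is strictly below the
    threshold. Features with a missing score (None or NaN) are never flagged.
    """
    total_reads = sum(abundances)
    summary = {}
    for threshold in thresholds:
        # NaN compares False against any threshold, so it is skipped here too.
        flagged = [
            reads
            for score, reads in zip(scores, abundances)
            if score is not None and score < threshold
        ]
        share = sum(flagged) / total_reads if total_reads else 0.0
        summary[threshold] = (len(flagged), share)
    return summary


# Hypothetical scores and read totals, for illustration only.
scores = [0.02, 0.30, 0.55, 0.90]
abundances = [10, 40, 450, 500]
for threshold, (count, share) in fraction_reads_flagged(
    scores, abundances, [0.1, 0.5]
).items():
    print(f"threshold={threshold}: {count} features, {share:.1%} of reads")
```

A table like this shows how sensitive the share of reads removed is to the choice of threshold, which is useful when the histogram itself offers no obvious cut.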

We are planning on carrying out differential abundance testing as well as testing for differences in alpha diversity, so I'm hesitant to go overboard in terms of filtering/preprocessing. Any suggestions on how I could approach this would be greatly appreciated!

JacobRPrice, May 28 '21 17:05