decontam
decontam copied to clipboard
probable bug plot_frequency
Recently I have been exploring the plot_frequency()
-function in a bit more depth and I found two issues:
- Sometimes the function yields an error when the
neg
-parameter is set (this does not occur withneg
= NULL (default)).
Example (based on tutorial):
ps <- readRDS(system.file("extdata", "MUClite.rds", package="decontam"))
sample_data(ps)$is.neg <- sample_data(ps)$Sample_or_Control == "Control Sample"
plot_frequency(ps, taxa_names(ps)[c(1, 2)], conc = "quant_reading", neg = "is.neg")
# Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# NA/NaN/Inf in 'y'
I found out this has to do with a few lines of code in the plot_frequency()
-function, namely:
...
for (tax in names(mod_melts)) {
newdata <- data.frame(logc = logc, taxa = tax, DNA_conc = exp(logc))
freq <- mod_melts[[tax]]$taxon_abundance
conc <- mod_melts[[tax]]$DNA_conc
df <- data.frame(logc = log(conc), logf = log(freq))
df <- df[!neg | is.na(neg), ]
df <- df[freq > 0, ]
...
In essence, the freq
-object is similar in length to the number of rows in mod_melts
(reflecting all samples/controls). This is also the basis for df
.
However, the following two lines puzzle me, as first the negative controls are excluded from df
(reducing the number of rows), after which the - original length - freq
-object is used for further selection (freq > 0
). These dimensions do not match and will introduce NA-rows, resulting in the error above.
I figured this does not happen when samples in the original phyloseq are ordered in such a way that all samples are followed by all controls.
To test this idea I reordered the original phyloseq:
ps_reorder <- ps
otu_table(ps_reorder) <- otu_table(ps_reorder)[order(sample_data(ps_reorder)$is.neg),] # controls last
plot_frequency(ps_reorder, taxa_names(ps_reorder)[c(1, 2)], conc = "quant_reading", neg = "is.neg")
# no error
- Following up on 1., I found that setting the
neg
-parameter results in model behaviour that deviates from the idea ofis.Contaminant
. As far as I understood from the paper, the frequency-method uses all density information regardless whether a sample is a biological sample/control sample. However, when setting theneg
-parameter, control samples are excluded from the model (e.g. the red line only represents a fit between DNA concentration and frequency in biological samples). This is not the case when not setting theneg
-parameter.
Example (continuing from above):
p1 <- plot_frequency(ps_reorder, taxa_names(ps_reorder)[c(1, 2)], conc = "quant_reading", neg = "is.neg")
# Does not include blanks for models; does not align with is.Contaminant()-function
p2 <- plot_frequency(ps_reorder, taxa_names(ps_reorder)[c(1, 2)], conc = "quant_reading")
# Does include blanks for models
cowplot::plot_grid(p1, p2, ncol = 1, align = "v", axis = "lr")
Compare the red line between upper (neg
-parameter set) and lower (neg
-parameter not set) plots. It is slightly different, showing that only in the case of p2
the blanks were included in the modelling (as I believe should be the case, given the idea of is.Contaminant()
).
Curious to hear your thoughts on this and happy to help adjust if needed.