mia icon indicating copy to clipboard operation
mia copied to clipboard

Critical issue with mergeFeaturesByPrevalence

Open antagomir opened this issue 1 year ago • 13 comments

Unexpected and potentially critical behavior with mergeFeaturesByPrevalence.

Let's first prepare example data.

library(mia)
data(GlobalPatterns, package="mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, assay.type="counts", method="relabundance")

Here we merge features by prevalence, and return the result at the family level.

nrow(mergeFeaturesByPrevalence(tse, rank="Family", assay.type="relabundance", detection  = 0.5/100, prevalence  = 20/100))
#> 35

Here we merge features first by Family level grouping, then by prevalence.

altExp(tse, "Family") <- mergeFeaturesByRank(tse, rank="Family")
nrow(mergeFeaturesByPrevalence(altExp(tse, "Family"), assay.type="relabundance", detection  = 0.5/100, prevalence  = 20/100))
#>2

Same happens when we treat Family as a group, rather thank rank:

altExp(tse, "Family2") <- mergeFeatures(tse, f="Family")
nrow(mergeFeaturesByPrevalence(altExp(tse, "Family2"), assay.type="relabundance", detection  = 0.5/100, prevalence  = 20/100))
#>2

Interestingly, checking the prevalences manually yields even different numbers.. :-D

sum(rowMeans(assay(altExp(tse, "Family"), "relabundance") > 0.5/100) > 0.2)
#>26

sum(rowMeans(assay(altExp(tse, "Family2"), "relabundance") > 0.5/100) > 0.2)
#>19

All these approaches should be safely expected to yield the same result, I think. Failing to do so may lead to critical bugs in analyses.

Perhaps the reason is obvious but at first this seems very confusing.

To be investigated and fixed as a top priority.

antagomir avatar Aug 12 '23 09:08 antagomir