mia
mia copied to clipboard
Critical issue with mergeFeaturesByPrevalence
Unexpected and potentially critical behavior with mergeFeaturesByPrevalence
.
Let's first prepare example data.
library(mia)
data(GlobalPatterns, package="mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, assay.type="counts", method="relabundance")
Here we merge features by prevalence, and return the result at the family level.
nrow(mergeFeaturesByPrevalence(tse, rank="Family", assay.type="relabundance", detection = 0.5/100, prevalence = 20/100))
#> 35
Here we merge features first by Family level grouping, then by prevalence.
altExp(tse, "Family") <- mergeFeaturesByRank(tse, rank="Family")
nrow(mergeFeaturesByPrevalence(altExp(tse, "Family"), assay.type="relabundance", detection = 0.5/100, prevalence = 20/100))
#>2
Same happens when we treat Family as a group, rather thank rank:
altExp(tse, "Family2") <- mergeFeatures(tse, f="Family")
nrow(mergeFeaturesByPrevalence(altExp(tse, "Family2"), assay.type="relabundance", detection = 0.5/100, prevalence = 20/100))
#>2
Interestingly, checking the prevalences manually yields even different numbers.. :-D
sum(rowMeans(assay(altExp(tse, "Family"), "relabundance") > 0.5/100) > 0.2)
#>26
sum(rowMeans(assay(altExp(tse, "Family2"), "relabundance") > 0.5/100) > 0.2)
#>19
All these approaches should be safely expected to yield the same result, I think. Failing to do so may lead to critical bugs in analyses.
Perhaps the reason is obvious but at first this seems very confusing.
To be investigated and fixed as a top priority.
I suggest that @ake123 checks and @TuomasBorman can review?
Good to first diagnose the issue and discuss the solution.
In both cases, the low-level features should be aggregated to the higher level (here, Family) before agglomerating the rare groups. This is one possible source for the behavior.
Also some additional unit tests will be warranted here.
One extra thing thing I noticed on the mergeFeaturesByPrevalence
is that it actually takes total row number but not the actual values. In the above example it can be seen that it has 35 but it should be only be 34.
nrow(mergeFeaturesByPrevalence(tse, rank="Family", assay.type="relabundance", detection = 0.5/100, prevalence = 20/100))
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
[1] 35
Also I checked with other ranks such as "Class" and all the numbers are equal except the one I mentioned above. All look good on other ranks and which I have checked such as "Class","Phylum". The numbers differ when it comes to "Family" and "Order" and still couldn't find out why and I keep on checking on the possible reasons.
In the helper function .agg_for_prevalence
I see the following
args[["na.rm"]] <- TRUE
This means any missing values in the data will be ignored during the calculation. i tested it by setting it to FALSE
and the number went down from 35 to 27.
Not sure where the NAs come fromsince the assay itself does not seem to have any NAs:
sum(is.na(assay(tse, "relabundance"))) 0
Usually good strategy is to calculate manually what is needed and then see where the differences are in each deviating case.
Here are the cases that the numbers fluctuate
First case is when the user includes na.rm = TRUE
for mergeFeaturesByRank
or na.rm=FALSE
for mergeFeaturesByPrevalence
nrow(mergeFeaturesByPrevalence(tse, rank="Family", assay.type="relabundance", detection = 0.5/100, prevalence = 20/100,na.rm=FALSE))
[1] 27
altExp(tse, "Family") <- mergeFeaturesByRank(tse, rank="Family",na.rm = TRUE)
> sum(rowMeans(assay(altExp(tse, "Family"), "relabundance") > 0.5/100) > 0.2)
[1] 19
altExp(tse, "Family2") <- mergeFeatures(tse, f="Family",na.rm = FALSE)
> sum(rowMeans(assay(altExp(tse, "Family2"), "relabundance") > 0.5/100) > 0.2)
[1] 19
Second case
Also when removing NA
from empty.fields in mergeFeaturesByRank
with signature = c(x = "SummarizedExperiment")
empty.fields = c(NA, "", " ", "\t", "-", "_")
@TuomasBorman what's your idea and suggestion.
This is a critical high-importance issue to sort out. I suggest that @TuomasBorman will prioritize to review this one with Jeba asap.
Great news, mergeFeaturesByPrevalence
is working as expected.
Prepare data
library(mia)
data(GlobalPatterns, package="mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, assay.type="counts", method="relabundance")
Calculate number of rows after merging with mergeFeaturesByPrevalence
tse1 <- mergeFeaturesByPrevalence(tse, rank="Family", assay.type="relabundance", detection = 0.5/100, prevalence = 20/100)
nrow(tse1)
#> 35
When we used mergeFeaturesByRank
+ mergeFeaturesByPrevalence
, we noticed that the number of rows do not match with the original one (35 vs 2). However, mergeFeaturesByRank
has also rank parameter which default value is the highest rank, in this case Kingdom.
altExp(tse, "Family") <- mergeFeaturesByRank(tse, rank="Family")
tse2 <- mergeFeaturesByPrevalence(altExp(tse, "Family"), assay.type="relabundance", detection = 0.5/100, prevalence = 20/100, rank = "Family")
nrow(tse2)
#> 35
We can also see that assays match between these two objects.
# Check assay between tse1 and tse3 --> should be the same
all( assay(tse1) == assay(tse3) )
--> mergeFeaturesByRank
+ mergeFeaturesByPrevalence
is working as expected. The result is the same.
But then we also noticed that there is problem with mergeFeatures
+ mergeFeaturesByPrevalence
.
altExp(tse, "Family2") <- mergeFeatures(tse, f="Family")
tse3 <- mergeFeaturesByPrevalence(altExp(tse, "Family2"), assay.type="relabundance", detection = 0.5/100, prevalence = 20/100, rank = "Family")
nrow(tse3)
#>36
If we do the same trick as we did to solve the first issue, the number of rows is 36. There is one additional feature. Why?
mergeFeatures
works differently than mergeFeaturesByRank
. The latter finds the first available higher rank to agglomerate the feature, but mergeFeatures just finds all the unique groups in Family level and merges into these groups. Moreover, it completely drops off NA group (features that have NA value in Family level).
# The number of rows differ. If feature is NA in Family level, it is completely dropped off.
altExp(tse, "Family")
altExp(tse, "Family2")
This is why these sums differ:
sum(rowMeans(assay(altExp(tse, "Family"), "relabundance") > 0.5/100) > 0.2)
#>26
sum(rowMeans(assay(altExp(tse, "Family2"), "relabundance") > 0.5/100) > 0.2)
#>19
But when we agglomerate to a level that has no NA values such as Phylum in this case, the result is same.
sum(rowMeans(assay(agglomerateByRank(tse, rank="Phylum"), "relabundance") > 0.5/100) > 0.2)
#>10
sum(rowMeans(assay(mergeFeatures(tse, f="Phylum"), "relabundance") > 0.5/100) > 0.2)
#>10
Things to discuss:
- Should the default value of
rank
inmergeFeaturesByRank
be the highest rank? I think it should be no taxonomy rank agglomeration, i.e.NULL
. - Should
mergeFeatures
return also NA group (the features that has no grouping value)? I think there should bena.rm
parameter, because user might want to exclude features with no group. The default could be that NA group is included.
-
I agree that the default should be no agglomeration. Otherwise this can lead to confusing results.
-
Agreed.
Note: The mergeFeaturesByRank
has also the argument onRankOnly
which limits agglomeration to the specified rank only (not looking at higher-level ranks in the case of NAs). If this is FALSE, then the results won't be comparable to mergeFeatures
in a general case, even if the NA
group is added to mergeFeatures
:
altExp(tse, "Family3") <- mergeFeaturesByRank(tse, rank="Family", onRankOnly=TRUE)
altExp(tse, "Family3b") <- mergeFeaturesByRank(tse, rank="Family", onRankOnly=FALSE)
c(nrow(altExp(tse, "Family2")),
nrow(altExp(tse, "Family3")),
nrow(altExp(tse, "Family3b")))
# [1] 334 334 603
By default onRankOnly=FALSE
. This is potentially useful since it helps to keep potentially valuable data in, and the user can make a conscious decision to discard those by setting this to TRUE
. On the other hand, the default behavior is now something else than what the function name says (because it returns also other features than ones on the specific rank) and it leads to results that differ from those obtained by other similar means (e.g. mergeFeatures
).
Therefore, one more thing to discuss:
- Should we change the default to
onRankOnly=TRUE
? I tend to think that this would be useful for consistency reasons. The examples could be developed to make clear that both options are available.
Yet another observation.
The order of rows is returned differently in mergeFeatures
(Family2) and mergeFeaturesByRank
(Family3), although the data is otherwise the same:
Here, rownames do not match but the same rownames are in both:
> all(rownames(altExp(tse, "Family2")) == rownames(altExp(tse, "Family3")))
[1] FALSE
> all(rownames(altExp(tse, "Family2")) %in% rownames(altExp(tse, "Family3")))
[1] TRUE
> all(rownames(altExp(tse, "Family3")) %in% rownames(altExp(tse, "Family2")))
[1] TRUE
Also the assay values match when we make sure the rows are in the same order:
> all(assay(altExp(tse, "Family2")) == assay(altExp(tse, "Family3")))
[1] FALSE
> all(assay(altExp(tse, "Family2")[o,]) == assay(altExp(tse, "Family3")[o,]))
[1] TRUE
The
Results from mergeFeatures
are in alphabetical order:
all(rownames(altExp(tse, "Family2"))== sort(rownames(altExp(tse, "Family2"))))
TRUE
The results from mergeFeaturesByRank
are in the same order than in the original rowData, when the NA group is removed:
all(rownames(altExp(tse, "Family3"))== unique(rowData(tse)$Family)[-1])
[1] TRUE
It would be beneficial to harmonize the sorting logic.
The additional thing to discuss:
- Shall we order alphabetically or respect the rowData order? The latter would feel more natural but it might have some issues when
onRankOnly=FALSE
, for instance. On the other hand we do not have clear examples of any real issues with that, it could be possible to just use the rowData order for these cases and later reconsider this shall needs arise. - Shall we review if some other functions have similar sorting issues? I tend to think that it would be good to check on the same go, to maintain internal logic.
Up
if i gather correctly, there are about 4 indications on the changes to be made.
- [ ] The default value of
rank
inmergeFeaturesByRank
should beNULL
- [ ] Add
na.rm
parameter tomergeFeatures
- [ ] Change the default to
onRankOnly=TRUE
inmergeFeaturesByRank
- [ ] Respect the rowData order
Yes, and then if you can also check other functions if they have similar sorting issues
Status of this @Daenarys8 ?
This was fixed