mia
mia copied to clipboard
Fix critical issue with mergeFeatures
Ping #419 @antagomir @TuomasBorman
- [X] The default value of rank in mergeFeaturesByRank should be NULL
- [X] Add na.rm parameter to mergeFeatures
- [X] Change the default to onRankOnly=TRUE in mergeFeaturesByRank
- [X] Respect the rowData order
TODO
- [ ] check other functions if they have similar sorting issues
- [ ] Add fix on reviews from this PR
up
up - @Daenarys8
This does not yet take into account the row order. In mergeRows, the order of data is in alphabetical order. In agglomerateByRanks the order follows the order of rowData. As discussed, the order based on rowData might be the best, but the implementation is harder without big advantages.
So I suggest that we modify agglomerateByRank so that it outputs the data in alphabetical order. I think that is the most straightforward and requires only one additional line in the end of the function.
alphabetical order
by single line, are we referring to? x <- sort(x)
To recap:
1
Behavior of functions that are using agglomerateByRank internally is unexpected. Why? Because if user does not provide rank, the data is agglomerated to the highest rank by default. However, that should not be the case. For example, agglomerateByPrevalence should not do rank-agglomeration by default.
I think our approach is now incorrect. Instead of changing the default value of rank in agglomerateByRank to NULL, we should modify agglomerateByPrevalence.
-->
In agglomererateByPrevalence, catch rank parameter from ...
, If rank is specified, call agglomerateByRank.
mergeRows and mergeCols drop off those instances that have NA values in grouping variable. That is because grouping vector is converted to factor. To include instances with NA values as "NA group", NA values could be converted to "NA". -->
this behavior should be controlled somehow, na.rm
parameter could be perfect for that, but we should check if na.rm is used somewhere else in the function (fed with ...
to some other function)
The default value of onRankOnly in agglomerateByRank should be TRUE by default.
In the end of the agglomerateByRank function, order the data alphabetically based on rownames.
Sorry for hassle. I think most of the points are not yet implemented.
Can you run examples of Leo that initiated this https://github.com/microbiome/mia/issues/419 and print the output to here?
Also bump the versions and add NEWS
To confirm fix, we will recall the example @antagomir used to raise the issue #419 . Let's first prepare example data.
library(mia)
data(GlobalPatterns, package="mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, assay.type="counts", method="relabundance")
Here we merge features by prevalence, and return the result at the family level.
nrow(mergeFeaturesByPrevalence(tse, rank="Family", assay.type="relabundance", detection = 0.5/100, prevalence = 20/100))
[1] 21
Here we merge features first by Family level grouping, then by prevalence.
altExp(tse, "Family") <- mergeFeaturesByRank(tse, rank="Family")
nrow(mergeFeaturesByPrevalence(altExp(tse, "Family"), assay.type="relabundance", detection = 0.5/100, prevalence = 20/100))
[1] 21
Same happens when we treat Family as a group, rather thank rank:
altExp(tse, "Family2") <- mergeFeatures(tse, f="Family")
nrow(mergeFeaturesByPrevalence(altExp(tse, "Family2"), assay.type="relabundance", detection = 0.5/100, prevalence = 20/100))
[1] 21
Finally, checking the prevalences manually yields the same numbers.
sum(rowMeans(assay(altExp(tse, "Family"), "relabundance") > 0.5/100) > 0.2)
[1] 20
sum(rowMeans(assay(altExp(tse, "Family2"), "relabundance") > 0.5/100) > 0.2)
[1] 20
Thanks!
The last two examples yield a different number (20) - shouldn't it be the same?
Can we have these in unit tests, to make sure it will remain stable also in future releases?
I believe mergeFeaturesByPrevalence() creates a group called others. "others" have all the features under the thresholds. This might explain the behavior.
Can you @Daenarys8 check and create unit tests. They should explain this behavior. For instance if you compare these, you could add comment "other group was removed. It is added by agglomerateByPrevalence function to collect features under threshold." or something like that so that we know in the future what is happening and that it is desired behavior
The code should also show how altExp(tse, "Family") was created. There are different options (onRankOnly etc).
data(GlobalPatterns, package="mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, assay.type="counts", method="relabundance")
Other group not present
altExp(tse, "Family") <- agglomerateByRank(tse, rank="Family")
altExp(tse, "Family1") <- agglomerateByRank(tse, rank="Family", onRankOnly = TRUE)
altExp(tse, "Family2") <- agglomerateByRank(tse, rank="Family", onRankOnly = FALSE)
altExp(tse, "Family3") <- agglomerateByRank(tse, rank="Family", onRankOnly = TRUE, na.rm = TRUE)
altExp(tse, "Family4") <- agglomerateByRank(tse, rank="Family", onRankOnly = TRUE, na.rm = FALSE)
altExp(tse, "Family5") <- agglomerateByVariable(tse, f="Family", MARGIN = 'row')
In the following, other group is added by agglomerateByPrevalence function to collect features under threshold
actual <- agglomerateByPrevalence(tse, rank="Family", assay.type="relabundance",
detection = 0.5/100, prevalence = 20/100)
actual0 <- agglomerateByPrevalence(altExp(tse, "Family"), assay.type="relabundance",
detection = 0.5/100, prevalence = 20/100)
actual1 <- agglomerateByPrevalence(altExp(tse, "Family1"), assay.type="relabundance",
detection = 0.5/100, prevalence = 20/100)
actual2 <- agglomerateByPrevalence(altExp(tse, "Family2"), assay.type="relabundance",
detection = 0.5/100, prevalence = 20/100)
actual3 <- agglomerateByPrevalence(altExp(tse, "Family3"), assay.type="relabundance",
detection = 0.5/100, prevalence = 20/100)
actual4 <- agglomerateByPrevalence(altExp(tse, "Family4"), assay.type="relabundance",
detection = 0.5/100, prevalence = 20/100)
actual5 <- agglomerateByPrevalence(altExp(tse, "Family5"), assay.type="relabundance",
detection = 0.5/100, prevalence = 20/100)
> nrow(actual)
[1] 21
> nrow(actual1)
[1] 21
> nrow(actual2)
[1] 27
actual2 create groups based on the full taxonomic hierarchy up to family level while the previous creates the factor group with groups at the family level.
> nrow(actual3)
[1] 20
> nrow(actual4)
[1] 21
> nrow(actual5)
[1] 21
>
> sum(rowMeans(assay(altExp(tse, "Family"), "relabundance") > 0.5/100) > 0.2)
[1] 20
> sum(rowMeans(assay(altExp(tse, "Family1"), "relabundance") > 0.5/100) > 0.2)
[1] 20
> sum(rowMeans(assay(altExp(tse, "Family2"), "relabundance") > 0.5/100) > 0.2)
[1] 26
> sum(rowMeans(assay(altExp(tse, "Family3"), "relabundance") > 0.5/100) > 0.2)
[1] 19
> sum(rowMeans(assay(altExp(tse, "Family4"), "relabundance") > 0.5/100) > 0.2)
[1] 20
> sum(rowMeans(assay(altExp(tse, "Family5"), "relabundance") > 0.5/100) > 0.2)
[1] 20
>