mia Fix critical issue with mergeFeatures

Ping #419 @antagomir @TuomasBorman

[X] The default value of rank in mergeFeaturesByRank should be NULL
[X] Add na.rm parameter to mergeFeatures
[X] Change the default to onRankOnly=TRUE in mergeFeaturesByRank
[X] Respect the rowData order

TODO

[ ] check other functions if they have similar sorting issues
[ ] Add fix on reviews from this PR

Nov 30 '23 15:11 Daenarys8

up

Mar 07 '24 11:03 antagomir

up - @Daenarys8

Apr 02 '24 16:04 antagomir

This does not yet take into account the row order. In mergeRows, the order of data is in alphabetical order. In agglomerateByRanks the order follows the order of rowData. As discussed, the order based on rowData might be the best, but the implementation is harder without big advantages.

So I suggest that we modify agglomerateByRank so that it outputs the data in alphabetical order. I think that is the most straightforward and requires only one additional line in the end of the function.

Apr 10 '24 11:04 TuomasBorman

alphabetical order

by single line, are we referring to? x <- sort(x)

May 06 '24 13:05 Daenarys8

To recap:

1

Behavior of functions that are using agglomerateByRank internally is unexpected. Why? Because if user does not provide rank, the data is agglomerated to the highest rank by default. However, that should not be the case. For example, agglomerateByPrevalence should not do rank-agglomeration by default.

I think our approach is now incorrect. Instead of changing the default value of rank in agglomerateByRank to NULL, we should modify agglomerateByPrevalence.

-->

In agglomererateByPrevalence, catch rank parameter from ..., If rank is specified, call agglomerateByRank.

mergeRows and mergeCols drop off those instances that have NA values in grouping variable. That is because grouping vector is converted to factor. To include instances with NA values as "NA group", NA values could be converted to "NA". -->

this behavior should be controlled somehow, na.rm parameter could be perfect for that, but we should check if na.rm is used somewhere else in the function (fed with ... to some other function)

The default value of onRankOnly in agglomerateByRank should be TRUE by default.

In the end of the agglomerateByRank function, order the data alphabetically based on rownames.

May 06 '24 18:05 TuomasBorman

Sorry for hassle. I think most of the points are not yet implemented.

May 06 '24 18:05 TuomasBorman

Can you run examples of Leo that initiated this https://github.com/microbiome/mia/issues/419 and print the output to here?

May 21 '24 05:05 TuomasBorman

Also bump the versions and add NEWS

May 21 '24 05:05 TuomasBorman

To confirm fix, we will recall the example @antagomir used to raise the issue #419 . Let's first prepare example data.

library(mia)
data(GlobalPatterns, package="mia")
tse <- GlobalPatterns
tse <- transformAssay(tse, assay.type="counts", method="relabundance")

Here we merge features by prevalence, and return the result at the family level.

nrow(mergeFeaturesByPrevalence(tse, rank="Family", assay.type="relabundance", detection  = 0.5/100, prevalence  = 20/100))
[1] 21

Here we merge features first by Family level grouping, then by prevalence.

altExp(tse, "Family") <- mergeFeaturesByRank(tse, rank="Family")
nrow(mergeFeaturesByPrevalence(altExp(tse, "Family"), assay.type="relabundance", detection  = 0.5/100, prevalence  = 20/100))
[1] 21

Same happens when we treat Family as a group, rather thank rank:

altExp(tse, "Family2") <- mergeFeatures(tse, f="Family")
nrow(mergeFeaturesByPrevalence(altExp(tse, "Family2"), assay.type="relabundance", detection  = 0.5/100, prevalence  = 20/100))
[1] 21

Finally, checking the prevalences manually yields the same numbers.

sum(rowMeans(assay(altExp(tse, "Family"), "relabundance") > 0.5/100) > 0.2)
[1] 20

sum(rowMeans(assay(altExp(tse, "Family2"), "relabundance") > 0.5/100) > 0.2)
[1] 20

May 21 '24 06:05 Daenarys8

Thanks!

The last two examples yield a different number (20) - shouldn't it be the same?

Can we have these in unit tests, to make sure it will remain stable also in future releases?

May 21 '24 07:05 antagomir

I believe mergeFeaturesByPrevalence() creates a group called others. "others" have all the features under the thresholds. This might explain the behavior.

Can you @Daenarys8 check and create unit tests. They should explain this behavior. For instance if you compare these, you could add comment "other group was removed. It is added by agglomerateByPrevalence function to collect features under threshold." or something like that so that we know in the future what is happening and that it is desired behavior

May 21 '24 10:05 TuomasBorman

The code should also show how altExp(tse, "Family") was created. There are different options (onRankOnly etc).

May 21 '24 10:05 antagomir

   data(GlobalPatterns, package="mia")
    tse <- GlobalPatterns
    tse <- transformAssay(tse, assay.type="counts", method="relabundance")

Other group not present

    altExp(tse, "Family") <- agglomerateByRank(tse, rank="Family")
    altExp(tse, "Family1") <- agglomerateByRank(tse, rank="Family", onRankOnly = TRUE)
    altExp(tse, "Family2") <- agglomerateByRank(tse, rank="Family", onRankOnly = FALSE)
    altExp(tse, "Family3") <- agglomerateByRank(tse, rank="Family", onRankOnly = TRUE, na.rm = TRUE)
    altExp(tse, "Family4") <- agglomerateByRank(tse, rank="Family", onRankOnly = TRUE, na.rm = FALSE)
    altExp(tse, "Family5") <- agglomerateByVariable(tse, f="Family", MARGIN = 'row')

In the following, other group is added by agglomerateByPrevalence function to collect features under threshold

    actual <- agglomerateByPrevalence(tse, rank="Family", assay.type="relabundance", 
                                         detection  = 0.5/100, prevalence  = 20/100)
    actual0 <- agglomerateByPrevalence(altExp(tse, "Family"), assay.type="relabundance", 
                                       detection  = 0.5/100, prevalence  = 20/100)
    actual1 <- agglomerateByPrevalence(altExp(tse, "Family1"), assay.type="relabundance", 
                                       detection  = 0.5/100, prevalence  = 20/100)
    actual2 <- agglomerateByPrevalence(altExp(tse, "Family2"), assay.type="relabundance", 
                                       detection  = 0.5/100, prevalence  = 20/100)
    actual3 <- agglomerateByPrevalence(altExp(tse, "Family3"), assay.type="relabundance", 
                                       detection  = 0.5/100, prevalence  = 20/100)
    actual4 <- agglomerateByPrevalence(altExp(tse, "Family4"), assay.type="relabundance", 
                                       detection  = 0.5/100, prevalence  = 20/100)
    actual5 <- agglomerateByPrevalence(altExp(tse, "Family5"), assay.type="relabundance", 
                                       detection  = 0.5/100, prevalence  = 20/100)

> nrow(actual)
[1] 21
> nrow(actual1)
[1] 21
> nrow(actual2)
[1] 27

actual2 create groups based on the full taxonomic hierarchy up to family level while the previous creates the factor group with groups at the family level.

> nrow(actual3)
[1] 20
> nrow(actual4)
[1] 21
> nrow(actual5)
[1] 21
>

> sum(rowMeans(assay(altExp(tse, "Family"), "relabundance") > 0.5/100) > 0.2)
[1] 20
> sum(rowMeans(assay(altExp(tse, "Family1"), "relabundance") > 0.5/100) > 0.2)
[1] 20
> sum(rowMeans(assay(altExp(tse, "Family2"), "relabundance") > 0.5/100) > 0.2)
[1] 26
> sum(rowMeans(assay(altExp(tse, "Family3"), "relabundance") > 0.5/100) > 0.2)
[1] 19
> sum(rowMeans(assay(altExp(tse, "Family4"), "relabundance") > 0.5/100) > 0.2)
[1] 20
> sum(rowMeans(assay(altExp(tse, "Family5"), "relabundance") > 0.5/100) > 0.2)
[1] 20
>

May 22 '24 09:05 Daenarys8

mia mia copied to clipboard

Fix critical issue with mergeFeatures

mia
mia copied to clipboard