rowDuplicated() and rowAnyDuplicated()

Open karoliskoncevicius opened this issue 5 years ago • 4 comments

This issue is a question / feature request.

Do you think it would make sense to add functions like duplicated() and anyDuplicated() optimized to work on every row/column to this package?
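
For illustration, the requested behaviour could be sketched in base R roughly as follows. The names `rowDuplicated()` and `rowAnyDuplicated()` are hypothetical here (they are the functions being proposed, not existing matrixStats functions), and this loop-based version makes no attempt at the optimization the request asks for.

```r
# Hypothetical helpers: these names are NOT part of matrixStats.
# Plain base R, applying duplicated()/anyDuplicated() to each row.

rowDuplicated <- function(x) {
  stopifnot(is.matrix(x))
  # vapply() returns one column per row of x, so refill row-wise
  res <- vapply(seq_len(nrow(x)),
                function(i) duplicated(x[i, ]),
                logical(ncol(x)))
  matrix(res, nrow = nrow(x), ncol = ncol(x), byrow = TRUE)
}

rowAnyDuplicated <- function(x) {
  stopifnot(is.matrix(x))
  # Index of the first duplicate in each row, or 0 if there is none
  vapply(seq_len(nrow(x)),
         function(i) anyDuplicated(x[i, ]),
         integer(1))
}

m <- rbind(c(1L, 2L, 2L),
           c(3L, 4L, 5L))
rowDuplicated(m)     # row 1: FALSE FALSE TRUE; row 2: all FALSE
rowAnyDuplicated(m)  # 3 0
```

Column-wise versions would follow the same pattern on `x[, j]`.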

karoliskoncevicius, Jun 24 '19 23:06

I was looking for this today...

MLopez-Ibanez, Oct 11 '22 13:10

matrixStats is primarily intended for numerical operations on matrices, not dataframe-like operations such as duplicated(). Besides, it would only work reliably for integer matrices because double matrices suffer from floating point imprecision.

yaccos, Oct 11 '22 13:10

> matrixStats is primarily intended for numerical operations on matrices, not dataframe-like operations such as duplicated(). Besides, it would only work reliably for integer matrices because double matrices suffer from floating point imprecision.

It could have a tolerance parameter that defaults to sqrt(.Machine$double.eps) like all.equal(). There are many numerical operations where being able to detect duplicated vectors (or close to duplicated vectors) would be useful.
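
A minimal sketch of what such a tolerance could look like, assuming a simple absolute-difference comparison against all earlier elements. Note that `all.equal()` compares relative differences by default, so this is a simplification, and `duplicatedTol()` and `rowDuplicatedTol()` are hypothetical names. The comparison is quadratic in the row length and does not handle `NA`.

```r
# Hypothetical helper: flags v[j] as duplicated when some earlier
# element lies within `tol` of it (absolute difference).
duplicatedTol <- function(v, tol = sqrt(.Machine$double.eps)) {
  n <- length(v)
  out <- logical(n)
  for (j in seq_len(n)[-1]) {
    out[j] <- any(abs(v[j] - v[seq_len(j - 1)]) <= tol)
  }
  out
}

# Hypothetical row-wise wrapper around the helper above.
rowDuplicatedTol <- function(x, tol = sqrt(.Machine$double.eps)) {
  stopifnot(is.matrix(x))
  res <- vapply(seq_len(nrow(x)),
                function(i) duplicatedTol(x[i, ], tol),
                logical(ncol(x)))
  matrix(res, nrow = nrow(x), ncol = ncol(x), byrow = TRUE)
}

v <- c(0.3, 0.1 + 0.2, 0.5)
duplicated(v)     # FALSE FALSE FALSE, since 0.1 + 0.2 is not exactly 0.3
duplicatedTol(v)  # FALSE  TRUE FALSE
```

One caveat of any tolerance-based scheme is that "approximately equal" is not transitive, so the result can depend on element order.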

MLopez-Ibanez, Nov 20 '22 22:11

@yaccos I would not be so quick to agree that duplicated() is a data.frame-like operation. Sure, it works on entries of a data.frame, but it is also used to test whether a vector contains repeated values, and that is the use I have in mind here.

We can easily have matrices of counts or ranks. matrixStats itself has rowRanks() and rowCounts(). Checking for duplicates within rows/columns might then be necessary. Non-parametric tests such as the Mann-Whitney test are one example.
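
As a rough illustration of the Mann-Whitney case, the sketch below flags rows that contain ties and only requests an exact p-value from `wilcox.test()` for rows without them. The data and the `group` labels are made up for the example, and the per-row tie check uses plain `apply()` rather than any matrixStats function.

```r
# Made-up example data: two features (rows), six samples (columns)
x <- rbind(c(1.2, 3.4, 2.2, 5.1, 4.8, 6.0),
           c(1,   2,   2,   3,   4,   4))
group <- rep(c("a", "b"), each = 3)

# A row has ties if any value is repeated within it
hasTies <- apply(x, 1, anyDuplicated) > 0
hasTies  # FALSE TRUE

# Exact test only where there are no ties; normal approximation otherwise
pvals <- vapply(seq_len(nrow(x)), function(i) {
  wilcox.test(x[i, group == "a"], x[i, group == "b"],
              exact = !hasTies[i])$p.value
}, numeric(1))
```

A dedicated row-wise anyDuplicated() would replace the `apply()` call here.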

karoliskoncevicius, Feb 27 '23 20:02