Rcpp sugar indexHash (unique, sort_unique, etc.) distinguish signed and unsigned zeros
Hi Dirk, sorry if this is not exactly as much effort as you expect for an issue, I just wanted to flag something reported to collapse (#648), which is present in both my hash functions written in C and your hash functions, and that is the following:
library(collapse)
#> Warning: package 'collapse' was built under R version 4.3.3
#> collapse 2.0.17, see ?`collapse-package` or ?`collapse-documentation`
x = round(rnorm(100))
unique(x) # R
#> [1] 1 0 2 -1 -2
funique(x) # My hash function in C
#> [1] 1 0 0 2 -1 -2
funique(x, sort = TRUE) # Rcpp::sugar::sort_unique()
#> [1] -2 -1 0 0 1 2
# More explicit proof
collapse:::sortuniqueCpp(x)
#> [1] -2 -1 0 0 1 2
# The solution
y = x + 0L
funique(y)
#> [1] 1 0 2 -1 -2
collapse:::sortuniqueCpp(y)
#> [1] -2 -1 0 1 2
Created on 2024-10-31 with reprex v2.0.2
In words: R functions like round() create signed and unsigned zeros, whose hashes differ. A quite efficient remedy is to add an integer zero (gives like a 3% slower execution on my very efficient C hash). I'm considering to roll this out, but of course cannot control your code. So just pushing it to you as food for thought.
Thanks for the note. We will ponder this, Making a change may risk upsetting existing code, but documenting your workaround of adding an explicit zero seems like a good idea.
I'd be surprised if adding a zero breaks any code - except for this particular issue which is clearly not meant to be as far as R is concerned. But yeah, haven't checked yet what base R does to avoid sign bits with zeros - probably something smarter but slower...
This issue is stale (365 days without activity) and will be closed in 31 days unless new activity is seen. Please feel free to re-open it is still a concern, possibly with additional data.
For reference, this is what R does: https://github.com/wch/r-source/blob/3f7e2528990351bc4b0d1f1b237714668ab4c0c5/src/main/unique.c#L136-L138
We could try that.
Adding zero also works and is faster, e.g.: https://github.com/SebKrantz/collapse/compare/master...zero_dups
However, I haven't implemented this either yet, as the speed penalty is noticeable...
But yes, we should probably do this.
I wonder if there is a way to do this using bitwise operations – that would probably be the cleanest and fastest way.
I am no compiler expert but looking at your diff I would suspect the 'empty' addition might get optimised away?
Edit: No it doesn't. See below.
No idea, but it works on clang (Mac).
Yup.
#include <cmath>
#include <cstdio>
double fixup(double d) {
return d + 0.0;
}
double fixup2(double d) {
return (d == 0.0 ? 0.0 : d);
}
int main(int argc, char *argv[]) {
double zn = 0.0 * -1.0;
double zp = +0.0;
if (zn == zp) std::puts("equal");
if (std::signbit(zn) == std::signbit(zp)) std::puts("signed bit equal");
if (fixup(zn) == fixup(zp)) std::puts("fixup equal");
if (std::signbit(fixup(zn)) == std::signbit(fixup(zp))) std::puts("fixup2 signed bit equal");
if (fixup2(zn) == fixup2(zp)) std::puts("fixup2 equal");
if (std::signbit(fixup2(zn)) == std::signbit(fixup2(zp))) std::puts("fixup2 signed bit equal");
exit(0);
}
yielding
$ g++ -O3 -Wall -pedantic -o zero zero.cpp
$ ./zero
equal
fixup equal
fixup2 signed bit equal
fixup2 equal
fixup2 signed bit equal
$
Works even on real computers. Both ways. I like the second way better.
It turns out we are trying to do the right thing:
https://github.com/RcppCore/Rcpp/blob/9f07c76116b47ff1f0e7e11e72f7a4876a3c721e/inst/include/Rcpp/hash/IndexHash.h#L214-L217
but then this part messes things up:
https://github.com/RcppCore/Rcpp/blob/9f07c76116b47ff1f0e7e11e72f7a4876a3c721e/inst/include/Rcpp/hash/IndexHash.h#L170
because that comparison doesn't fix the sign. Patch coming.