Rcpp icon indicating copy to clipboard operation
Rcpp copied to clipboard

Rcpp sugar indexHash (unique, sort_unique, etc.) distinguish signed and unsigned zeros

Open SebKrantz opened this issue 1 year ago • 2 comments

Hi Dirk, sorry if this is not exactly as much effort as you expect for an issue, I just wanted to flag something reported to collapse (#648), which is present in both my hash functions written in C and your hash functions, and that is the following:

library(collapse)
#> Warning: package 'collapse' was built under R version 4.3.3
#> collapse 2.0.17, see ?`collapse-package` or ?`collapse-documentation`

x = round(rnorm(100))
unique(x)               # R
#> [1]  1  0  2 -1 -2
funique(x)              # My hash function in C
#> [1]  1  0  0  2 -1 -2
funique(x, sort = TRUE) # Rcpp::sugar::sort_unique()
#> [1] -2 -1  0  0  1  2
# More explicit proof
collapse:::sortuniqueCpp(x)
#> [1] -2 -1  0  0  1  2

# The solution
y = x + 0L

funique(y)              
#> [1]  1  0  2 -1 -2
collapse:::sortuniqueCpp(y)
#> [1] -2 -1  0  1  2

Created on 2024-10-31 with reprex v2.0.2

In words: R functions like round() create signed and unsigned zeros, whose hashes differ. A quite efficient remedy is to add an integer zero (gives like a 3% slower execution on my very efficient C hash). I'm considering to roll this out, but of course cannot control your code. So just pushing it to you as food for thought.

SebKrantz avatar Oct 31 '24 10:10 SebKrantz

Thanks for the note. We will ponder this, Making a change may risk upsetting existing code, but documenting your workaround of adding an explicit zero seems like a good idea.

eddelbuettel avatar Oct 31 '24 12:10 eddelbuettel

I'd be surprised if adding a zero breaks any code - except for this particular issue which is clearly not meant to be as far as R is concerned. But yeah, haven't checked yet what base R does to avoid sign bits with zeros - probably something smarter but slower...

SebKrantz avatar Oct 31 '24 13:10 SebKrantz

This issue is stale (365 days without activity) and will be closed in 31 days unless new activity is seen. Please feel free to re-open it is still a concern, possibly with additional data.

github-actions[bot] avatar Nov 01 '25 02:11 github-actions[bot]

For reference, this is what R does: https://github.com/wch/r-source/blob/3f7e2528990351bc4b0d1f1b237714668ab4c0c5/src/main/unique.c#L136-L138

Enchufa2 avatar Nov 01 '25 08:11 Enchufa2

We could try that.

eddelbuettel avatar Nov 01 '25 11:11 eddelbuettel

Adding zero also works and is faster, e.g.: https://github.com/SebKrantz/collapse/compare/master...zero_dups

However, I haven't implemented this either yet, as the speed penalty is noticeable...

But yes, we should probably do this.

SebKrantz avatar Nov 01 '25 15:11 SebKrantz

I wonder if there is a way to do this using bitwise operations – that would probably be the cleanest and fastest way.

SebKrantz avatar Nov 01 '25 16:11 SebKrantz

I am no compiler expert but looking at your diff I would suspect the 'empty' addition might get optimised away?

Edit: No it doesn't. See below.

eddelbuettel avatar Nov 01 '25 16:11 eddelbuettel

No idea, but it works on clang (Mac).

SebKrantz avatar Nov 01 '25 16:11 SebKrantz

Yup.

#include <cmath>
#include <cstdio>

double fixup(double d) {
    return d + 0.0;
}

double fixup2(double d) {
  return (d == 0.0 ? 0.0 : d);
}

int main(int argc, char *argv[]) {
  double zn = 0.0 * -1.0;
  double zp = +0.0;
  if (zn == zp) std::puts("equal");
  if (std::signbit(zn) == std::signbit(zp)) std::puts("signed bit equal");
  if (fixup(zn) == fixup(zp)) std::puts("fixup equal");
  if (std::signbit(fixup(zn)) == std::signbit(fixup(zp))) std::puts("fixup2 signed bit equal");
  if (fixup2(zn) == fixup2(zp)) std::puts("fixup2 equal");
  if (std::signbit(fixup2(zn)) == std::signbit(fixup2(zp))) std::puts("fixup2 signed bit equal");
  exit(0);
}

yielding

$ g++ -O3 -Wall -pedantic -o zero zero.cpp
$ ./zero 
equal
fixup equal
fixup2 signed bit equal
fixup2 equal
fixup2 signed bit equal
$ 

Works even on real computers. Both ways. I like the second way better.

eddelbuettel avatar Nov 01 '25 16:11 eddelbuettel

It turns out we are trying to do the right thing:

https://github.com/RcppCore/Rcpp/blob/9f07c76116b47ff1f0e7e11e72f7a4876a3c721e/inst/include/Rcpp/hash/IndexHash.h#L214-L217

but then this part messes things up:

https://github.com/RcppCore/Rcpp/blob/9f07c76116b47ff1f0e7e11e72f7a4876a3c721e/inst/include/Rcpp/hash/IndexHash.h#L170

because that comparison doesn't fix the sign. Patch coming.

Enchufa2 avatar Nov 01 '25 19:11 Enchufa2