A couple of algorithms for your consideration
First, what I've been calling the Dr Jacques method:
#include <climits>

extern int random_bit();   // supplies one fresh random bit per call

static int m = 1, r = 0;   // persistent state: r is uniform on [0, m)

int bounded_rand(int n) {
    const int N = INT_MAX >> 1;
    int q, d;
    for (;;) {
        while (m < N) {
            // needs refactoring to consume 32-bit chunks
            r = (r << 1) | random_bit();
            m <<= 1;
        }
        q = m / n;
        if (r < q * n)
            break;
        r -= n * q;   // reject: r was uniform on [q*n, m), so renormalise
        m -= n * q;
    }
    d = r % n;
    r /= n;           // recycle the leftover entropy: r is now uniform on [0, q)
    m = q;
    return d;
}
This is an algorithm which doesn't do much to avoid division, but it is frugal with its use of the RNG: because the leftover entropy is recycled through r and m, it consumes close to the information-theoretic minimum of log2(n) random bits per result on average. If you include a CSRNG in the pool of generators to test against, then this may win.
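The refactoring flagged in the comment above might look something like this: a sketch only, with the state widened to 64 bits so a whole 32-bit word can be shifted in at once. I've borrowed the rng_t convention from the code below, and bounded_rand_chunked, m64 and r64 are names of my own invention:

#include <cstdint>

static uint64_t m64 = 1, r64 = 0;   // same invariant: r64 is uniform on [0, m64)

uint32_t bounded_rand_chunked(rng_t& rng, uint32_t n) {
    for (;;) {
        while (m64 < (uint64_t(1) << 32)) {
            r64 = (r64 << 32) | rng();   // top up a whole word at a time
            m64 <<= 32;
        }
        uint64_t q = m64 / n;
        if (r64 < q * n) {               // accept
            uint32_t d = uint32_t(r64 % n);
            r64 /= n;                    // leftover entropy, uniform on [0, q)
            m64 = q;
            return d;
        }
        r64 -= q * n;                    // reject and renormalise
        m64 -= q * n;
    }
}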
The other is my own:
// rng_t: any generator returning a uniformly distributed uint32_t per call.
static uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    uint64_t m = uint64_t(range) << 31; // rounding offset; zero is fine too.
    for (int i = 0; i < 3; ++i) {
        uint32_t x = rng();
        // one word of long multiplication: the high word of m is the carry in
        m = uint64_t(x) * uint64_t(range) + (m >> 32);
    }
    return m >> 32;
}
This is biased, but the bias is extremely small: each additional 32-bit word shrinks it by a further factor of 2^32, so with three words it is bounded by roughly range/2^96. Change the 3 to whatever you need. And it's branch-free on top of being division-free, provided the loop is unrolled.
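For what it's worth, unrolled it is just three copies of the loop body back to back; a sketch, with bounded_rand_unrolled being my name for it:

static uint32_t bounded_rand_unrolled(rng_t& rng, uint32_t range) {
    uint64_t m = uint64_t(range) << 31;                // rounding offset, as above
    m = uint64_t(rng()) * uint64_t(range) + (m >> 32); // word 1
    m = uint64_t(rng()) * uint64_t(range) + (m >> 32); // word 2
    m = uint64_t(rng()) * uint64_t(range) + (m >> 32); // word 3
    return m >> 32;
}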
If you analyse it you'll see it's simply a long-multiplication version of the standard multiply-and-shift bounded_rand(): since range is only a 32-bit value there's only one carry path to follow, and the rest of the result is discarded on the fly.
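For comparison, the single-word multiply-and-shift it generalises looks like this (bounded_rand_1word is my name for it; it maps x to floor(x * range / 2^32)):

static uint32_t bounded_rand_1word(rng_t& rng, uint32_t range) {
    uint32_t x = rng();
    return uint32_t((uint64_t(x) * uint64_t(range)) >> 32);
}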
Fun fact: it's also the same construction as Marsaglia's multiply-with-carry generator. I believe there's an analogue of the lag-r MWC construction which should use fewer calls to rng(), if anybody cared about that.
Actually, I think lag-2 is just this?
static uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    static uint32_t saved = 0x80000000u;            // low word carried over from the previous call
    uint64_t m = uint64_t(saved) * uint64_t(range); // low partial product
    uint32_t x = rng();
    m = uint64_t(x) * uint64_t(range) + (m >> 32);  // high partial product plus carry
    saved = uint32_t(m);                            // keep the low word for next time
    return m >> 32;
}
The bits in saved are shifted down 32 bits every call, which should wash out the effect of multiplies pushing bias up from the bottom, right?
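If anyone wants to poke at that empirically, a throwaway harness along these lines would do. It's a sketch: rng_t is a stand-in wrapping std::mt19937, the lag-2 routine above is repeated verbatim so the snippet is self-contained, and at any feasible trial count the residual bias is far too small to show above sampling noise, so this only confirms gross uniformity:

#include <cstdint>
#include <cstdio>
#include <random>

struct rng_t {   // stand-in generator, just for this test
    std::mt19937 gen{12345};
    uint32_t operator()() { return uint32_t(gen()); }
};

static uint32_t bounded_rand(rng_t& rng, uint32_t range) {
    static uint32_t saved = 0x80000000u;
    uint64_t m = uint64_t(saved) * uint64_t(range);
    uint32_t x = rng();
    m = uint64_t(x) * uint64_t(range) + (m >> 32);
    saved = uint32_t(m);
    return m >> 32;
}

int main() {
    rng_t rng;
    const uint32_t range = 6;
    const uint64_t trials = 60000000;
    uint64_t counts[6] = {};
    for (uint64_t i = 0; i < trials; ++i)
        ++counts[bounded_rand(rng, range)];
    for (uint32_t k = 0; k < range; ++k)
        std::printf("%u: %llu (expected %llu)\n",
                    k, (unsigned long long)counts[k],
                    (unsigned long long)(trials / range));
    return 0;
}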