fastrand icon indicating copy to clipboard operation
fastrand copied to clipboard

Performance optimizations

Open piegamesde opened this issue 3 years ago • 7 comments

Randomly consistently shows up as a major bottleneck in my application. My basic problem is that I want to retrieve a lot of random elements of a Vec. I noticed a few things and would like to discuss them:

  • I did something like index = if len < 65536 { rng.u16(0..len) } else { rng.usize(0..len) } and it did seem to help with the performance.
  • I don't actually need the lower range bound, and in a lot of cases I need the full range of the value space anyways. Would adding explicit API for these use case increase performance? (I don't really know how RNGs work internally)

piegamesde avatar Mar 12 '21 22:03 piegamesde

I did something like index = if len < 65536 { rng.u16(0..len) } else { rng.usize(0..len) } and it did seem to help with the performance.

This is probably because the algorithm only generates u32s, and generating a usize (which I assume is a u64 for you) requires generating two u32s. My PR #14 switches to Wyrand, which generates u64s so shouldn't have this problem. It is also faster in other ways.

I don't actually need the lower range bound, and in a lot of cases I need the full range of the value space anyways. Would adding explicit API for these use case increase performance?

I ran a benchmark and a hypothetical full-range API has the exact same performance as calling the function with a .. range, so you can just use that.

SabrinaJewson avatar Mar 13 '21 13:03 SabrinaJewson

If I understand you correctly, this means that generating u16 is faster than usize (maybe two times?), but as fast as generating u32. I could get twice as fast by splitting the u32 into two u16.

If this is true, this also means that using repeat_with(|| rng.u8(..)).take(10_000).collect(); to generate random byte vectors could be made 4 times faster by generating u32 and transmuting them to a byte slice.

piegamesde avatar Mar 13 '21 14:03 piegamesde

Yes, I think that's all correct with the current implementation. It is two times faster to generate a u32 and then split it than to generate two u16s.

You don't need to transmute at all in your iterator. On beta and nightly (and stable in 12 days) you can write:

repeat_with(|| rng.u32(..))
    .flat_map(|v| array::IntoIter::new(v.to_ne_bytes()))
    .take(10_000)
    .collect()

or on stable until then:

repeat_with(|| rng.u32(..).to_ne_bytes())
    .flat_map(|bytes| (0..4).map(move |i| bytes[i]))
    .take(10_000)
    .collect()

SabrinaJewson avatar Mar 13 '21 15:03 SabrinaJewson

One would have to check if the iterator implementations compile down to a transmute or something similarly fast. Also, maybe it's worth adding a convenience method for bulk-filling random data array? (Not sure about how the API should look like though.)

Wasting half (or more) of the generated random values is something I think should easily be avoidable by storing the unused random bytes for future usage instead of throwing them away. Using a separate store for u8, u16 (and later on, u32) should make the branches easily predictable. Does this crate make any guarantees about the values that a certain seed returns across versions?

piegamesde avatar Mar 13 '21 16:03 piegamesde

Wasting half (or more) of the generated random values is something I think should easily be avoidable by storing the unused random bytes for future usage instead of throwing them away.

I don't really think this is worth it. Especially if #14 gets merged the actual generation becomes really cheap, and caching values would require both adding branches and increasing the size of the RNG. Also, I don't know of any other RNG implementations that do this.

Does this crate make any guarantees about the values that a certain seed returns across versions?

It's not explicitly specified anywhere, but I would assume not, otherwise there would never be any possibility to change the RNG algorithm.

SabrinaJewson avatar Mar 13 '21 16:03 SabrinaJewson

The new Rng algorithm is indeed faster, but not by orders of magnitude (about 15% maybe?). Interestingly, generating usize is still slower than u8/u16 for some reason.

piegamesde avatar Mar 13 '21 20:03 piegamesde

I have a theory the performance discrepancy might be due to way the code is written not the library. There is a branch in the middle of what looks like a hot loop and this can slow things down surprisingly. If there was only one code path the CPU can work ahead on computing the random index. With the if statement the CPU has to predict which path it will take to stay at full speed. If it predicts wrong it will slow down. Only generating usizes would likely help with performance also considering WyRand generates u64s internally.

ironhaven avatar Nov 21 '21 21:11 ironhaven