ryu Investigate faster performance than upstream implementation

As observed in https://github.com/dtolnay/ryu/pull/34#issue-496690273, the 32-bit and 64-bit float-to-string benchmark in this repo is something like 40% faster than the exact same benchmark in https://github.com/ulfjack/ryu. We need to examine whether there is any inadvertent discrepancy between the two benchmarks, difference in build configuration, or any toolchain difference that would account for this. The code is a line-by-line transliteration so there should really be no performance difference, and certainly not this big.

Oct 02 '20 06:10 dtolnay

I've bisected it and found the culprit (https://github.com/ulfjack/ryu/commit/b94f5d00890051c25a5f7467f554e10a5b09a930), apparently they changed the default benchmark mode from "classic" to "inverted". To use the classic mode you need to add -classic to the options, such as

bazel run -c opt //ryu/benchmark:ryu_benchmark -- -classic

I personally see 3 options:

Add both benchmarks modes to this repo
Change from classic mode to inverted mode
Change instructions to run the upstream benchmark with the classic option

Nov 30 '20 04:11 pheki

https://github.com/ulfjack/ryu/pull/72#issue-207751594

Normally, the benchmark generates samples in an outer loop, and then it tests each sample in an inner loop for iterations. This allows mean and variance statistics to be collected for each sample, which is useful because some samples exercise different codepaths.

However, repeatedly testing each sample causes Ryu's branches to be highly predictable, which is unrealistic for the real world.

The new "-invert" option generates samples into a std::vector (to keep the Mersenne Twister out of the profiling). Then it has an outer loop for iterations, and its inner loop tests each sample once. This has realistic branch (mis)prediction characteristics.

Note that when inverting, we divide the t2 - t1 duration (which is the total time taken to convert all samples in the vector) by samples, so the delta is the average time taken to convert a sample. This means that the printed Average values are comparable between normal mode and inverted mode (and the difference shows how costly branch misprediction is). The printed Stddev values aren't comparable, though - in normal mode, they measure the variation between samples (which is nonzero, even for an ideal machine, because Ryu does different amounts of work for different inputs), while in inverted mode, they measure the variation between each vector loop (which would ideally be zero for a perfectly deterministic machine).

The justification for taking the formerly-"inverted" mode as the default is compelling -- it hits realistic branch misprediction characteristics.

I would accept a PR to implement both modes (matching upstream) and require opt-in for "classic" mode (matching upstream).

Nov 30 '20 04:11 dtolnay