ryu
ryu copied to clipboard
Improvement: Compare with more algorithms in benchmark
see Milo Yips / C++ double-to-string conversion benchmark https://github.com/miloyip/dtoa-benchmark
We'd first have to support fixed-precision output, which the code does not right now (although it shouldn't be hard to add, see also #27).
Generally speaking, this is not a benchmark repository. Maybe it would be better to add a benchmark for Ryu over there, rather than adding a lot of benchmarks here.
Maybe it would be better to add a benchmark for Ryu over there, rather than adding a lot of benchmarks here.
i asked milo about adding your function to his benchmark suite - maybe you can add a reference to his tests (when integrated)
I benchmarked my to_chars() implementation powered by Ryu (which differs from upstream in that it's bounds-checked, follows sprintf() conventions for exponent style, and pays additional overhead for chars_format and centralized INF/NAN logic; Ryu itself is strictly faster) against dtoa_milo(). Ryu is 70% to 90% faster, and additionally rounds correctly. See: https://www.reddit.com/r/cpp/comments/9fiko6/fmt_version_52_released_with_up_to_5x_speedups/e5xmve6/
Perhaps this issue should be resolved, since this is indeed not a benchmark repository and it doesn't seem that doing anything else here would be a valuable use of time? While it appears that double-conversion is not the fastest possible Grisu implementation, it has the advantage of matching Ryu's correctly rounded output, and Ryu indeed appears to be the fastest in the world by a wide margin, so the choice of comparison algorithm seems unimportant (except for "promotional" purposes). In my optimization efforts, I've focused on improving Ryu's absolute performance; the relative performance (especially against sprintf()) is merely entertaining rather than useful.
I think here should be sets of parameterized benchmarks , at least to track progress through changes and drive experiments.
Jsoniter-scala (one of first adopters of Ryu) has a growing set of benchmarks from the beginning. It helped a lot to drive improvements using different profilers and to spot regressions on new JVMs. Here are results from the latest run:
https://plokhotnyuk.github.io/jsoniter-scala/
E. g. a lot of mitigations where done to avoid 3x times slowdown regression with Graal due missing compiler optimization for division by constant:
https://github.com/oracle/graal/issues/593
Take look to my dtoa-benchmark fork.
$ git clone https://github.com/leo-yuriev/dtoa-benchmark
$ cd dtoa-benchmark
$ cmake . && cmake --build .
C/C++ results sorted by RMS:
$ ./d2a-benchmark | grep 'Bench' | LC_ALL=C sort -k 1.70
Benchmarking randomdigit null ... [min 1.500ns, rms 1.500ns, max 1.500ns]
Benchmarking randomdigit erthink ... [min 22.700ns, rms 43.333ns, max 58.500ns]
Benchmarking randomdigit ryu ... [min 41.200ns, rms 55.637ns, max 65.700ns]
Benchmarking randomdigit milo ... [min 34.400ns, rms 58.060ns, max 71.800ns]
Benchmarking randomdigit emyg ... [min 36.100ns, rms 58.945ns, max 72.800ns]
Benchmarking randomdigit floaxie ... [min 27.100ns, rms 71.038ns, max 94.200ns]
Benchmarking randomdigit grisu2 ... [min 74.700ns, rms 90.063ns, max 107.900ns]
Benchmarking randomdigit doubleconv ... [min 100.500ns, rms 139.102ns, max 165.800ns]
Benchmarking randomdigit fmt ... [min 117.000ns, rms 147.098ns, max 171.800ns]
Benchmarking randomdigit fpconv ... [min 99.900ns, rms 149.201ns, max 173.100ns]
Benchmarking randomdigit sprintf ... [min 804.900ns, rms 902.667ns, max 980.300ns]
Benchmarking randomdigit ostrstream ... [min 1192.300ns, rms 1292.424ns, max 1375.400ns]
Benchmarking randomdigit ostringstream ... [min 1249.000ns, rms 1358.195ns, max 1451.100ns]
$ gcc --version
gcc (Ubuntu 8.3.0-6ubuntu1~18.04) 8.3.0
...
@leo-yuriev could you, please, update your charts for different number of digits to see results of ryu and erthink in them?
@plokhotnyuk, today I spent some time to Refine the benchmark to make it easier and more convenient to get a results (I pushed those changes for now). And noticed that in different modes of the benchmark results 'ryu' and 'erthink' are significantly different.
RandomDigit: Generates 1000 random double values, filtered out +/-inf and nan. Then convert them to limited precision (1 to 17 decimal digits in significand). Finally convert these numbers into ASCII.
| Function | Min ns | RMS ns | Max ns | Sum ns | Speedup |
|---|---|---|---|---|---|
| erthink | 21.0 | 183.412 | 60.0 | 733.5 | 21.00 |
| ryu | 42.1 | 237.773 | 74.7 | 966.6 | 15.94 |
| milo | 33.5 | 257.925 | 77.4 | 1046.7 | 14.72 |
| floaxie | 27.6 | 304.451 | 96.9 | 1198.3 | 12.86 |
| emyg | 38.4 | 296.706 | 86.3 | 1204.1 | 12.79 |
| grisu2 | 71.2 | 369.677 | 107.5 | 1515.9 | 10.16 |
| doubleconv | 90.7 | 513.139 | 167.8 | 2085.4 | 7.39 |
| fmt | 95.7 | 523.818 | 150.4 | 2148.8 | 7.17 |
| fpconv | 99.2 | 649.942 | 180.6 | 2650.6 | 5.81 |
| sprintf | 828.5 | 3740.842 | 965.3 | 15405.2 | 1.00 |
| ostrstream | 1217.2 | 5354.421 | 1359.8 | 22062.9 | 0.70 |
| ostringstream | 1274.9 | 5628.034 | 1434.8 | 23186.3 | 0.66 |
The "sequental" testcase: Actually it converts 10000 numbers with random mantissa for each decimal exponent in range 0...16. So, I think we can treats it as a random numbers with some known distribution/bias of an exponent.
| Function | Min ns | RMS ns | Max ns | Sum ns | Speedup |
|---|---|---|---|---|---|
| ryu | 14.7 | 90.250 | 25.8 | 367.8 | 21.48 |
| erthink | 18.0 | 135.354 | 46.4 | 543.9 | 14.53 |
| milo | 21.0 | 149.139 | 46.8 | 604.0 | 13.08 |
| emyg | 21.1 | 149.400 | 46.7 | 604.9 | 13.06 |
| grisu2 | 55.0 | 285.169 | 81.0 | 1168.3 | 6.76 |
| floaxie | 32.8 | 300.912 | 95.8 | 1189.6 | 6.64 |
| doubleconv | 54.5 | 319.500 | 106.3 | 1297.9 | 6.09 |
| fmt | 81.7 | 413.348 | 114.7 | 1696.6 | 4.66 |
| fpconv | 90.4 | 539.581 | 156.5 | 2179.6 | 3.62 |
| sprintf | 230.0 | 2020.639 | 709.7 | 7901.0 | 1.00 |
| ostrstream | 585.7 | 3429.470 | 1062.9 | 13880.7 | 0.57 |
| ostringstream | 571.1 | 3480.388 | 1130.1 | 14048.8 | 0.56 |
So, my preliminary findings are:
- erthink is faster when value could be represented by 12 or less digits (I checked this by changing range of this loop);
- ryu is faster in the all other cases;
P.S. The "RMS" in tables actually is a sqrt(sum of squares), i.e. has not normalized for number of values..