ryu Improvement: Compare with more algorithms in benchmark

trafficstars

see Milo Yips / C++ double-to-string conversion benchmark https://github.com/miloyip/dtoa-benchmark

Jul 28 '18 06:07 LowLevelMahn

We'd first have to support fixed-precision output, which the code does not right now (although it shouldn't be hard to add, see also #27).

Generally speaking, this is not a benchmark repository. Maybe it would be better to add a benchmark for Ryu over there, rather than adding a lot of benchmarks here.

Jul 28 '18 13:07 ulfjack

Maybe it would be better to add a benchmark for Ryu over there, rather than adding a lot of benchmarks here.

i asked milo about adding your function to his benchmark suite - maybe you can add a reference to his tests (when integrated)

Jul 28 '18 17:07 LowLevelMahn

I benchmarked my to_chars() implementation powered by Ryu (which differs from upstream in that it's bounds-checked, follows sprintf() conventions for exponent style, and pays additional overhead for chars_format and centralized INF/NAN logic; Ryu itself is strictly faster) against dtoa_milo(). Ryu is 70% to 90% faster, and additionally rounds correctly. See: https://www.reddit.com/r/cpp/comments/9fiko6/fmt_version_52_released_with_up_to_5x_speedups/e5xmve6/

Perhaps this issue should be resolved, since this is indeed not a benchmark repository and it doesn't seem that doing anything else here would be a valuable use of time? While it appears that double-conversion is not the fastest possible Grisu implementation, it has the advantage of matching Ryu's correctly rounded output, and Ryu indeed appears to be the fastest in the world by a wide margin, so the choice of comparison algorithm seems unimportant (except for "promotional" purposes). In my optimization efforts, I've focused on improving Ryu's absolute performance; the relative performance (especially against sprintf()) is merely entertaining rather than useful.

Sep 13 '18 23:09 StephanTLavavej

I think here should be sets of parameterized benchmarks , at least to track progress through changes and drive experiments.

Jsoniter-scala (one of first adopters of Ryu) has a growing set of benchmarks from the beginning. It helped a lot to drive improvements using different profilers and to spot regressions on new JVMs. Here are results from the latest run:

https://plokhotnyuk.github.io/jsoniter-scala/

E. g. a lot of mitigations where done to avoid 3x times slowdown regression with Graal due missing compiler optimization for division by constant:

https://github.com/oracle/graal/issues/593

Sep 14 '18 02:09 plokhotnyuk

Take look to my dtoa-benchmark fork.

$ git clone https://github.com/leo-yuriev/dtoa-benchmark
$ cd dtoa-benchmark
$ cmake . && cmake --build .

C/C++ results sorted by RMS:

$ ./d2a-benchmark | grep 'Bench' | LC_ALL=C sort -k 1.70
Benchmarking randomdigit null                 ... [min    1.500ns, rms    1.500ns, max    1.500ns]
Benchmarking randomdigit erthink              ... [min   22.700ns, rms   43.333ns, max   58.500ns]
Benchmarking randomdigit ryu                  ... [min   41.200ns, rms   55.637ns, max   65.700ns]
Benchmarking randomdigit milo                 ... [min   34.400ns, rms   58.060ns, max   71.800ns]
Benchmarking randomdigit emyg                 ... [min   36.100ns, rms   58.945ns, max   72.800ns]
Benchmarking randomdigit floaxie              ... [min   27.100ns, rms   71.038ns, max   94.200ns]
Benchmarking randomdigit grisu2               ... [min   74.700ns, rms   90.063ns, max  107.900ns]
Benchmarking randomdigit doubleconv           ... [min  100.500ns, rms  139.102ns, max  165.800ns]
Benchmarking randomdigit fmt                  ... [min  117.000ns, rms  147.098ns, max  171.800ns]
Benchmarking randomdigit fpconv               ... [min   99.900ns, rms  149.201ns, max  173.100ns]
Benchmarking randomdigit sprintf              ... [min  804.900ns, rms  902.667ns, max  980.300ns]
Benchmarking randomdigit ostrstream           ... [min 1192.300ns, rms 1292.424ns, max 1375.400ns]
Benchmarking randomdigit ostringstream        ... [min 1249.000ns, rms 1358.195ns, max 1451.100ns]

 $ gcc --version
gcc (Ubuntu 8.3.0-6ubuntu1~18.04) 8.3.0
...

May 31 '19 12:05 erthink

@leo-yuriev could you, please, update your charts for different number of digits to see results of ryu and erthink in them?

May 31 '19 13:05 plokhotnyuk

@plokhotnyuk, today I spent some time to Refine the benchmark to make it easier and more convenient to get a results (I pushed those changes for now). And noticed that in different modes of the benchmark results 'ryu' and 'erthink' are significantly different.

RandomDigit: Generates 1000 random double values, filtered out +/-inf and nan. Then convert them to limited precision (1 to 17 decimal digits in significand). Finally convert these numbers into ASCII.

Function	Min ns	RMS ns	Max ns	Sum ns	Speedup
erthink	21.0	183.412	60.0	733.5	21.00
ryu	42.1	237.773	74.7	966.6	15.94
milo	33.5	257.925	77.4	1046.7	14.72
floaxie	27.6	304.451	96.9	1198.3	12.86
emyg	38.4	296.706	86.3	1204.1	12.79
grisu2	71.2	369.677	107.5	1515.9	10.16
doubleconv	90.7	513.139	167.8	2085.4	7.39
fmt	95.7	523.818	150.4	2148.8	7.17
fpconv	99.2	649.942	180.6	2650.6	5.81
sprintf	828.5	3740.842	965.3	15405.2	1.00
ostrstream	1217.2	5354.421	1359.8	22062.9	0.70
ostringstream	1274.9	5628.034	1434.8	23186.3	0.66

The "sequental" testcase: Actually it converts 10000 numbers with random mantissa for each decimal exponent in range 0...16. So, I think we can treats it as a random numbers with some known distribution/bias of an exponent.

Function	Min ns	RMS ns	Max ns	Sum ns	Speedup
ryu	14.7	90.250	25.8	367.8	21.48
erthink	18.0	135.354	46.4	543.9	14.53
milo	21.0	149.139	46.8	604.0	13.08
emyg	21.1	149.400	46.7	604.9	13.06
grisu2	55.0	285.169	81.0	1168.3	6.76
floaxie	32.8	300.912	95.8	1189.6	6.64
doubleconv	54.5	319.500	106.3	1297.9	6.09
fmt	81.7	413.348	114.7	1696.6	4.66
fpconv	90.4	539.581	156.5	2179.6	3.62
sprintf	230.0	2020.639	709.7	7901.0	1.00
ostrstream	585.7	3429.470	1062.9	13880.7	0.57
ostringstream	571.1	3480.388	1130.1	14048.8	0.56

So, my preliminary findings are:

erthink is faster when value could be represented by 12 or less digits (I checked this by changing range of this loop);
ryu is faster in the all other cases;

P.S. The "RMS" in tables actually is a sqrt(sum of squares), i.e. has not normalized for number of values..

May 31 '19 21:05 erthink

ryu ryu copied to clipboard

Improvement: Compare with more algorithms in benchmark

ryu
ryu copied to clipboard