STL
STL copied to clipboard
`<charconv>`: Investigate using Dragonbox for `to_chars()` decimal shortest
Repo: https://github.com/jk-jeon/Grisu-Exact
Reddit posts: Aug 2020 and earlier Jan 2020.
It is a variant of Grisu2 a classic float-to-string conversion algorithm developed by Florian Loitsch, and is largely inspired by Ryu, another float-to-string conversion algorithm developed by Ulf Adams. The main strength of Grisu-Exact over Ryu is that it performs better for short numbers.
The algorithm's inventor and implementer, Junekey Jeon, has graciously licensed this code so that it's compatible with microsoft/STL (the bulk of the code is dual Apache 2.0 + LLVM / Boost; fp_to_chars.cpp is Apache 2.0 / Boost as it's derived from Ryu). :heart_eyes_cat:
FYI: there is even faster one (with almost same size of table). Please have a look at the implementation of Schubfach in this repository: https://github.com/abolz/Drachennest
Also see https://github.com/jk-jeon/dragonbox .
Just FYI.
After Dragonbox, I'm working on Ryu-printf now. I think the core routine is now working correctly, and it uses about 39KB of static data. Although 39KB is still huge, that's way better than 100KB. (I don't think I really understood what was the precise table compression scheme that Ulf came up with, so I have no idea why he ended up using more than 100KB, because what I did was just to follow the paper pretty straightforwardly.) But I didn't do any performance evaluation against the original implementation yet. It is possible that mine is much slower than the original one, though I honestly think that the performances should be not that different if I didn't do something stupid on the string generation part.
I also think it is possible to further compress the table by using more(!) bits per an entry, but at the cost of speed because then we need things like 192-bit modular operation and/or 320-bit multiplication or something like that. Maybe I can work on that later.
After Ryu-printf, I will work on arbitrary-precision string-to-float conversion, using the core of Ryu-printf. That should be not so difficult I guess.
Please look at the repo if you are interested. (It is not well furnished yet, sorry.) It would be great if you can give me some suggestion or advice or anything, if you have time. I mean, the core routine (binary-to-decimal conversion) is actually pretty simple once the algorithm is understood, but the string generation part with the correct rounding was in fact much more annoying. I think you also have worked on string generation part of the original Ryu-printf implementation, so may know better about it than me.
39KB is indeed amazing. Unfortunately I'm very busy at the moment trying to help finish our C++20 implementation before the end of 2020 - but if you haven't already, I recommend using our charconv test data to validate the correctness of your algorithm. See the files in https://github.com/microsoft/STL/tree/master/tests/std/tests/P0067R5_charconv - the files like double_scientific_precision_to_chars_test_cases_1.hpp, double_scientific_precision_to_chars_test_cases_2.hpp, etc. contain an extensive set of test cases, derived from Ulf's, that I developed when testing Ryu Printf (and which found a correctness bug in his early implementation). They're structured as flat tables so it should be very easy to adapt them to your code.
I think you also have worked on string generation part of the original Ryu-printf implementation, so may know better about it than me.
I adapted it to charconv's non-null-terminated, bounds-checked interface but didn't change the core structure of Ulf's code. Looking at your code I don't see anything that could be immediately improved, but I might be able to think of something with sufficient time to explore later.
@jk-jeon posted a new algorithm today, further improving table sizes: https://jk-jeon.github.io/posts/2022/12/fixed-precision-formatting/ and source code at https://github.com/jk-jeon/floff . Thanks again! :heart_eyes_cat:
(Commenting here so I'll remember this when I find time to overhaul <charconv>.)
Commenting for future investigation:
Cassio Neri has announced a new algorithm called 'teju', see https://schedule.cppnow.org/session/a-new-dragon-in-the-den-fast-conversion-from-floating-point-numbers/ and slides at https://schedule.cppnow.org/wp-content/uploads/2024/02/a-new-dragon-in-the-den.pdf . It's almost exactly 2x faster than Ryu, with table size the same as Dragonbox (i.e. slightly smaller than Ryu, IIRC). The "centered" benchmarks are the relevant ones, as the "uncentered" case is for extremely rare inputs.
A paper has not yet been published, and a reference implementation is not yet available. As usual, we would need either the Apache License v2.0 with LLVM Exception (preferred) or the Boost Software License in order to derive from any reference implementation.
Cassio Neri seems to have made the repository available at https://github.com/cassioneri/teju_jagua under Apache 2.0.
I ran the benchmark found in the repo https://github.com/cassioneri/teju_jagua, with small tweaks:
- I disabled the integer test as it took exceptionally long and I personally think it's rather uninteresting. Teju, or really any algorithm even some optimized variants of Grisu2, winning over Dragonbox on this is totally unsurprising because I intentionally did not add any special shortcuts for integers. In my humble opinion, I do not think integers are particularly common among generic floating-point data. Sure, integers are particularly important for application like JavaScript engine since JavaScript does not distinguish integers and floating-points, but I honestly think that's the only place where integers are really that common to the point of needing a special attention. Even for variety of JSON libraries for C++, I find that they usually do distinguish integers and floating-points internally, so benefit of having a dedicated shortcut for integers is questionable. Furthermore, such a shortcut can be added externally if needed. I expect that the additional overhead incurred by such an addition compared to the fully integrated shortcut should be close to nonexistent.
- I used the most recent commit (fe0e5fe9e934fadd1ee9cc36b20bcef44fe740dd) rather than v1.1.3 for Dragonbox. There have been some further optimizations after v1.1.3, but since I have not run the benchmark with v1.1.3, I do not know if it matters. But given the result below, probably it matters.
- I turned on AVX2. Again I did not test without it (I always turn it on because it's virtually universally available these days) so I do not know if it matters.
So here is the result I got with clang-cl:
[==========] Running 4 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 2 tests from float
[ RUN ] float.centred
teju (mean ) = 5.625 ns
(stddev) = 0.447 ns
(rel. ) = 1.000
dragonbox (mean ) = 5.139 ns
(stddev) = 0.964 ns
(rel. ) = 0.914
ryu (mean ) = 9.017 ns
(stddev) = 0.828 ns
(rel. ) = 1.603
[ OK ] float.centred (44330 ms)
[ RUN ] float.uncentred
teju (mean ) = 5.194 ns
(stddev) = 0.753 ns
(rel. ) = 1.000
dragonbox (mean ) = 5.246 ns
(stddev) = 0.422 ns
(rel. ) = 1.010
ryu (mean ) = 9.859 ns
(stddev) = 2.462 ns
(rel. ) = 1.898
[ OK ] float.uncentred (4419 ms)
[----------] 2 tests from float (48750 ms total)
[----------] 2 tests from double
[ RUN ] double.centred
teju (mean ) = 8.679 ns
(stddev) = 2.414 ns
(rel. ) = 1.000
dragonbox (mean ) = 8.381 ns
(stddev) = 2.808 ns
(rel. ) = 0.966
ryu (mean ) = 13.492 ns
(stddev) = 3.792 ns
(rel. ) = 1.555
[ OK ] double.centred (338688 ms)
[ RUN ] double.uncentred
teju (mean ) = 19.982 ns
(stddev) = 3.602 ns
(rel. ) = 1.000
dragonbox (mean ) = 19.515 ns
(stddev) = 3.702 ns
(rel. ) = 0.977
ryu (mean ) = 30.539 ns
(stddev) = 7.788 ns
(rel. ) = 1.528
[ OK ] double.uncentred (35224 ms)
[----------] 2 tests from double (373913 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 2 test suites ran. (422664 ms total)
[ PASSED ] 4 tests.
It's also worth to mention that the centered benchmark tests uniformly random data, which means 90% of all benchmarked inputs have large number of digits (7~9 for floats, 16~17 for doubles).