Use Ryu algorithm for floating point to string conversion
It was just brought to my attention by a friend that Crystal uses the Grisu3 algorithm for converting floating point numbers to strings. Apparently switching to the Ryu algorithm could offer a boost in performance. CC: @Zenohate
Numbers look great https://github.com/ulfjack/ryu#ryu-printf
It has a comparison with grisu2 (no grisu3 though) there, among others
So, my preliminary findings are:
- erthink is faster when the value can be represented by 12 or fewer digits (I checked this by changing the range of this loop);
- ryu is faster in all other cases;
https://github.com/ulfjack/ryu/issues/28#issuecomment-497864191
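For anyone who wants to reproduce this kind of comparison in Crystal itself, a rough micro-benchmark along these lines (the value choices are mine, not from the linked issue) shows how conversion cost depends on the number of significant digits:

```crystal
require "benchmark"

# Rough sketch: a value with few significant digits vs. one that needs
# (almost) all 17 digits to round-trip. The exact values are arbitrary.
short = 0.5
long  = 0.12345678901234567

Benchmark.ips do |x|
  x.report("short Float64#to_s") { short.to_s }
  x.report("long Float64#to_s") { long.to_s }
end
```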
I will tackle an implementation based on the C one. It will take time though and may bear no fruit. I am currently fighting the new overflow-safe Int a lot while doing all of that.
I am copying the API of the Grisu3 class for now so that it's easy to replace, and I'm writing the code in a separate library that can be merged later.
Best regards
Excited to see what you can come up with @Zenohate
Oh my... what a rabbit hole. Why did I start following links for ryu on GitHub 😆
Anyways... There is a discussion about implementing ryu in Go. Lots of interesting stuff.
This comment in particular might help correctness checking https://github.com/golang/go/issues/15672#issuecomment-476012114
Also, as mentioned in the paper:
> The main disadvantage of Ryū is its reliance on lookup tables. These are quite reasonable for the typical 32-bit and 64-bit floating point types but grow exponentially for larger types. It is up to future work to determine whether it is possible to reduce their size or even avoid them entirely.
I guess we could always keep Grisu3 around for floats bigger than 64 bits then.
There's also no good way to have static lookup tables in Crystal like they do in C, but once there's a Ryu implementation we'll see.
@asterite There is also a Java version https://github.com/ulfjack/ryu/tree/master/src/main/java/info/adams/ryu but I guess you mean Crystal needs C-like static lookup tables for such cases? Could that allow even more optimizations?
I guess C will dump those static lookup tables into the data segment of the program. In Java it seems they are initialized at runtime. In Crystal it'll probably have to be the same. Maybe it'll still be fast enough.
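As a tiny illustration of the difference (the table below is a made-up stand-in, not one of Ryu's actual tables): in Crystal a constant backed by an Array is allocated and filled by code that runs when the program starts, whereas a C `static const` array is emitted pre-built into the binary's data segment.

```crystal
# Hypothetical table (not from the Ryu sources): this Array is allocated
# and filled at runtime when the constant is initialized, rather than
# being stored ready-made in the data segment as in C.
POW10 = Array(UInt64).new(20) { |i| 10_u64 ** i }

puts POW10[3] # => 1000
```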
Maybe Crystal can use something similar to Ruby's __END__ + DATA magic for such data?
The readme does mention Grisu3 in some comparisons, FWIW (Ryu is stated to be significantly faster).
There is also Dragonbox now.
Dragonbox implements only the shortest round-trip conversion; while this is good enough for Float#to_s, it is unsuitable for sprintf %f / %e / %g because double rounding would occur. Instead we must still implement Ryu Printf if we want sprintf to be platform-independent by not relying on LibC.snprintf. (Ryu is the shortest representation algorithm, Ryu Printf is the fixed precision one.)
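To illustrate the double-rounding hazard: the double nearest to 9.95 is actually 9.94999999999999928..., so a correct fixed-precision printer must emit "9.9", whereas rounding the shortest representation "9.95" a second time would emit "10.0". A small sketch (sprintf here still goes through LibC.snprintf, which is exactly what a Ryu Printf port would replace):

```crystal
# "9.95" is only the *shortest* decimal representation; the underlying
# double is 9.949999999999999289... Rounding the shortest string again
# to one decimal place would give "10.0" (double rounding), while
# rounding the actual double value gives "9.9".
puts 9.95.to_s             # => "9.95" (shortest round-trip)
puts sprintf("%.1f", 9.95) # => "9.9" (correct fixed precision)
```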
@HertzDevil Hello, thank you for porting Dragonbox, I'm the author of the algorithm. This is just some information in case you're interested.
The biggest problem of Ryu printf is the enormous size (102 KB IIRC) of its static table. I actually have an implementation of (almost) the same algorithm which only requires 39 KB, here. It's not as extensively tested as Dragonbox though. Also it is probably a little bit slower than the original implementation for double, but the difference is not big, as you can see in the benchmark graph shown in the README.
In fact, I'm working on an alternative algorithm (which works similarly to Ryu printf but uses a different formula for the core computation), which I expect to be faster than Ryu printf while requiring only 4-5 KB (or even half or a quarter of that if you can afford to sacrifice more performance). At this point I have an almost-working implementation, but I sort of stopped finishing it because other things in my life have higher priority. Maybe early next year I'll publish the work and advertise it through reddit.
We have other similarly sized lookup tables (Unicode properties) so that is probably not an issue for now. But I will definitely keep an eye out for that Ryu Printf replacement.
I made a working implementation of the said algorithm (here is the repo, and here is the explanation) last December but forgot to mention it here. Currently I consider the implementation largely incomplete, though (I believe) it works fine for all double inputs. I think it's not ready for serious adoption yet for various reasons, but I wanted to say that a better algorithm definitely exists.
Roughly speaking, for the first few digits (up to 17~19) it performs the usual 128-bit x 64-bit multiplication that Dragonbox/Ryu/etc. do, re-using the same table, and for further digits it falls back to a slower mechanism relying on an additional table.
For this slower fallback mechanism, there are several tunable parameters that determine the trade-off between the size of the additional static data and the required number of multiplications per digit. The most significant parameter is something I call the segment length, which, roughly speaking, is the number of digits that are generated "at once". I tried setting this segment length to 22 and to 252, resulting in data sizes of 3680 bytes and 580 bytes, respectively. In the first case it performs several 192-bit x 64-bit multiplications to obtain 22 digits, while in the second case it performs 960-bit x 64-bit multiplications instead. It seems the performance of the first case is more or less equivalent to Ryu-printf; it wasn't a lot faster than Ryu-printf, unfortunately. You can refer to the benchmark graph I included in the repo.
Currently I'm thinking that it will probably perform better if I extend the Dragonbox table a little bit (which allows some simplification of the first phase of printing the first few digits) and also set the segment length to 18 instead of 22, at the expense of a slightly larger table, but I haven't done any experiments.
> We have other similarly sized lookup tables (Unicode properties) so that is probably not an issue for now
To get a sense of how large 102 KB is, I logged the effective sizes of those lookup tables (needs crystal-lang/perf-tools#13):
Source code:

```crystal
require "perf_tools/mem_prof"
require "html"

class Array(T)
  # TODO: some constants do not get registered in `PerfTools::MemProf`
  def fallback_reachable_size
    {% if T < Value %}
      instance_sizeof(self) + sizeof(T) * @capacity
    {% else %}
      0
    {% end %}
  end
end

def log_size(obj, name)
  reachable = PerfTools::MemProf.object_size(obj)
  if reachable == 0 && obj.responds_to?(:fallback_reachable_size)
    reachable = obj.fallback_reachable_size
  end
  puts "#{name} : #{obj.class}"
  puts "  #{obj.size} elements"
  puts "  #{reachable} bytes reachable"
end

module Unicode
  def self.log_sizes
    {% for cvar in @type.class_vars %}
      ::log_size({{cvar}}, "Unicode.{{cvar}}")
    {% end %}
  end

  log_sizes
end

struct String::Grapheme
  log_size(codepoints, "String::Grapheme.codepoints")
end

log_size(HTML::SINGLE_CHAR_ENTITIES, "HTML::SINGLE_CHAR_ENTITIES")
log_size(HTML::DOUBLE_CHAR_ENTITIES, "HTML::DOUBLE_CHAR_ENTITIES")

module Float::Printer::Dragonbox
  log_size(ImplInfo_Float32::CACHE, "Float::Printer::Dragonbox::ImplInfo_Float32::CACHE")
  log_size(ImplInfo_Float64::CACHE, "Float::Printer::Dragonbox::ImplInfo_Float64::CACHE")
end

module Float::Printer::RyuPrintf
  log_size(POW10_SPLIT, "Float::Printer::RyuPrintf::POW10_SPLIT")
  log_size(POW10_SPLIT_2, "Float::Printer::RyuPrintf::POW10_SPLIT_2")
end
```
Output:

```
Unicode.upcase_ranges : Array(Tuple(Int32, Int32, Int32))
  141 elements
  1716 bytes reachable
Unicode.downcase_ranges : Array(Tuple(Int32, Int32, Int32))
  125 elements
  1524 bytes reachable
Unicode.alternate_ranges : Array(Tuple(Int32, Int32))
  60 elements
  504 bytes reachable
Unicode.category_Lu : Array(Tuple(Int32, Int32, Int32))
  149 elements
  1812 bytes reachable
Unicode.category_Ll : Array(Tuple(Int32, Int32, Int32))
  163 elements
  1980 bytes reachable
Unicode.category_Lt : Array(Tuple(Int32, Int32, Int32))
  7 elements
  108 bytes reachable
Unicode.category_Lm : Array(Tuple(Int32, Int32, Int32))
  54 elements
  672 bytes reachable
Unicode.category_Lo : Array(Tuple(Int32, Int32, Int32))
  486 elements
  5856 bytes reachable
Unicode.category_Mn : Array(Tuple(Int32, Int32, Int32))
  308 elements
  3720 bytes reachable
Unicode.category_Mc : Array(Tuple(Int32, Int32, Int32))
  158 elements
  1920 bytes reachable
Unicode.category_Me : Array(Tuple(Int32, Int32, Int32))
  5 elements
  84 bytes reachable
Unicode.category_Nd : Array(Tuple(Int32, Int32, Int32))
  64 elements
  792 bytes reachable
Unicode.category_Nl : Array(Tuple(Int32, Int32, Int32))
  11 elements
  156 bytes reachable
Unicode.category_No : Array(Tuple(Int32, Int32, Int32))
  71 elements
  876 bytes reachable
Unicode.category_Zs : Array(Tuple(Int32, Int32, Int32))
  5 elements
  84 bytes reachable
Unicode.category_Zl : Array(Tuple(Int32, Int32, Int32))
  1 elements
  36 bytes reachable
Unicode.category_Zp : Array(Tuple(Int32, Int32, Int32))
  1 elements
  36 bytes reachable
Unicode.category_Cc : Array(Tuple(Int32, Int32, Int32))
  2 elements
  48 bytes reachable
Unicode.category_Cf : Array(Tuple(Int32, Int32, Int32))
  18 elements
  240 bytes reachable
Unicode.category_Cs : Array(Tuple(Int32, Int32, Int32))
  3 elements
  60 bytes reachable
Unicode.category_Co : Array(Tuple(Int32, Int32, Int32))
  3 elements
  60 bytes reachable
Unicode.category_Cn : Array(Tuple(Int32, Int32, Int32))
  0 elements
  24 bytes reachable
Unicode.casefold_ranges : Array(Tuple(Int32, Int32, Int32))
  681 elements
  8196 bytes reachable
Unicode.special_cases_downcase : Hash(Int32, Tuple(Int32, Int32, Int32))
  1 elements
  216 bytes reachable
Unicode.special_cases_upcase : Hash(Int32, Tuple(Int32, Int32, Int32))
  102 elements
  2872 bytes reachable
Unicode.special_cases_titlecase : Hash(Int32, Tuple(Int32, Int32, Int32))
  81 elements
  2872 bytes reachable
Unicode.fold_cases : Hash(Int32, Tuple(Int32, Int32, Int32))
  104 elements
  2872 bytes reachable
Unicode.canonical_combining_classes : Array(Tuple(Int32, Int32, UInt8))
  392 elements
  4728 bytes reachable
Unicode.canonical_decompositions : Hash(Int32, Tuple(Int32, Int32))
  2061 elements
  81976 bytes reachable
Unicode.compatibility_decomposition_data : Array(Int32)
  3230 elements
  12944 bytes reachable
Unicode.compatibility_decompositions : Hash(Int32, Tuple(Int32, Int32))
  3796 elements
  81976 bytes reachable
Unicode.canonical_compositions : Hash(Int64, Int32)
  941 elements
  28728 bytes reachable
Unicode.nfc_quick_check : Array(Tuple(Int32, Int32, Unicode::QuickCheckResult))
  117 elements
  1428 bytes reachable
Unicode.nfkc_quick_check : Array(Tuple(Int32, Int32, Unicode::QuickCheckResult))
  436 elements
  5256 bytes reachable
Unicode.nfd_quick_check : Array(Tuple(Int32, Int32))
  243 elements
  1968 bytes reachable
Unicode.nfkd_quick_check : Array(Tuple(Int32, Int32))
  548 elements
  4408 bytes reachable
String::Grapheme.codepoints : Array(Tuple(Int32, Int32, String::Grapheme::Property))
  1447 elements
  17388 bytes reachable
HTML::SINGLE_CHAR_ENTITIES : Hash(Slice(UInt8), Char)
  2032 elements
  73784 bytes reachable
HTML::DOUBLE_CHAR_ENTITIES : Hash(Slice(UInt8), String)
  93 elements
  4408 bytes reachable
Float::Printer::Dragonbox::ImplInfo_Float32::CACHE : Array(UInt64)
  78 elements
  648 bytes reachable
Float::Printer::Dragonbox::ImplInfo_Float64::CACHE : Array(Float::Printer::Dragonbox::WUInt::UInt128)
  619 elements
  9928 bytes reachable
Float::Printer::RyuPrintf::POW10_SPLIT : Array(Tuple(UInt64, UInt64, UInt64))
  1224 elements
  29400 bytes reachable
Float::Printer::RyuPrintf::POW10_SPLIT_2 : Array(Tuple(UInt64, UInt64, UInt64))
  3133 elements
  75216 bytes reachable
```
In short, the Ryu-printf tables are indeed comparable in size to the tables used for Unicode normalization or for HTML entity (un)escaping. However, the code used to populate those tables at runtime is much larger than the tables' contents themselves. Here I compile the following code and compare the resulting binary's size against a blank source file, on my Apple M2 (at the moment the new tables won't be codegen'ed at all unless RyuPrintf's methods are called explicitly, because the stdlib doesn't incorporate them yet):
```crystal
Float::Printer::RyuPrintf.d2fixed(1.23, 5)
Float::Printer::RyuPrintf.d2exp(1.23, 5)
```
- Non-release: 1231112 B -> 1630104 B (+398992 B, +32.4% size)
- Release: 298360 B -> 514024 B (+215664 B, +72.3% size)
On the other hand, if the new tables are backed by Slice.literal instead, the release-mode binary would unconditionally grow to roughly 400 KB, while the effect on non-release builds would drop below 10%.
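For reference, a minimal sketch of the Slice.literal approach (the table contents here are made up, and the feature may require a recent compiler, as it was gated behind a flag in older versions):

```crystal
# Slice.literal embeds the values in the binary's read-only data section,
# similar to a C `static const` array, instead of allocating an Array and
# filling it at startup. The values below are placeholders.
TABLE = Slice.literal(1_u64, 2_u64, 3_u64)

puts TABLE[1] # => 2
```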
There's a functional difference in their purposes, though: the Unicode normalization and HTML entity tables are unavoidable. If you want to normalize Unicode or (un)escape HTML entities, you simply need those tables; there's no way around that. Maybe there's some room for compacting their representation to save some bytes, but that's only a minor effect.
With the lookup tables for the Ryu algorithm, it's different: they're not strictly necessary for the general task, just for this specific implementation. If you want to print floating point numbers, there are other algorithms that don't need those tables. The tables' motivation is an optional trade-off: better execution time at the cost of memory size.
For most general programs that's a good deal, because size usually doesn't matter much. So it's probably fine as a default.
However, there might be use cases where you don't want that compromise. Maybe we could allow choosing the algorithm at compile time? I suppose that's always possible by monkey-patching the stdlib, but perhaps a more stable mechanism could be helpful. Alternative implementations don't necessarily need to be distributed in the stdlib, though.
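One possible shape for such a mechanism, sketched with a made-up flag and module name (Crystal's `flag?` macro already supports user-defined `-D` flags):

```crystal
# Hypothetical: select the printing algorithm at compile time with
# `crystal build -Dsmall_float_tables`. The flag and module names are
# invented for illustration; nothing like this exists in the stdlib.
module MyFloatPrinter
  def self.algorithm : String
    {% if flag?(:small_float_tables) %}
      "grisu3" # table-light fallback
    {% else %}
      "ryu" # faster, but carries the big lookup tables
    {% end %}
  end
end

puts MyFloatPrinter.algorithm # => "ryu" (without -Dsmall_float_tables)
```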