Use Ryu algorithm for floating point to string conversion
It was just brought to my attention by a friend that Crystal uses the Grisu3 algorithm for converting floating point numbers to strings. Apparently switching to the Ryu algorithm could offer a boost in performance. CC: @Zenohate
Numbers look great https://github.com/ulfjack/ryu#ryu-printf
It has a comparison with grisu2 (no grisu3 though) there, among others
So, my preliminary findings are:
- erthink is faster when the value can be represented by 12 or fewer digits (I checked this by changing the range of this loop);
- ryu is faster in all other cases;
https://github.com/ulfjack/ryu/issues/28#issuecomment-497864191
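For anyone who wants to reproduce this kind of comparison in Crystal itself, a rough micro-benchmark along these lines (the value choices are mine, not from the linked issue) shows how conversion cost depends on the number of significant digits:

```crystal
require "benchmark"

# Rough sketch: a value with few significant digits vs. one that needs
# (almost) all 17 digits to round-trip. The exact values are arbitrary.
short = 0.5
long  = 0.12345678901234567

Benchmark.ips do |x|
  x.report("short Float64#to_s") { short.to_s }
  x.report("long Float64#to_s") { long.to_s }
end
```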
I will tackle an implementation based on the C one. It will take time though and may bear no fruit. I am currently fighting the new overflow-safe Int a lot while doing all of that.
I am copying the API of the Grisu3 class for now so that it's easy to replace, and I'm writing the code in a separate library that can be merged later.
Best regards
Excited to see what you can come up with @Zenohate
Oh my... what a rabbit hole. Why did I start following links for ryu on GitHub 😆
Anyways... There is a discussion about implementing ryu in Go. Lots of interesting stuff.
This comment in particular might help correctness checking https://github.com/golang/go/issues/15672#issuecomment-476012114
Also, as mentioned in the paper:
> The main disadvantage of Ryū is its reliance on lookup tables. These are quite reasonable for the typical 32-bit and 64-bit floating point types but grow exponentially for larger types. It is up to future work to determine whether it is possible to reduce their size or even avoid them entirely.
I guess we could always keep Grisu3 around for floats bigger than 64 bits then.
There's also no good way to have static lookup tables in Crystal like they do in C, but once there's a Ryu implementation we'll see.
@asterite There is also a Java version https://github.com/ulfjack/ryu/tree/master/src/main/java/info/adams/ryu but I guess you mean Crystal needs C-like static lookup tables for such cases? Could that allow even more optimizations?
I guess C will dump those static lookup tables into the data segment of the program. In Java it seems they are initialized at runtime. In Crystal it'll probably have to be the same. Maybe it'll still be fast enough.
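As a tiny illustration of the difference (the table below is a made-up stand-in, not one of Ryu's actual tables): in Crystal a constant backed by an Array is allocated and filled by code that runs when the program starts, whereas a C `static const` array is emitted pre-built into the binary's data segment.

```crystal
# Hypothetical table (not from the Ryu sources): this Array is allocated
# and filled at runtime when the constant is initialized, rather than
# being stored ready-made in the data segment as in C.
POW10 = Array(UInt64).new(20) { |i| 10_u64 ** i }

puts POW10[3] # => 1000
```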
Maybe Crystal can use something similar to Ruby's __END__ + DATA magic for such data?
The readme does mention Grisu3 in some comparisons, FWIW (Ryu is stated to be significantly faster).
There is also Dragonbox now.
Dragonbox implements only the shortest round-trip conversion; while this is good enough for Float#to_s, it is unsuitable for sprintf %f / %e / %g because double rounding would occur. Instead we must still implement Ryu Printf if we want sprintf to be platform-independent by not relying on LibC.snprintf. (Ryu is the shortest representation algorithm, Ryu Printf is the fixed precision one.)
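To illustrate the double-rounding hazard: the double nearest to 9.95 is actually 9.94999999999999928..., so a correct fixed-precision printer must emit "9.9", whereas rounding the shortest representation "9.95" a second time would emit "10.0". A small sketch (sprintf here still goes through LibC.snprintf, which is exactly what a Ryu Printf port would replace):

```crystal
# "9.95" is only the *shortest* decimal representation; the underlying
# double is 9.949999999999999289... Rounding the shortest string again
# to one decimal place would give "10.0" (double rounding), while
# rounding the actual double value gives "9.9".
puts 9.95.to_s             # => "9.95" (shortest round-trip)
puts sprintf("%.1f", 9.95) # => "9.9" (correct fixed precision)
```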
@HertzDevil Hello, thank you for porting Dragonbox, I'm the author of the algorithm. This is just some information in case you're interested.
The biggest problem of Ryu printf is the enormous size (102 KB IIRC) of its static table. I actually have an implementation of (almost) the same algorithm which only requires 39 KB, here. It's not as extensively tested as Dragonbox though. Also it is probably a little bit slower than the original implementation for double, but the difference is not big, as you can see in the benchmark graph shown in the README.
In fact, I'm working on an alternative algorithm (which works similarly to Ryu printf but uses a different formula for the core computation), which I expect to be faster than Ryu printf while requiring only 4-5 KB (or even half or a quarter of that if you can afford to sacrifice more performance). At this point I have an almost-working implementation, but I sort of stopped finishing it because other things in my life have higher priority. Maybe early next year I'll publish the work and advertise it through reddit.
We have other similarly sized lookup tables (Unicode properties) so that is probably not an issue for now. But I will definitely keep an eye out for that Ryu Printf replacement.
I made a working implementation of the said algorithm (here is the repo, and here is the explanation) last December but forgot to mention it here. Currently I consider the implementation largely incomplete, though (I believe) it works fine for all double inputs. I think it's not ready for serious adoption yet for various reasons, but I wanted to say that a better algorithm definitely exists.
Roughly speaking, for the first few digits (up to 17~19) it performs the usual 128-bit x 64-bit multiplication that Dragonbox/Ryu/etc. do, re-using the same table, and for further digits it falls back to a slower mechanism relying on an additional table.
For this slower fallback mechanism, there are several tunable parameters that determine the trade-off between the size of the additional static data and the required number of multiplications per digit. The most significant parameter is something I call the segment length, which, roughly speaking, is the number of digits that are generated "at once". I tried setting this segment length to 22 and to 252, resulting in data sizes of 3680 bytes and 580 bytes, respectively. In the first case it performs several 192-bit x 64-bit multiplications to obtain 22 digits, while in the second case it performs 960-bit x 64-bit multiplications instead. It seems the performance of the first case is more or less equivalent to Ryu-printf; it wasn't a lot faster than Ryu-printf, unfortunately. You can refer to the benchmark graph I included in the repo.
Currently I'm thinking that it will probably perform better if I extend the Dragonbox table a little bit (which allows some simplification of the first phase of printing the first few digits) and also set the segment length to 18 instead of 22, at the expense of a slightly larger table, but I haven't done any experiments.
> We have other similarly sized lookup tables (Unicode properties) so that is probably not an issue for now
To get a sense of how large 102 KB is, I logged the effective sizes of those lookup tables (needs crystal-lang/perf-tools#13):
Source code:

```crystal
require "perf_tools/mem_prof"
require "html"

class Array(T)
  # TODO: some constants do not get registered in `PerfTools::MemProf`
  def fallback_reachable_size
    {% if T < Value %}
      instance_sizeof(self) + sizeof(T) * @capacity
    {% else %}
      0
    {% end %}
  end
end

def log_size(obj, name)
  reachable = PerfTools::MemProf.object_size(obj)
  if reachable == 0 && obj.responds_to?(:fallback_reachable_size)
    reachable = obj.fallback_reachable_size
  end
  puts "#{name} : #{obj.class}"
  puts "  #{obj.size} elements"
  puts "  #{reachable} bytes reachable"
end

module Unicode
  def self.log_sizes
    {% for cvar in @type.class_vars %}
      ::log_size({{cvar}}, "Unicode.{{cvar}}")
    {% end %}
  end

  log_sizes
end

struct String::Grapheme
  log_size(codepoints, "String::Grapheme.codepoints")
end

log_size(HTML::SINGLE_CHAR_ENTITIES, "HTML::SINGLE_CHAR_ENTITIES")
log_size(HTML::DOUBLE_CHAR_ENTITIES, "HTML::DOUBLE_CHAR_ENTITIES")

module Float::Printer::Dragonbox
  log_size(ImplInfo_Float32::CACHE, "Float::Printer::Dragonbox::ImplInfo_Float32::CACHE")
  log_size(ImplInfo_Float64::CACHE, "Float::Printer::Dragonbox::ImplInfo_Float64::CACHE")
end

module Float::Printer::RyuPrintf
  log_size(POW10_SPLIT, "Float::Printer::RyuPrintf::POW10_SPLIT")
  log_size(POW10_SPLIT_2, "Float::Printer::RyuPrintf::POW10_SPLIT_2")
end
```
Output:

```
Unicode.upcase_ranges : Array(Tuple(Int32, Int32, Int32))
  141 elements
  1716 bytes reachable
Unicode.downcase_ranges : Array(Tuple(Int32, Int32, Int32))
  125 elements
  1524 bytes reachable
Unicode.alternate_ranges : Array(Tuple(Int32, Int32))
  60 elements
  504 bytes reachable
Unicode.category_Lu : Array(Tuple(Int32, Int32, Int32))
  149 elements
  1812 bytes reachable
Unicode.category_Ll : Array(Tuple(Int32, Int32, Int32))
  163 elements
  1980 bytes reachable
Unicode.category_Lt : Array(Tuple(Int32, Int32, Int32))
  7 elements
  108 bytes reachable
Unicode.category_Lm : Array(Tuple(Int32, Int32, Int32))
  54 elements
  672 bytes reachable
Unicode.category_Lo : Array(Tuple(Int32, Int32, Int32))
  486 elements
  5856 bytes reachable
Unicode.category_Mn : Array(Tuple(Int32, Int32, Int32))
  308 elements
  3720 bytes reachable
Unicode.category_Mc : Array(Tuple(Int32, Int32, Int32))
  158 elements
  1920 bytes reachable
Unicode.category_Me : Array(Tuple(Int32, Int32, Int32))
  5 elements
  84 bytes reachable
Unicode.category_Nd : Array(Tuple(Int32, Int32, Int32))
  64 elements
  792 bytes reachable
Unicode.category_Nl : Array(Tuple(Int32, Int32, Int32))
  11 elements
  156 bytes reachable
Unicode.category_No : Array(Tuple(Int32, Int32, Int32))
  71 elements
  876 bytes reachable
Unicode.category_Zs : Array(Tuple(Int32, Int32, Int32))
  5 elements
  84 bytes reachable
Unicode.category_Zl : Array(Tuple(Int32, Int32, Int32))
  1 elements
  36 bytes reachable
Unicode.category_Zp : Array(Tuple(Int32, Int32, Int32))
  1 elements
  36 bytes reachable
Unicode.category_Cc : Array(Tuple(Int32, Int32, Int32))
  2 elements
  48 bytes reachable
Unicode.category_Cf : Array(Tuple(Int32, Int32, Int32))
  18 elements
  240 bytes reachable
Unicode.category_Cs : Array(Tuple(Int32, Int32, Int32))
  3 elements
  60 bytes reachable
Unicode.category_Co : Array(Tuple(Int32, Int32, Int32))
  3 elements
  60 bytes reachable
Unicode.category_Cn : Array(Tuple(Int32, Int32, Int32))
  0 elements
  24 bytes reachable
Unicode.casefold_ranges : Array(Tuple(Int32, Int32, Int32))
  681 elements
  8196 bytes reachable
Unicode.special_cases_downcase : Hash(Int32, Tuple(Int32, Int32, Int32))
  1 elements
  216 bytes reachable
Unicode.special_cases_upcase : Hash(Int32, Tuple(Int32, Int32, Int32))
  102 elements
  2872 bytes reachable
Unicode.special_cases_titlecase : Hash(Int32, Tuple(Int32, Int32, Int32))
  81 elements
  2872 bytes reachable
Unicode.fold_cases : Hash(Int32, Tuple(Int32, Int32, Int32))
  104 elements
  2872 bytes reachable
Unicode.canonical_combining_classes : Array(Tuple(Int32, Int32, UInt8))
  392 elements
  4728 bytes reachable
Unicode.canonical_decompositions : Hash(Int32, Tuple(Int32, Int32))
  2061 elements
  81976 bytes reachable
Unicode.compatibility_decomposition_data : Array(Int32)
  3230 elements
  12944 bytes reachable
Unicode.compatibility_decompositions : Hash(Int32, Tuple(Int32, Int32))
  3796 elements
  81976 bytes reachable
Unicode.canonical_compositions : Hash(Int64, Int32)
  941 elements
  28728 bytes reachable
Unicode.nfc_quick_check : Array(Tuple(Int32, Int32, Unicode::QuickCheckResult))
  117 elements
  1428 bytes reachable
Unicode.nfkc_quick_check : Array(Tuple(Int32, Int32, Unicode::QuickCheckResult))
  436 elements
  5256 bytes reachable
Unicode.nfd_quick_check : Array(Tuple(Int32, Int32))
  243 elements
  1968 bytes reachable
Unicode.nfkd_quick_check : Array(Tuple(Int32, Int32))
  548 elements
  4408 bytes reachable
String::Grapheme.codepoints : Array(Tuple(Int32, Int32, String::Grapheme::Property))
  1447 elements
  17388 bytes reachable
HTML::SINGLE_CHAR_ENTITIES : Hash(Slice(UInt8), Char)
  2032 elements
  73784 bytes reachable
HTML::DOUBLE_CHAR_ENTITIES : Hash(Slice(UInt8), String)
  93 elements
  4408 bytes reachable
Float::Printer::Dragonbox::ImplInfo_Float32::CACHE : Array(UInt64)
  78 elements
  648 bytes reachable
Float::Printer::Dragonbox::ImplInfo_Float64::CACHE : Array(Float::Printer::Dragonbox::WUInt::UInt128)
  619 elements
  9928 bytes reachable
Float::Printer::RyuPrintf::POW10_SPLIT : Array(Tuple(UInt64, UInt64, UInt64))
  1224 elements
  29400 bytes reachable
Float::Printer::RyuPrintf::POW10_SPLIT_2 : Array(Tuple(UInt64, UInt64, UInt64))
  3133 elements
  75216 bytes reachable
```
In short, the Ryu-printf tables are indeed comparable in size to the tables used for Unicode normalization or for HTML entity (un)escaping. However, the code used to populate those tables at runtime is much larger than the tables' contents themselves. Here I compile the following code and compare the resulting binary's size against a blank source file, on my Apple M2 (at the moment the new tables won't be codegen'ed at all unless RyuPrintf's methods are called explicitly, because the stdlib doesn't incorporate them yet):
```crystal
Float::Printer::RyuPrintf.d2fixed(1.23, 5)
Float::Printer::RyuPrintf.d2exp(1.23, 5)
```
- Non-release: 1231112 B -> 1630104 B (+398992 B, +32.4% size)
- Release: 298360 B -> 514024 B (+215664 B, +72.3% size)
On the other hand, if the new tables are backed by Slice.literal instead, the release-mode binary would unconditionally grow to roughly 400 KB, while the effect on non-release builds would drop below 10%.
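For reference, a minimal sketch of the Slice.literal approach (the table contents here are made up, and the feature may require a recent compiler, as it was gated behind a flag in older versions):

```crystal
# Slice.literal embeds the values in the binary's read-only data section,
# similar to a C `static const` array, instead of allocating an Array and
# filling it at startup. The values below are placeholders.
TABLE = Slice.literal(1_u64, 2_u64, 3_u64)

puts TABLE[1] # => 2
```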
There's a functional difference in their purposes, though: the Unicode normalization and HTML entity tables are unavoidable. If you want to normalize Unicode or (un)escape HTML entities, you simply need those tables; there's no way around that. Maybe there's some room for compacting their representation to save some bytes, but that's only a minor effect.
With the lookup tables for the Ryu algorithm, it's different: they're not strictly necessary for the general task, just for this specific implementation. If you want to print floating point numbers, there are other algorithms that don't need those tables. The tables' motivation is an optional trade-off: better execution time at the cost of memory size.
For most general programs that's a good deal, because size usually doesn't matter much. So it's probably fine as a default.
However, there might be use cases where you don't want that compromise. Maybe we could allow choosing the algorithm at compile time? I suppose that's always possible by monkey-patching the stdlib, but perhaps a more stable mechanism could be helpful. Alternative implementations don't necessarily need to be distributed in the stdlib, though.
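One possible shape for such a mechanism, sketched with a made-up flag and module name (Crystal's `flag?` macro already supports user-defined `-D` flags):

```crystal
# Hypothetical: select the printing algorithm at compile time with
# `crystal build -Dsmall_float_tables`. The flag and module names are
# invented for illustration; nothing like this exists in the stdlib.
module MyFloatPrinter
  def self.algorithm : String
    {% if flag?(:small_float_tables) %}
      "grisu3" # table-light fallback
    {% else %}
      "ryu" # faster, but carries the big lookup tables
    {% end %}
  end
end

puts MyFloatPrinter.algorithm # => "ryu" (without -Dsmall_float_tables)
```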