New double-to-string algorithm
https://gitee.com/xjb714/dtoa-benchmark
The source code is at the link above, in src/d2sci/d2s.cpp (the d2s_sse function).
It is an SSE implementation based on Schubfach; its performance is currently about 1.5 times that of Dragonbox or the dtoa algorithm in yyjson. Please decide whether to adopt it after testing.
The project also includes implementations of other floating-point printing algorithms, which you can use if you are interested.
It only contains double-precision floating-point printing algorithms.
Thanks! Your new algorithm looks great performance-wise. But yyjson focuses on portability, so I can't use SSE or AVX-specific code directly.
I also have a benchmark project for testing number conversion algorithms: https://github.com/ibireme/c_numconv_benchmark Here are the benchmark results:
Intel Core i5 13600KF (does not support AVX512)
Clang 18.1.3
Ubuntu 24.04 LTS
The yy here is my new floating-point to string algorithm. I'm still working on a paper to prove the mathematical correctness of the algorithm, so it's not in yyjson yet.
It improves binary-to-decimal conversion, which might help with your algorithm too: https://github.com/ibireme/c_numconv_benchmark/blob/master/vendor/yy_double/yy_double.c
I use icpx 2025.0.4 with -O3 to compile my code. clang++ seems slightly less efficient than icpx, and g++ seems even less efficient; in general, the code compiled by icpx performs best. The link I sent earlier contains the static library d2sci2 (compiled on an AMD R7-7840H). You can link this library into your benchmark program and try the d2s_sse algorithm again. I ran the benchmark on my laptop and the results differ a little from yours. My benchmark results are below; after downloading, change the file suffix to .html.
Almost all consumer-grade CPUs now include the SSE instruction set; CPUs without it are very rare. Maybe you could provide SSE, AVX2, and even AVX512 APIs in later versions of yyjson to make the best use of the CPU, or dynamically select the fastest implementation at runtime based on the CPU, although that seems difficult to implement.
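A rough sketch of the runtime-dispatch idea, assuming GCC/Clang on x86-64; the d2s_* function names below are placeholders, not existing symbols in yyjson or dtoa-benchmark:

```c
typedef char *(*d2s_fn)(double val, char *buf);

/* Hypothetical backends (declarations only; each would live in its own TU
   compiled with the matching -m flags). */
char *d2s_scalar(double val, char *buf);
char *d2s_sse(double val, char *buf);
char *d2s_avx2(double val, char *buf);
char *d2s_avx512(double val, char *buf);

/* Pick the widest implementation the running CPU supports. */
static d2s_fn select_d2s(void) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) return d2s_avx512;
    if (__builtin_cpu_supports("avx2"))    return d2s_avx2;
    if (__builtin_cpu_supports("sse4.2"))  return d2s_sse;
#endif
    return d2s_scalar;
}
```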
My repository also includes the function d2s_32v, an AVX512 implementation of Schubfach that uses AVX512 instructions to compute the ASCII results of 32 double values in parallel, i.e. to generate the shortest significant digits. At present it takes about 2.4 ns to print a single double value on an AMD R9-9950X, but it still needs to be proven correct and improved. The intended use case for this API is printing a large number of double-precision values; the performance gain is substantial when the data volume is very large.
Thank you!
@ibireme Wow, it seems it performs 128-bit x 64-bit multiplication only once, in all cases. Looking forward to reading your paper. Did you by any chance try to compare it with Teju as well?
@ibireme
Your f64_bin_to_dec function can be optimized in this way; in my testing it saves about 3 to 4 cycles on my CPU.
> Did you by any chance try to compare it with Teju as well?
@jk-jeon I did a quick test when teju_jagua first came out. It only handles binary-to-decimal and didn't seem to outperform Dragonbox. The author is still updating it, so I'll check it out again later.
@xjb714 I figured out why there are performance differences.
I tested it on Amazon EC2 and added your new algorithm to the c_numconv_benchmark project. Here are the results:
EC2 7a.xlarge AMD EPYC 9R14 (Zen 4)
clang 18.1.3 -march=native
EC2 c7i.xlarge Intel Xeon Platinum 8488C (Sapphire Rapids)
clang 18.1.3 -march=native
Apple M1 Pro
Apple clang 16.0.0
As you can see, your algorithm performs best on the AMD CPU, and the results are similar even with the same binary. I think the difference might be related to the specific microarchitecture; I'll look into it more later.
> Almost all consumer-grade CPUs now include the SSE instruction set; CPUs without it are very rare. Maybe you could provide SSE, AVX2, and even AVX512 APIs in later versions of yyjson to make the best use of the CPU, or dynamically select the fastest implementation at runtime based on the CPU, although that seems difficult to implement.
I mainly work on Mac and mobile devices, so I don't focus much on x86. For yyjson, I want to keep it architecture-independent. When optimizing for performance, I try to keep things balanced so it doesn't perform too badly on any platform.
> Your f64_bin_to_dec function can be optimized in this way; in my testing it saves about 3 to 4 cycles on my CPU.
@xjb714 I tried tuning for ARM64/X64 with GCC/Clang/MSVC on Compiler Explorer, and I kept running into cases where a change worked as expected in one compiler, but actually made things worse in another.
For example, replacing three breaks with one break works fine in ICX, which generates the right assembly. But Clang/GCC still optimize it into three branches with some extra instructions.
@ibireme Thank you for the information. Your algorithm does perform better on most hardware and does not depend on a specific instruction set. It is really ingenious, and I look forward to your correctness proof.

I have implemented corresponding AVX-512 code following the full-path code in your f64_bin_to_dec function, and it passes a random-number test. At present its performance is about 10% higher than the AVX-512 implementation of Schubfach, but some intermediate calculation steps still need to be optimized and should give better performance. It is implemented as the d2s_yy_32v function at https://gitee.com/xjb714/dtoa-benchmark/blob/master/src/d2sci/d2s_32v.cpp. However, the printed results still need to be optimized when the absolute value of exp_decimal is in a relatively small range, such as [-7, 8]. Maybe you could implement your algorithm with the SIMD instructions of the ARM architecture, but I am not familiar with ARM SIMD and do not have an ARM CPU.
The following are the test results for random doubles:
AMD R7 7840H (with AVX-512)
icpx 2025.0.4
Ubuntu 24.04.2 LTS
The d2s_yy_32v function is the AVX-512 implementation of your algorithm.
The d2s_32v function is the AVX-512 implementation of the Schubfach algorithm.
@ibireme Performance optimization of yy_double_to_string; the source code link is as follows.
It improves performance in the random-length test.
@xjb714 Your code works really well, but portability is still an issue. The result of __builtin_clzll(0) is undefined. For example, on x86_64 the default compiler options generate the bsr instruction instead of lzcnt, which gives a wrong result. You can check it here: https://godbolt.org/z/bW9zj81Wv.
This code performs best on newer CPUs like Intel Haswell, AMD Excavator, and ARM64. To support older CPUs, you'll need to add extra checks, which might slightly hurt performance for full-length numbers.
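A minimal sketch of such an extra check, assuming GCC/Clang builtins:

```c
#include <stdint.h>

/* __builtin_clzll(0) is undefined; without -mlzcnt the compiler may emit BSR,
   whose result is unspecified for an input of 0, so guard the zero case. */
static inline unsigned clz64_safe(uint64_t x) {
    return x ? (unsigned)__builtin_clzll(x) : 64u;
}
```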
@ibireme
It seems there is no problem using the following code instead to calculate the number of trailing zeros in the lower 16 digits.
u64 tz1 = __builtin_clzll(num8_1) >> 3;
u64 tz2 = __builtin_clzll(num8_2) >> 3;
if (num8_1 == 0) tz1 = 8;
u64 tz = (num8_2 == 0) ? (tz1 + 8) : tz2;
num8_1 is the ASCII value of the high 8 of the lower 16 digits minus 0x3030303030303030; num8_2 is defined in the same way.
@ibireme This code achieves the same result, and the compiled output has no branch instructions, so the penalty of a branch misprediction is avoided. See this link: https://godbolt.org/z/fEa5oGcjo
@xjb714 Yes, what I mean is that these checks can slightly slow down full-length numbers.
For example, I tested with an Intel 13700 using clang -O3, making sure the new code uses cmov. Here's how the cycles changed compared to the original:
random binary: 65 -> 73
random length: 93 -> 73
fixed length: [54,68] -> [72,73]
normalized: 60 -> 67
Random length and fixed length represent two extremes. Random length is harder on branch prediction, so removing branches leads to a big speedup. Fixed length is well predicted by the CPU, so there is no misprediction penalty, which means the branchless code can actually be slower because it executes more instructions.
In real-world cases, data usually falls somewhere in between, but for structured JSON, it tends to be closer to the fixed-length (or normalized) scenario.
It seems that performance can be improved by using a larger lookup table: 4 KB to store the ASCII codes from '000' to '999' and 4.94 KB to store the exp_dec strings from 'e-324' to 'e308'. Please test the performance on different hardware to decide whether to adopt it.
@ibireme A code error fix. Location: the write_u64_len_1_to_16 function.
before:
u64 e10_tmp = e10_tmp_table[__builtin_clzll(val)];
after:
u64 e10_tmp = e10_tmp_table[__builtin_clzll(val|1)];
> It seems that performance can be improved by using a larger lookup table: 4 KB to store the ASCII codes from '000' to '999' and 4.94 KB to store the exp_dec strings from 'e-324' to 'e308'. Please test the performance on different hardware to decide whether to adopt it.
This version performs really well on my M1, with almost all tests showing improvements. But on Intel, the results are similar to before, with some gains and some losses, so I'll need to dig into that a bit more. I believe with a bit more tweaking, this algorithm could perform really well.
I had only tested with a 40K LUT before, as in itoa_yy_largelut.c, but I feel a large LUT might not be acceptable to many people and could put pressure on the L1 data cache. A 4K LUT seems like a better balance.
@ibireme The pow10 lookup table is about 10 KB and is necessary, but in practice only 1 KB of it, or even less, may be touched, since the decimal exponents of most values are around 0.

The 4 KB lookup table stores the ASCII values from 000 to 999, and only 3 of the 4 bytes of each entry are actually used. The remaining byte could store other information without affecting the printed result, such as the number of leading or trailing zeros, though I have not studied this in detail. With a 3 KB table the index would need a multiply by 3, which may be slower, so a 4 KB table is used.

The L1 data cache of most CPUs is 32 KB. A 40 KB lookup table will cause L1 cache misses, while a 4 KB table will not, so the 40 KB table is not suitable when the L1 data cache is 32 KB. The L1 data cache of the Apple M1 seems to be 64 KB, so the 40 KB table performs best there, a little better than the 4 KB table.

The code I sent before contains three lookup tables: 200 B, 4 KB, and 40 KB; the 4 KB table is suitable for most CPUs. With the 4 KB table, the total lookup-table space is about 10 KB + 200 B + 4 KB + 4.94 KB, roughly 20 KB in the worst case, but in a real application it may be 1 KB + 200 B + 4 KB + 0.5 KB, less than 6 KB, which will not exceed the L1 data cache.
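A minimal sketch of such a 4 KB three-digit table; the names (digit3_lut, write9) are illustrative and not taken from either project:

```c
#include <stdint.h>
#include <string.h>

/* 1000 entries x 4 bytes = 4000 bytes: ASCII "000".."999" plus one spare byte,
   which could hold extra information such as a leading-zero count. */
static uint8_t digit3_lut[1000][4];

static void init_digit3_lut(void) {
    for (int i = 0; i < 1000; i++) {
        digit3_lut[i][0] = (uint8_t)('0' + i / 100);
        digit3_lut[i][1] = (uint8_t)('0' + i / 10 % 10);
        digit3_lut[i][2] = (uint8_t)('0' + i % 10);
        digit3_lut[i][3] = 0; /* spare byte */
    }
}

/* Write a value < 10^9 as exactly 9 digits, three digits per 4-byte copy.
   The buffer needs at least 10 bytes; the 10th byte is scratch and is
   normally overwritten by whatever is written next (e.g. the exponent). */
static char *write9(char *buf, uint32_t v) {
    memcpy(buf,     digit3_lut[v / 1000000],     4);
    memcpy(buf + 3, digit3_lut[v / 1000 % 1000], 4);
    memcpy(buf + 6, digit3_lut[v % 1000],        4);
    return buf + 9;
}
```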
The test results of LJSON show that it is faster and more capable than yyjson, and with the newly developed double-to-string algorithm (ldouble) it is faster still, with a lookup table as small as about 2 KB.
@lengjingzju Thanks. Your ldouble_convert source code contains two macro definitions, USING_U128_MUL and USING_HIGH_RESOLUTION. Can you explain them, and whether they affect the results or the performance? It also seems that the shortest significant-digit result is not generated for subnormal values.
@lengjingzju I gave your jnum_dtoa() a quick try; it looks like it has a couple of issues. Some numbers don't get the shortest form, like 1e-323 coming out as 9.88131291682493e-324; some lose precision, like 9.999999999999998 coming out as 10.0.
I'd suggest testing against a dataset like this and comparing results with a well-tested library like double-conversion. That should help catch edge cases and make sure the output is correct.
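For example, a minimal round-trip checker over random bit patterns; my_dtoa is a placeholder for the function under test, and this only checks strtod(dtoa(num)) == num, not shortness or correct rounding:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

extern int my_dtoa(double val, char *buf); /* placeholder: returns length written */

/* xorshift64: quick PRNG to cover arbitrary 64-bit patterns. */
static uint64_t xorshift64(uint64_t *s) {
    uint64_t x = *s;
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return *s = x;
}

int main(void) {
    uint64_t seed = 88172645463325252ull;
    char buf[64];
    for (long i = 0; i < 100000000L; i++) {
        uint64_t bits = xorshift64(&seed);
        if ((bits & 0x7FF0000000000000ull) == 0x7FF0000000000000ull) continue; /* skip NaN/Inf */
        double val;
        memcpy(&val, &bits, sizeof(val));
        int len = my_dtoa(val, buf);
        buf[len] = '\0';
        if (strtod(buf, NULL) != val) {
            printf("round-trip failed: 0x%016llx -> %s\n", (unsigned long long)bits, buf);
            return 1;
        }
    }
    puts("all round-trips ok");
    return 0;
}
```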
- Enabling USING_U128_MUL means using GCC/Clang's 128-bit multiplication and division operations; otherwise, floating-point multiplication is used. The 128-bit method yields a more accurate value in the last digit (a minimal sketch of such a multiply appears after this list).
- Enabling USING_HIGH_RESOLUTION means using 16-digit decimal precision; otherwise, 15-digit decimal precision is used, where the last digit is an approximation.
- Enabling USING_SMALL_LUT means using a smaller pow10 lookup table (about 2 KB); otherwise, the lookup table doubles in size.
- "1e-323 coming out as 9.88131291682493e-324": printf behaves the same way.
- "9.999999999999998 output as 10.0": because the last significant digit in this algorithm is a rounded digit, it is specially processed to ensure that the number is the shortest.
The conversion from decimal to binary is inherently an approximate operation, and this algorithm does not guarantee the accuracy of the last digit. @xjb714 @ibireme
- With the following patch, 1e-323 now comes out as 1e-323: https://github.com/lengjingzju/json/commit/e748d2d2c1829c6b80196d3d4fdb0f63dab90f4c
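For reference, a minimal sketch of the kind of 128-bit multiply that USING_U128_MUL presumably enables, using the GCC/Clang __uint128_t extension (the helper name is illustrative):

```c
#include <stdint.h>

/* Full 64x64 -> 128-bit product, returning the high 64 bits.
   GCC/Clang only; MSVC would need _umul128 or an equivalent. */
static inline uint64_t mul_hi64(uint64_t a, uint64_t b) {
    return (uint64_t)(((__uint128_t)a * b) >> 64);
}
```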
@lengjingzju Most modern double-to-string algorithms follow three principles from the Steele & White paper:
- Accurate: Parsing the string back gives the original number (strtod(dtoa(num)) == num).
- Shortest: The output uses the fewest digits possible without losing accuracy.
- Correctly rounded: The output is the closest possible to the original number.
The C standard doesn't strictly define how floating-point formatting should work, so printf/sprintf behavior depends on the libc implementation. printf("%.17g", num) typically ensures accuracy, but does not guarantee shortness or correct rounding, so you can't fully rely on it.
For example, with the number 9.9999999999999982236431605997495353221893310546875:
- 9.999999999999998 (OK)
- 9.9999999999999982 (not shortest)
- 9.999999999999999 (not correctly rounded)
- 9.999999999999997 (not accurate)
- 10.0 (not accurate)
You can easily test this in Python or JavaScript, both follow these principles for double-to-string conversion.
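The same example can also be checked in C, assuming a correctly rounded libc printf (the fixed-precision %g calls are only for comparison, not a way to get the shortest form in general):

```c
#include <stdio.h>

int main(void) {
    double x = 9.9999999999999982236431605997495353221893310546875;
    printf("%.17g\n", x); /* 9.9999999999999982 : accurate, but not shortest */
    printf("%.16g\n", x); /* 9.999999999999998  : the shortest round-trip form */
    printf("%.15g\n", x); /* 10                 : no longer accurate */
    return 0;
}
```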
@ibireme The three principles mentioned above are theoretically correct, but no double-to-string conversion algorithm can fully achieve them in practice.
A practical floating-point-to-string algorithm should meet the following principles:
- Repeatedly performing dtoa(strtod(str)) should converge to a specific string.
- After performing dtoa(strtod(str)), the result should be as close to the original str as possible, allowing for slight inaccuracies in the last digit (the 16th digit) due to floating-point precision limits.
- Within the supported precision range, the converted string should be as short as possible.
If you need faster performance, you can choose ldouble. If you require higher precision in the 16th digit, you can opt for Dragonbox.
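A quick sketch of the convergence criterion above, using snprintf("%.16g") merely as a stand-in for the dtoa under test:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if repeatedly applying strtod followed by the formatter reaches
   a fixed point within max_iter rounds. */
static int converges(const char *str, int max_iter) {
    char cur[64], next[64];
    snprintf(cur, sizeof cur, "%s", str);
    for (int i = 0; i < max_iter; i++) {
        double v = strtod(cur, NULL);
        snprintf(next, sizeof next, "%.16g", v); /* stand-in for dtoa */
        if (strcmp(cur, next) == 0) return 1;    /* fixed point reached */
        memcpy(cur, next, sizeof cur);
    }
    return 0;
}
```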
@lengjingzju It sounds like you... didn't understand the S&W criteria correctly?
> no double-to-string conversion algorithm can fully achieve them in practice.
There absolutely are algorithms that fully achieve them. S&W themselves provided one, and Ryu, Schubfach, and Dragonbox all do so too. Other modern algorithms as well, I suppose. Maybe you are thinking of string-to-double conversion instead of double-to-string?
> Accurate: Parsing the string back gives the original number (strtod(dtoa(num)) == num).
This just means exactly what's written (provided that strtod is correct), so there is no way of misunderstanding it, I guess.
> Shortest: The output uses the fewest digits possible without losing accuracy.
Here "losing accuracy" simply means that strtod(dtoa(num)) == num fails to hold, nothing more than that. Note that the true value of a binary floating-point number may have very long (several hundreds, excluding all leading/trailing zeros) number of digits when precisely written in decimal. But dtoa only needs to print out at most 17 digits for any IEEE-754 binary64 input num to satisfy strtod(dtoa(num)) == num. We don't call it "losing accuracy" even though there still are hundreds of digits that are not printed out. (S&W called these remaining digits "garbage digits"). As long as strtod(dtoa(num)) == num holds, there is no loss of accuracy, according to the definition of "loss of accuracy" in this context.
> Correctly rounded: The output is the closest possible to the original number.
For some inputs num, there may be several possible outputs of dtoa that satisfy strtod(dtoa(num)) == num and have the shortest length. The correct rounding criterion above means that, when that is the case, dtoa shall prefer the one that is closest to the true value represented by num.
Note that these criteria quite fundamentally don't allow users to control the number of digits printed, because that number is already dictated by the criteria and the given input.
It's possible to invent your own spec that compromises between these two conflicting requirements (S&W-style shortest output versus a user-controlled digit count), but there is no widely adopted "standard" or "natural" rule for doing so, as far as I can tell. There is not a lot of real demand for such a compromise either, I think.
Also, it's worth mentioning that, if the output satisfying the S&W criteria has n digits, those n digits do not necessarily coincide with the correct output of printf configured to print out n digits.
I understand. Actually, for a double, "9.1999999999999999" and "9.2" are equal, but printing it as "9.2" is better.
With the patch https://github.com/lengjingzju/json/commit/086b84907cf6de05a26b335697f8dc5b3ce85a76, 9.999999999999998 is now printed as "9.999999999999998".
When APPROX_TAIL_CMP_VAL is set to 1, it can produce the shortest representation but cannot handle the rounding problem (where the 16th digit is 1 or 9); when it is set to 0, the opposite occurs.
@lengjingzju If you understand that, then why did you say that it's impossible to fully achieve S&W's criteria in practice?
Maybe you misunderstood the correct rounding criterion? In S&W's context, the correct rounding criterion does not mean that the output is correctly rounded in the usual sense of rounding at a specific digit position. That usual sense of correct rounding does conflict with the no-information-loss & shortness criteria, which is why S&W-style algorithms do not produce the same output as printf even when they spit out the same number of digits.
I can implement a correctly rounded method, but the optimal approach for maximum speed needs to be carefully considered. The basic principle is as follows:
- Let the binary significand be a, the additional term be b, the lookup multiplier be c, and the divisor be d. Then the decimal significand (correctly rounded but not necessarily the shortest) is calculated as m = ((a + b) * c) / d.
- In reality, any value n in the range ((a + b) * c - c) / d < n < ((a + b) * c + c) / d is a valid decimal significand. The shortest decimal representation must be selected from this interval (see the sketch after this list).
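A rough scalar sketch of that last selection step, assuming the bounds have already been scaled to integers with lo < mid < hi, where mid is the correctly rounded significand m (overflow handling omitted):

```c
#include <stdint.h>

/* From the open interval (lo, hi), pick the integer divisible by the largest
   power of 10; among those, prefer the one closest to mid. Trailing zeros of
   the result are dropped when the digits are printed, giving the shortest form. */
static uint64_t shortest_in_interval(uint64_t lo, uint64_t hi, uint64_t mid) {
    uint64_t pow10 = 1;
    for (;;) {
        uint64_t next = pow10 * 10;
        uint64_t cand = (lo / next + 1) * next; /* smallest multiple of next above lo */
        if (cand >= hi) break;                  /* no multiple of next inside (lo, hi) */
        pow10 = next;
    }
    uint64_t down = mid / pow10 * pow10;        /* multiples of pow10 around mid */
    uint64_t up   = down + pow10;
    if (down <= lo) return up;
    if (up >= hi)   return down;
    return (mid - down <= up - mid) ? down : up;
}
```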