A faster double to string algorithm
The following code contains only the float/double to decimal step; it does not include decimal to string. Algorithm source code and demonstration link: https://onlinegdb.com/OPKdOpikG. This algorithm is based on the Schubfach algorithm and is faster than the Dragonbox algorithm. Thank you.
If you are interested, I will publish a paper introducing the principle of the algorithm.
Thanks for sharing. Do I understand correctly that you are the author of this implementation? I would recommend adding it to https://github.com/fmtlib/dtoa-benchmark first so that it can be compared to other algorithms. A paper describing the algorithm would definitely be useful for those interested in using it.
Yes, I'm the original author. Code link: https://github.com/xjb714/xjb. A proof of the correctness of the algorithm linked above:
Algorithm flow:
float can be proven correct by exhaustive testing and has passed that test; double requires a rigorous mathematical proof and has passed random-number testing. This is a draft of my paper, provided for reference. My ability is limited, so it may not be well written; please understand. My English is not very good, and some of the content was produced by translation software.
Regarding performance: the gcc compiler may behave anomalously when measuring float performance (the median value in the figure below is 7.84), which is significantly worse than icpx and clang. After reviewing the assembly code, I found that the functions in the loop are not inlined and contain parts that seriously hurt performance, such as a `call memcpy`. The benchmark program might need to be modified. My time and proficiency are limited, so I don't know how to solve this for now.
Here is a performance test result chart for float/double to decimal:
Thanks to the authors of yyjson and of the Dragonbox, Schubfach, and Grisu algorithms. The principles and inspiration of this algorithm come from those works. I hope the code I provide is helpful to everyone.
Thank you for reading.
In fact, a complete floating-point printing algorithm consists of two stages: (1) binary to decimal and (2) decimal to string. The algorithm above does not include the second stage. That stage still poses real challenges: it incurs many branch mispredictions and costly multiplications, and it accounts for a considerable share of the total printing time.
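To make the split concrete, here is a naive sketch of stage (2) under assumed inputs: `d` and `k` are the shortest decimal significand and exponent produced by stage (1), so that v = d · 10^k. The function name and the scientific-notation layout are mine for illustration, not taken from any of the implementations discussed here.

```cpp
#include <cstdint>
#include <cstdio>

// Naive stage (2): render the shortest pair (d, k), v = d * 10^k, in
// scientific notation. The digit loop and branches below are exactly the
// kind of work that fast implementations replace with two-digit lookup
// tables and SIMD.
char* decimal_to_string(uint64_t d, int k, char* buf) {
    char digits[20];
    int n = 0;
    do { digits[n++] = '0' + d % 10; d /= 10; } while (d != 0);
    int exp10 = k + n - 1;        // decimal exponent of the leading digit
    *buf++ = digits[--n];         // leading digit
    if (n > 0) {                  // fractional part, if any
        *buf++ = '.';
        while (n > 0) *buf++ = digits[--n];
    }
    return buf + std::sprintf(buf, "e%d", exp10);
}
```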
cc @jk-jeon in case you find this interesting
@vitaut Thanks. I'm aware of this and planning to look into it later.
I didn't do a thorough review, but I think the algorithm is quite likely correct.
@xjb714 could you submit a PR to add your algorithm to https://github.com/fmtlib/dtoa-benchmark?
@vitaut I'm currently finishing up the code. I expect it will take a few more days.
I have a question: there should be a theoretically optimal solution for converting a floating-point number to the shortest string, but in some cases the optimal solution is not very readable.
There are two ways to print floating-point numbers: %lf and %le. %lf produces fixed-point output, and %le produces scientific notation. For example, the double value 123.45 prints as "123.45" and "1.2345e2" respectively; here the %lf output is shorter than the %le output.
The shorter of the %lf and %le outputs should be preferred; if their lengths are equal, choose %lf.
Suppose the floating-point number $v$ is converted to its shortest decimal representation $d \cdot 10^k$ (one that satisfies the Steele & White criteria), where $d$ has the fewest significant digits. Let $a = \operatorname{len}(d)$ be the number of decimal significant digits of $d$, and let $b$ be the decimal exponent of the leading digit:
$$ v = c \cdot 2^q \Rightarrow d \cdot 10^k, \quad d \in \mathbb{N}^+; \quad a = \operatorname{len}(d); \quad b = k + \lfloor \log_{10} d \rfloor $$
In fact, the following condition can be used to determine whether to output in the %lf format (here $[\cdot]$ is 1 when the bracketed condition holds and 0 otherwise):
$$ a + 1 + [b \geq 4] + [b \geq 10] \geq |b| \quad \text{and} \quad -3 \leq b \leq 20 $$
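For concreteness, here is a direct transcription of this condition into C++; `prefer_fixed` is a hypothetical helper name of my own, and the `(b >= 4)` terms are the 0/1 indicators from the formula:

```cpp
#include <cstdlib>

// a: number of significant decimal digits; b: decimal exponent of the
// leading digit. Returns true when the %lf (fixed-point) form is chosen.
bool prefer_fixed(int a, int b) {
    return a + 1 + (b >= 4) + (b >= 10) >= std::abs(b)
           && -3 <= b && b <= 20;
}
```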
Here are a few examples:
| double | %lf | %le | a | b | shortest |
|--------|-----|-----|---|---|----------|
| 123.45 | "123.45" | "1.2345e2" | 5 | 2 | %lf |
| 1000 | "1000" | "1e3" | 1 | 3 | %le |
| 123 | "123" | "1.23e2" | 3 | 2 | %lf |
| 123000 | "123000" | "1.23e5" | 3 | 5 | %lf |
| 1.2e23 | "12000...0" | "1.2e23" | 2 | 23 | %le |
| 1e100 | "1000...0" | "1e100" | 1 | 100 | %le |
| 0.0123 | "0.0123" | "1.23e-2" | 3 | -2 | %lf |
| 0.001 | "0.001" | "1e-3" | 1 | -3 | %le |
For instance, the %lf result "1000" for the value 1000 is more readable than the %le result "1e3" (I think most people would agree), but the %le output is shorter than the %lf one.
Different algorithms use different strategies for choosing between %lf and %le. How should I decide whether to format the output as %lf or %le?
@xjb714 Welcome, it's yet another can of worms in this business.
- Milo's dtoa-benchmark assumes that the "correct" output must be the one with the smallest number of characters. Many JSON libraries written in that era seem to follow that practice.
- The current C++ standard specifies that the default option of `std::format` and `std::to_chars` must print the string with the smallest number of characters, except that its meaning of "smallest" is subtly different from Milo's. IIRC it assumes that when the exponential form is used, the exponent always needs to include the sign and must consist of at least two digits. So, for instance, `123` is supposed to be `1.23e+02` when printed in the exponent format; `1.23e2` is not allowed (see the sketch after this list).
- On the other hand, `fmt::format` has never worked that way. It chooses fixed-point vs. exponential solely based on the value of the decimal exponent, regardless of which one results in a shorter string. IIRC the exact threshold has been changed one or more times over the history.
- Finally, Victor and I recently wrote a paper that proposes to change the specification of `std::format` and `std::to_chars` to match the behavior of `fmt::format`, which is arguably more natural and intuitive.
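A small demonstration of the `std::to_chars` behavior described in the second bullet, as I read the current specification (the outputs in the comments are what I would expect, not verified against every implementation):

```cpp
#include <charconv>
#include <cstdio>

// Print the shortest round-trip representation chosen by std::to_chars.
void print_shortest(double v) {
    char buf[64];
    auto [p, ec] = std::to_chars(buf, buf + sizeof buf, v);
    (void)ec;
    std::fwrite(buf, 1, p - buf, stdout);
    std::fputc('\n', stdout);
}

int main() {
    print_shortest(123.0);     // expected "123": fixed form is not longer
    print_shortest(100000.0);  // expected "1e+05": scientific is shorter,
                               // with mandatory sign and two exponent digits
}
```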
As a side note, the current specification of `std::format`/`std::to_chars` has a funny side effect of mandating "garbage digits" to be printed in certain circumstances, which has noticeable performance implications. You can read more about that in the linked paper.
TL;DR Unless you have a strong reason not to, I think you should use fixed-point when the exponent is small and exponential when the exponent is large.
Choosing between them based on character length is wrong in my humble opinion, especially when garbage digits are involved. I recommend reading the linked paper, which explains some of the reasons why we prefer this option.
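In code, the recommended policy is just a range check on the decimal exponent; the bounds below are placeholders of my own choosing, not fmt's exact cutoffs (which have changed over time):

```cpp
// Use fixed-point only when the decimal exponent falls in a small range;
// the thresholds here are illustrative, not fmt's actual values.
bool use_fixed_point(int exp10) {
    return exp10 >= -4 && exp10 < 16;
}
```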
@xjb714 Ah, btw I just noticed that you are using Alexander Bolz's implementation of Dragonbox. I was wondering why you are not including Dragonbox for the float benchmark and now I see the reason.
That implementation is based on an older version of Dragonbox which is a bit slower than the current one. I don't think the current version will outperform Yaoyuan Guo's or yours, but in any case there is no reason to test against an outdated version.
@vitaut I have implemented the first version. I have submitted a PR to https://github.com/fmtlib/dtoa-benchmark.
I haven't tested whether it runs under the MSVC compiler. Currently it compiles and runs normally under the icpx, clang, and gcc compilers. However, performance under clang is relatively poor, and I am investigating the cause. Performance is best under icpx, followed by gcc, and worst under clang.
You can also get the code from https://github.com/xjb714/dtoa-benchmark-fmtlib; run `git checkout v2` to switch to the v2 branch containing the implementation.
@xjb714, thank you!
I've merged https://github.com/fmtlib/dtoa-benchmark/pull/4 and the results look very promising. One question regarding
> double requires strict mathematical proof and has passed the random number test
Does it mean that the correctness for double hasn't been proven yet?
@vitaut Floating-point printing involves two steps: (1) Binary to decimal. Following the Schubfach algorithm, there are a total of four candidate values for this step. The core of this algorithm is to quickly determine which candidate to select by using equivalent conditions; this is described in the paper. (2) Decimal to string. For this step, please refer to the code implementation: first compute the ASCII digit values, then count the trailing zeros to obtain the number of significant digits, and finally remove the insignificant digits. The implementation uses the NEON and SSE2 instruction sets to accelerate the ASCII computation, and it tries to minimize branch mispredictions.
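As a scalar illustration of the trailing-zero step, here is a sketch of my own (the actual implementation does this with NEON/SSE2 rather than a loop):

```cpp
#include <cstdint>

// Count trailing decimal zeros of the significand so that only the
// significant digits are emitted; a scalar stand-in for the SIMD version.
int trailing_decimal_zeros(uint64_t d) {
    int n = 0;
    while (d != 0 && d % 10 == 0) {
        d /= 10;
        ++n;
    }
    return n;
}
```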
In fact, this algorithm also supports AVX2 and AVX-512 implementations on x64. Through SIMD and instruction-level parallelism, it maximizes the utilization of CPU performance. However, its application scenarios are limited; it is probably only worthwhile when a large number of floating-point numbers need to be printed. The expected function interface is `xjb64_32(const double* v, char* buf, int* len);`. The length of `buf` should be 32 × 32 = 1024 bytes: the function prints 32 floating-point numbers and stores the string lengths in the `len` array. It is equivalent to `for (int i = 0; i < 32; ++i) len[i] = xjb64(v[i], buf + 32*i) - (buf + 32*i);`, but several times faster. The implementation would be rather complex and not very general. If I have time later, I'll write a demo for you to see.
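A sketch of how the proposed batched interface would be called, based purely on the description above (`xjb64_32` is the proposed function and, as far as I know, is not yet published):

```cpp
// Hypothetical prototype from the description above; not yet published.
void xjb64_32(const double* v, char* buf, int* len);

void example() {
    double values[32] = {1.5, 2.25, 1e100};  // remaining entries are 0.0
    char buf[32 * 32];                       // 32 output slots, 32 bytes each
    int len[32];
    xjb64_32(values, buf, len);
    // The string for values[i] starts at buf + 32 * i, with length len[i].
}
```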
@xjb714, I've started looking at the paper and I think the $\frac{3}{4}$ factor should be under the logarithm in equation (10), i.e. it should be $\lg(\frac{3}{4} \cdot 2^q)$, not $\frac{3}{4} \cdot q \cdot \lg 2$. Then it will match equation (11), where `131237 >> 20` is a fixed-point approximation of $\lg\frac{3}{4}$, if I understand correctly.
(It's actually a bit more complicated than what I initially wrote - I corrected the formula in my comment.)
@vitaut You were reading an outdated document; that version contains many typos. Please see the latest document at https://github.com/xjb714/xjb.
Great, thank you! I do see my comment addressed in the new revision.
@xjb714, would you be willing to contribute your implementation under the {fmt} license? If yes then we can add it as an option (behind a configuration flag) in {fmt} especially since some of the tables can be reused between this algorithm and Dragonbox. And of course we'll credit/refer to your project from the {fmt} code.
@vitaut Sure, you can use this algorithm for free, but the code is still being optimized. The goal is to be compatible with mainstream compilers such as MSVC, gcc, icpx, and clang, and with the x64 and ARM platforms. The first version from a few days ago was not final. The lookup tables used by dragonbox_comp and dragonbox_full should also be applicable to my algorithm; please look into them. My algorithm uses other lookup tables as well, and there are also implementations that do not rely on lookup tables, though they may run a little slower. I have also written a printing algorithm for 32-bit floating-point numbers, but I haven't uploaded it yet.

Currently, for double printing, performance under the clang compiler lags significantly behind icpx, and gcc is also slightly behind icpx; studying the generated assembly may reveal the reason. The runtime ratio across the icpx, gcc, and clang compilers is approximately 9:12:17. The main factor for gcc is that it compiles certain statements into branches rather than cmov instructions, and the code clang produces has a very low IPC. I'm investigating the causes, and it might take a relatively long time.
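To illustrate the branch-vs-cmov point with a generic example (not the algorithm's actual code): both functions below select between two values, but depending on the compiler the first may become a conditional branch while the second is branch-free by construction.

```cpp
#include <cstdint>

// May compile to a branch or a cmov, at the compiler's discretion.
inline uint64_t select_ternary(bool c, uint64_t a, uint64_t b) {
    return c ? a : b;
}

// Branch-free: build an all-ones/all-zeros mask from the condition.
inline uint64_t select_masked(bool c, uint64_t a, uint64_t b) {
    uint64_t mask = -static_cast<uint64_t>(c);  // c ? ~0ull : 0ull
    return b ^ (mask & (a ^ b));
}
```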
Related discussion: https://github.com/jk-jeon/dragonbox/issues/73
BTW if the algorithm is based on @ibireme's yy_double which seems to be the case according to https://github.com/ibireme/yyjson/issues/237, I think it should be acknowledged in the paper and the differences or other contributions (e.g. theoretical proofs if any) pointed out.
@vitaut The source code and documentation have been updated: https://github.com/xjb714/xjb
@vitaut
Algorithm correctness proof code:
- float: https://github.com/xjb714/f2dec_bench/blob/main/main.cpp#L488 (`void check_all_float_number();`)
- double: https://github.com/xjb714/f2dec_bench/blob/main/main.cpp#L612 (`void check_double();`)
Hello, everyone.
Regarding the code implementation of my double-to-string algorithm: one particular statement causes a significant performance degradation under the clang compiler on x64, roughly from 9-10 ns to 17-18 ns. The source location is https://github.com/xjb714/xjb/blob/main/dtoa_xjb64_xjb32.cpp#L2145:
`*(u64*)buf = exp_result;`
Commenting out this line yields roughly the same performance as the icpx compiler, i.e. 9-10 ns.
This is merely a store to memory; why does performance drop so significantly? I would like to understand the reason. The icpx and gcc compilers do not have this problem. Does anyone know the cause, and how to fix the poor performance under clang?
Not sure about the perf, but this cast is also UB. The portable way to do this without UB is to use memcpy, which should be optimized similarly.
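For reference, a sketch of the memcpy form (mainstream compilers typically lower this to the same single 8-byte store at -O1 and above):

```cpp
#include <cstdint>
#include <cstring>

// Portable equivalent of *(u64*)buf = exp_result; without the
// strict-aliasing/alignment UB of the cast.
inline void store_u64(char* buf, uint64_t exp_result) {
    std::memcpy(buf, &exp_result, sizeof exp_result);
}
```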
> double: https://github.com/xjb714/f2dec_bench/blob/main/main.cpp#L612 `void check_double();`
It's nice to test this but I don't think it can qualify as a proof because it only checks a subset of doubles. An exhaustive test that could prove correctness would be prohibitively expensive.
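For contrast, the float case mentioned earlier in the thread is exhaustively checkable, since there are only 2^32 bit patterns. A sketch of the standard iteration follows; `check_one` is a placeholder for the actual round-trip verification:

```cpp
#include <cstdint>
#include <cstring>

// Visit every one of the 2^32 float bit patterns exactly once.
// Doing the same for double (2^64 patterns) is infeasible.
void check_all_floats(void (*check_one)(float)) {
    uint32_t bits = 0;
    do {
        float f;
        std::memcpy(&f, &bits, sizeof f);
        check_one(f);
    } while (++bits != 0);
}
```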