xjb714

Results: 33 comments by xjb714

I compile my code with icpx 2025.0.4 at -O3. clang++ seems to produce slightly slower code than icpx, and g++ slower still. Generally speaking,...

@ibireme Your f64_bin_to_dec function can be optimized in this way. In tests on my CPU, it saves roughly **3 to 4** cycles. ![Image](https://github.com/user-attachments/assets/3b29ad5c-d411-4bee-bcc6-cf40b1b20ab1)

@ibireme Thank you for your information. Your algorithm does perform better on most hardware and does not depend on a specific instruction set. Your algorithm is really ingenious, and I...

@ibireme Here is a performance optimization for yy_double_to_string; the source code is linked below. [yy_double copy.txt](https://github.com/user-attachments/files/19147418/yy_double.copy.txt) It improves performance in the random-length test.

@ibireme There is a mistake in the code above. Please use this file instead. [yy_double copy.txt](https://github.com/user-attachments/files/19147701/yy_double.copy.txt)

@ibireme It seems fine to use the following code instead, which calculates the number of trailing zeros in the lower 16 digits. num8_1 ``` u64 tz1...

@ibireme This code implements the same function, and the compiled output contains no branch instructions, so the branch misprediction penalty is avoided. See: https://godbolt.org/z/fEa5oGcjo

It seems that performance can be improved by using a larger lookup table, which requires **4KB** to store the ASCII codes for '000' through '999' and **4.94KB** to store exp_dec...
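For readers unfamiliar with the idea, here is a minimal sketch of a three-digit ASCII lookup table. It is not the code from the comment: the names `digit3_table`, `digit3_table_init`, and `write_3_digits` are hypothetical, and the layout (1000 entries padded to 4 bytes each, 4000 bytes in total) is only an assumption about how the roughly 4KB figure might arise.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table: entry i holds the three ASCII digits of i plus one
   padding byte, so each entry is 4 bytes and the table is 4000 bytes. */
static char digit3_table[1000][4];

/* Fill the table once at startup (a real implementation would more likely
   use a precomputed constant array). */
static void digit3_table_init(void) {
    for (int i = 0; i < 1000; i++) {
        digit3_table[i][0] = (char)('0' + i / 100);
        digit3_table[i][1] = (char)('0' + (i / 10) % 10);
        digit3_table[i][2] = (char)('0' + i % 10);
        digit3_table[i][3] = '\0'; /* padding byte, doubles as a terminator */
    }
}

/* Write the three digits of n (0..999) with a single 4-byte copy instead of
   three divisions. The caller must provide at least 4 writable bytes.
   Returns a pointer just past the third digit. */
static char *write_3_digits(char *out, uint32_t n) {
    assert(n < 1000);
    memcpy(out, digit3_table[n], 4);
    return out + 3;
}
```

After calling `digit3_table_init()`, `write_3_digits(buf, 42)` leaves `"042"` in `buf`. The trade-off is the usual one for larger tables: fewer arithmetic instructions per group of digits, at the cost of more cache footprint.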

@ibireme Bug fix. Location: the **write_u64_len_1_to_16** function.

Before:
```
u64 e10_tmp = e10_tmp_table[__builtin_clzll(val)];
```
After:
```
u64 e10_tmp = e10_tmp_table[__builtin_clzll(val|1)];
```
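A likely reason for this change is that GCC and Clang document the result of `__builtin_clzll(0)` as undefined, so a zero `val` could produce an arbitrary table index. OR-ing in the lowest bit makes the argument nonzero without changing the result for any nonzero input. The sketch below isolates that idiom; the helper names `clz64_safe` and `clz64_safe_selfcheck` are hypothetical and do not appear in the original code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Hypothetical helper: count leading zeros with a defined result for 0.
   Bit 0 is never above the highest set bit of a nonzero value, so
   (val | 1) has the same leading-zero count as val whenever val != 0.
   For val == 0 the result is 63, the same as for val == 1. */
static inline unsigned clz64_safe(u64 val) {
    return (unsigned)__builtin_clzll(val | 1);
}

/* Small self-check showing that the result always stays in 0..63, which
   makes it usable as an index into a 64-entry table. */
static void clz64_safe_selfcheck(void) {
    assert(clz64_safe(0) == 63);
    assert(clz64_safe(1) == 63);
    assert(clz64_safe(2) == 62);
    assert(clz64_safe(UINT64_MAX) == 0);
}
```

With this form, zero shares the table entry used for `val == 1`, which is usually the desired behavior when the table maps bit length to a decimal digit estimate.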

@ibireme The pow10 lookup table is about 10KB, which is necessary, but in practice only 1KB or even less of that 10KB may actually be used, and the decimal exponents...