fastapprox-rs
Use fused mul-add instructions where possible
The Rust compiler does not currently fuse a separate multiply and add into an FMA instruction in the majority of cases. It is therefore recommended to call the f32::mul_add method explicitly so that FMA instructions can be used. On machines with FMA available this can provide a significant speedup in some cases, and because a fused mul-add rounds only once, it is also more accurate than a separate floating-point multiply followed by an add.
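As a minimal sketch of the kind of rewrite involved (the function names and the polynomial here are illustrative assumptions, not code from this crate), a Horner-style polynomial can be written with separate operations or with f32::mul_add. The last function shows the single-rounding effect: the fused form recovers the rounding error of a product, which the plain form loses.

```rust
/// Evaluates c0 + c1*x + c2*x^2 in Horner form with separate multiplies
/// and adds. Each intermediate result is rounded to f32.
/// (Hypothetical example function, not part of fastapprox-rs.)
pub fn poly_plain(x: f32, c0: f32, c1: f32, c2: f32) -> f32 {
    (c2 * x + c1) * x + c0
}

/// Same polynomial using f32::mul_add. `a.mul_add(b, c)` computes
/// a * b + c with a single rounding, and can compile to one FMA
/// instruction when the target supports it.
pub fn poly_fused(x: f32, c0: f32, c1: f32, c2: f32) -> f32 {
    c2.mul_add(x, c1).mul_add(x, c0)
}

/// Returns the rounding error of the f32 product x * x.
/// With separate operations, `x * x - p` would simply be zero; the
/// fused form evaluates x * x - p before rounding, so the error survives.
pub fn square_residual(x: f32) -> f32 {
    let p = x * x; // rounded product
    x.mul_add(x, -p) // exact product minus rounded product, rounded once
}
```

Whether mul_add is actually faster depends on the target: without hardware FMA support enabled at compile time (for example via a target-cpu or target-feature setting), it may fall back to a slower software routine, so benchmarking on the intended hardware is advisable.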
Notable improvements include:
- 15% speedup on cos_fast
- 13% speedup on cos_faster
- 22% speedup on cosfull_fast
- 12% speedup on cosfull_faster
- 14% speedup on digamma_fast
- 62% speedup on erf_fast
- 12% speedup on erf_inv_fast
- 64% speedup on erfc_fast
- 20% speedup on exp_faster
- 10% speedup on lambertwexpx_fast and lambertwexpx_faster
- 31% speedup on ln_gamma_fast
- 15% speedup on ln_gamma_faster
- 15% speedup on sin_fast
- 10% speedup on sin_faster
- 23% speedup on sinfull_fast
- 14% speedup on sinfull_faster
- 16% speedup on tan_fast
- 18% speedup on tan_faster
- 16% speedup on tanfull_fast
- 24% speedup on tanfull_faster
There is one notable regression: pow_fast, which goes from 8,651 to 9,579 ns/iter (about 11% slower). I'm not sure what is causing that one.
Benchmarks before:
test cos_fast ... bench: 2,079 ns/iter (+/- 13)
test cos_faster ... bench: 1,036 ns/iter (+/- 7)
test cosfull_fast ... bench: 4,918 ns/iter (+/- 18)
test cosfull_faster ... bench: 3,433 ns/iter (+/- 10)
test cosh_fast ... bench: 8,592 ns/iter (+/- 23)
test cosh_faster ... bench: 2,715 ns/iter (+/- 13)
test digamma_fast ... bench: 3,101 ns/iter (+/- 6)
test digamma_faster ... bench: 1,830 ns/iter (+/- 2)
test digamma_special ... bench: 12,366 ns/iter (+/- 162)
test digamma_statrs ... bench: 13,919 ns/iter (+/- 65)
test erf_fast ... bench: 16,029 ns/iter (+/- 50)
test erf_faster ... bench: 1,723 ns/iter (+/- 7)
test erf_inv_fast ... bench: 3,383 ns/iter (+/- 20)
test erf_inv_faster ... bench: 1,552 ns/iter (+/- 5)
test erf_inv_statrs ... bench: 690 ns/iter (+/- 5)
test erf_special ... bench: 2,687 ns/iter (+/- 12)
test erf_statrs ... bench: 5,183 ns/iter (+/- 35)
test erfc_fast ... bench: 16,266 ns/iter (+/- 37)
test erfc_faster ... bench: 1,575 ns/iter (+/- 8)
test erfc_special ... bench: 5,200 ns/iter (+/- 19)
test exp_fast ... bench: 4,034 ns/iter (+/- 16)
test exp_faster ... bench: 1,333 ns/iter (+/- 13)
test lambertw_fast ... bench: 31,768 ns/iter (+/- 42)
test lambertw_faster ... bench: 17,991 ns/iter (+/- 42)
test lambertwexpx_fast ... bench: 10,370 ns/iter (+/- 15)
test lambertwexpx_faster ... bench: 4,436 ns/iter (+/- 22)
test ln_fast ... bench: 1,727 ns/iter (+/- 5)
test ln_faster ... bench: 774 ns/iter (+/- 2)
test ln_gamma_fast ... bench: 6,012 ns/iter (+/- 11)
test ln_gamma_faster ... bench: 1,967 ns/iter (+/- 5)
test ln_gamma_special ... bench: 10,542 ns/iter (+/- 124)
test ln_gamma_statrs ... bench: 19,344 ns/iter (+/- 42)
test log2_fast ... bench: 1,638 ns/iter (+/- 5)
test log2_faster ... bench: 775 ns/iter (+/- 4)
test pow2_fast ... bench: 3,760 ns/iter (+/- 6)
test pow2_faster ... bench: 976 ns/iter (+/- 10)
test pow_fast ... bench: 8,651 ns/iter (+/- 40)
test pow_faster ... bench: 2,662 ns/iter (+/- 9)
test sigmoid_fast ... bench: 4,824 ns/iter (+/- 7)
test sigmoid_faster ... bench: 1,857 ns/iter (+/- 5)
test sin_fast ... bench: 1,825 ns/iter (+/- 17)
test sin_faster ... bench: 1,011 ns/iter (+/- 3)
test sinfull_fast ... bench: 4,494 ns/iter (+/- 14)
test sinfull_faster ... bench: 3,130 ns/iter (+/- 9)
test sinh_fast ... bench: 8,270 ns/iter (+/- 421)
test sinh_faster ... bench: 2,707 ns/iter (+/- 9)
test tan_fast ... bench: 3,157 ns/iter (+/- 9)
test tan_faster ... bench: 1,983 ns/iter (+/- 5)
test tanfull_fast ... bench: 6,775 ns/iter (+/- 16)
test tanfull_faster ... bench: 4,956 ns/iter (+/- 21)
test tanh_fast ... bench: 5,683 ns/iter (+/- 34)
test tanh_faster ... bench: 2,281 ns/iter (+/- 6)
After:
test cos_fast ... bench: 1,768 ns/iter (+/- 22)
test cos_faster ... bench: 902 ns/iter (+/- 4)
test cosfull_fast ... bench: 3,818 ns/iter (+/- 9)
test cosfull_faster ... bench: 3,006 ns/iter (+/- 12)
test cosh_fast ... bench: 8,778 ns/iter (+/- 35)
test cosh_faster ... bench: 2,714 ns/iter (+/- 15)
test digamma_fast ... bench: 2,659 ns/iter (+/- 11)
test digamma_faster ... bench: 1,775 ns/iter (+/- 7)
test digamma_special ... bench: 12,466 ns/iter (+/- 73)
test digamma_statrs ... bench: 13,886 ns/iter (+/- 97)
test erf_fast ... bench: 6,062 ns/iter (+/- 58)
test erf_faster ... bench: 1,729 ns/iter (+/- 17)
test erf_inv_fast ... bench: 2,967 ns/iter (+/- 15)
test erf_inv_faster ... bench: 1,420 ns/iter (+/- 6)
test erf_inv_statrs ... bench: 689 ns/iter (+/- 3)
test erf_special ... bench: 2,682 ns/iter (+/- 14)
test erf_statrs ... bench: 5,311 ns/iter (+/- 182)
test erfc_fast ... bench: 5,834 ns/iter (+/- 24)
test erfc_faster ... bench: 1,572 ns/iter (+/- 17)
test erfc_special ... bench: 5,206 ns/iter (+/- 35)
test exp_fast ... bench: 3,921 ns/iter (+/- 12)
test exp_faster ... bench: 1,060 ns/iter (+/- 11)
test lambertw_fast ... bench: 34,830 ns/iter (+/- 59)
test lambertw_faster ... bench: 17,537 ns/iter (+/- 33)
test lambertwexpx_fast ... bench: 9,299 ns/iter (+/- 29)
test lambertwexpx_faster ... bench: 3,994 ns/iter (+/- 11)
test ln_fast ... bench: 1,584 ns/iter (+/- 6)
test ln_faster ... bench: 820 ns/iter (+/- 5)
test ln_gamma_fast ... bench: 4,125 ns/iter (+/- 19)
test ln_gamma_faster ... bench: 1,686 ns/iter (+/- 6)
test ln_gamma_special ... bench: 10,540 ns/iter (+/- 76)
test ln_gamma_statrs ... bench: 19,331 ns/iter (+/- 55)
test log2_fast ... bench: 1,482 ns/iter (+/- 6)
test log2_faster ... bench: 819 ns/iter (+/- 5)
test pow2_fast ... bench: 3,642 ns/iter (+/- 15)
test pow2_faster ... bench: 972 ns/iter (+/- 9)
test pow_fast ... bench: 9,579 ns/iter (+/- 29)
test pow_faster ... bench: 2,518 ns/iter (+/- 8)
test sigmoid_fast ... bench: 4,840 ns/iter (+/- 24)
test sigmoid_faster ... bench: 1,854 ns/iter (+/- 5)
test sin_fast ... bench: 1,557 ns/iter (+/- 6)
test sin_faster ... bench: 906 ns/iter (+/- 7)
test sinfull_fast ... bench: 3,461 ns/iter (+/- 19)
test sinfull_faster ... bench: 2,682 ns/iter (+/- 11)
test sinh_fast ... bench: 8,587 ns/iter (+/- 37)
test sinh_faster ... bench: 2,712 ns/iter (+/- 14)
test tan_fast ... bench: 2,644 ns/iter (+/- 7)
test tan_faster ... bench: 1,618 ns/iter (+/- 5)
test tanfull_fast ... bench: 5,671 ns/iter (+/- 13)
test tanfull_faster ... bench: 3,747 ns/iter (+/- 13)
test tanh_fast ... bench: 5,352 ns/iter (+/- 18)
test tanh_faster ... bench: 2,280 ns/iter (+/- 7)
Hi @shssoichiro! First of all, thank you for this contribution. It's amazing that you were able to make these computations even faster without changing the underlying principles.
But I'd suggest putting the new code using f32::mul_add into a separate module to preserve full parity with the original C library for the fast and faster modules. We can call the new module fused (because why not). If you do not have time to implement it yourself, I can take it from here.
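A rough sketch of what that layout could look like, assuming a placeholder function: scale_add and its body are hypothetical stand-ins for illustration, not functions from the crate or the C library.

```rust
/// Stand-in for an existing module: separate multiply and add, left
/// untouched to keep parity with the original C library.
pub mod fast {
    /// Hypothetical placeholder computing a * x + y with two roundings.
    pub fn scale_add(a: f32, x: f32, y: f32) -> f32 {
        a * x + y
    }
}

/// Proposed new module: same signatures, but built on f32::mul_add.
pub mod fused {
    /// Hypothetical placeholder computing a * x + y with one rounding.
    pub fn scale_add(a: f32, x: f32, y: f32) -> f32 {
        a.mul_add(x, y)
    }
}
```

Keeping identical signatures in both modules would let callers switch between them by changing only the import path.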
Thanks again for bringing this PR, and sorry that it took me so long to review it.