fastapprox-rs

Use fused mul-add instructions where possible

Open • shssoichiro opened this issue 2 years ago • 1 comment

The Rust compiler does not yet emit FMA instructions for separate multiply-and-add expressions in the majority of cases. It is therefore recommended to call the f32::mul_add method explicitly so that FMA instructions can be used. On machines with FMA available this can provide a significant speedup in some cases, and a fused mul-add is also more accurate than a separate floating point mul followed by an add, because the result is rounded only once.
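To make the suggestion concrete, here is a minimal sketch of the kind of rewrite involved. It is not code from this crate: the function names `horner_unfused` and `horner_fused` are made up for illustration, and the polynomial is a generic stand-in for the approximation kernels.

```rust
/// Evaluates c[0] + c[1]*x + c[2]*x^2 + ... with Horner's scheme,
/// using a separate multiply and add at each step (two roundings per step).
/// Illustrative helper, not part of fastapprox-rs.
pub fn horner_unfused(x: f32, coeffs: &[f32]) -> f32 {
    coeffs.iter().rev().fold(0.0_f32, |acc, &c| acc * x + c)
}

/// Same polynomial, but each step is a single `f32::mul_add`,
/// i.e. `acc * x + c` computed with one rounding.
/// Illustrative helper, not part of fastapprox-rs.
pub fn horner_fused(x: f32, coeffs: &[f32]) -> f32 {
    coeffs.iter().rev().fold(0.0_f32, |acc, &c| acc.mul_add(x, c))
}
```

For example, `horner_fused(2.0, &[1.0, 2.0, 3.0])` evaluates 1 + 2x + 3x^2 at x = 2. Whether `mul_add` compiles to a single instruction depends on the build target; on x86-64 this generally requires the `fma` target feature to be enabled (for example via `-C target-cpu=native`), otherwise the call may fall back to a slower software routine.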

Notable improvements include:

- 15% speedup on cos_fast
- 13% speedup on cos_faster
- 22% speedup on cosfull_fast
- 12% speedup on cosfull_faster
- 14% speedup on digamma_fast
- 62% speedup on erf_fast
- 12% speedup on erf_inv_fast
- 64% speedup on erfc_fast
- 20% speedup on exp_faster
- 10% speedup on lambertwexpx_fast and lambertwexpx_faster
- 31% speedup on ln_gamma_fast
- 15% speedup on ln_gamma_faster
- 15% speedup on sin_fast
- 10% speedup on sin_faster
- 23% speedup on sinfull_fast
- 14% speedup on sinfull_faster
- 16% speedup on tan_fast
- 18% speedup on tan_faster
- 16% speedup on tanfull_fast
- 24% speedup on tanfull_faster

There is one notable regression: pow_fast, which goes from 8,651 to 9,579 ns/iter (about 11% slower). I'm not really sure what's going on with that one.

Benchmarks before:

test cos_fast            ... bench:       2,079 ns/iter (+/- 13)
test cos_faster          ... bench:       1,036 ns/iter (+/- 7)
test cosfull_fast        ... bench:       4,918 ns/iter (+/- 18)
test cosfull_faster      ... bench:       3,433 ns/iter (+/- 10)
test cosh_fast           ... bench:       8,592 ns/iter (+/- 23)
test cosh_faster         ... bench:       2,715 ns/iter (+/- 13)
test digamma_fast        ... bench:       3,101 ns/iter (+/- 6)
test digamma_faster      ... bench:       1,830 ns/iter (+/- 2)
test digamma_special     ... bench:      12,366 ns/iter (+/- 162)
test digamma_statrs      ... bench:      13,919 ns/iter (+/- 65)
test erf_fast            ... bench:      16,029 ns/iter (+/- 50)
test erf_faster          ... bench:       1,723 ns/iter (+/- 7)
test erf_inv_fast        ... bench:       3,383 ns/iter (+/- 20)
test erf_inv_faster      ... bench:       1,552 ns/iter (+/- 5)
test erf_inv_statrs      ... bench:         690 ns/iter (+/- 5)
test erf_special         ... bench:       2,687 ns/iter (+/- 12)
test erf_statrs          ... bench:       5,183 ns/iter (+/- 35)
test erfc_fast           ... bench:      16,266 ns/iter (+/- 37)
test erfc_faster         ... bench:       1,575 ns/iter (+/- 8)
test erfc_special        ... bench:       5,200 ns/iter (+/- 19)
test exp_fast            ... bench:       4,034 ns/iter (+/- 16)
test exp_faster          ... bench:       1,333 ns/iter (+/- 13)
test lambertw_fast       ... bench:      31,768 ns/iter (+/- 42)
test lambertw_faster     ... bench:      17,991 ns/iter (+/- 42)
test lambertwexpx_fast   ... bench:      10,370 ns/iter (+/- 15)
test lambertwexpx_faster ... bench:       4,436 ns/iter (+/- 22)
test ln_fast             ... bench:       1,727 ns/iter (+/- 5)
test ln_faster           ... bench:         774 ns/iter (+/- 2)
test ln_gamma_fast       ... bench:       6,012 ns/iter (+/- 11)
test ln_gamma_faster     ... bench:       1,967 ns/iter (+/- 5)
test ln_gamma_special    ... bench:      10,542 ns/iter (+/- 124)
test ln_gamma_statrs     ... bench:      19,344 ns/iter (+/- 42)
test log2_fast           ... bench:       1,638 ns/iter (+/- 5)
test log2_faster         ... bench:         775 ns/iter (+/- 4)
test pow2_fast           ... bench:       3,760 ns/iter (+/- 6)
test pow2_faster         ... bench:         976 ns/iter (+/- 10)
test pow_fast            ... bench:       8,651 ns/iter (+/- 40)
test pow_faster          ... bench:       2,662 ns/iter (+/- 9)
test sigmoid_fast        ... bench:       4,824 ns/iter (+/- 7)
test sigmoid_faster      ... bench:       1,857 ns/iter (+/- 5)
test sin_fast            ... bench:       1,825 ns/iter (+/- 17)
test sin_faster          ... bench:       1,011 ns/iter (+/- 3)
test sinfull_fast        ... bench:       4,494 ns/iter (+/- 14)
test sinfull_faster      ... bench:       3,130 ns/iter (+/- 9)
test sinh_fast           ... bench:       8,270 ns/iter (+/- 421)
test sinh_faster         ... bench:       2,707 ns/iter (+/- 9)
test tan_fast            ... bench:       3,157 ns/iter (+/- 9)
test tan_faster          ... bench:       1,983 ns/iter (+/- 5)
test tanfull_fast        ... bench:       6,775 ns/iter (+/- 16)
test tanfull_faster      ... bench:       4,956 ns/iter (+/- 21)
test tanh_fast           ... bench:       5,683 ns/iter (+/- 34)
test tanh_faster         ... bench:       2,281 ns/iter (+/- 6)

After:

test cos_fast            ... bench:       1,768 ns/iter (+/- 22)
test cos_faster          ... bench:         902 ns/iter (+/- 4)
test cosfull_fast        ... bench:       3,818 ns/iter (+/- 9)
test cosfull_faster      ... bench:       3,006 ns/iter (+/- 12)
test cosh_fast           ... bench:       8,778 ns/iter (+/- 35)
test cosh_faster         ... bench:       2,714 ns/iter (+/- 15)
test digamma_fast        ... bench:       2,659 ns/iter (+/- 11)
test digamma_faster      ... bench:       1,775 ns/iter (+/- 7)
test digamma_special     ... bench:      12,466 ns/iter (+/- 73)
test digamma_statrs      ... bench:      13,886 ns/iter (+/- 97)
test erf_fast            ... bench:       6,062 ns/iter (+/- 58)
test erf_faster          ... bench:       1,729 ns/iter (+/- 17)
test erf_inv_fast        ... bench:       2,967 ns/iter (+/- 15)
test erf_inv_faster      ... bench:       1,420 ns/iter (+/- 6)
test erf_inv_statrs      ... bench:         689 ns/iter (+/- 3)
test erf_special         ... bench:       2,682 ns/iter (+/- 14)
test erf_statrs          ... bench:       5,311 ns/iter (+/- 182)
test erfc_fast           ... bench:       5,834 ns/iter (+/- 24)
test erfc_faster         ... bench:       1,572 ns/iter (+/- 17)
test erfc_special        ... bench:       5,206 ns/iter (+/- 35)
test exp_fast            ... bench:       3,921 ns/iter (+/- 12)
test exp_faster          ... bench:       1,060 ns/iter (+/- 11)
test lambertw_fast       ... bench:      34,830 ns/iter (+/- 59)
test lambertw_faster     ... bench:      17,537 ns/iter (+/- 33)
test lambertwexpx_fast   ... bench:       9,299 ns/iter (+/- 29)
test lambertwexpx_faster ... bench:       3,994 ns/iter (+/- 11)
test ln_fast             ... bench:       1,584 ns/iter (+/- 6)
test ln_faster           ... bench:         820 ns/iter (+/- 5)
test ln_gamma_fast       ... bench:       4,125 ns/iter (+/- 19)
test ln_gamma_faster     ... bench:       1,686 ns/iter (+/- 6)
test ln_gamma_special    ... bench:      10,540 ns/iter (+/- 76)
test ln_gamma_statrs     ... bench:      19,331 ns/iter (+/- 55)
test log2_fast           ... bench:       1,482 ns/iter (+/- 6)
test log2_faster         ... bench:         819 ns/iter (+/- 5)
test pow2_fast           ... bench:       3,642 ns/iter (+/- 15)
test pow2_faster         ... bench:         972 ns/iter (+/- 9)
test pow_fast            ... bench:       9,579 ns/iter (+/- 29)
test pow_faster          ... bench:       2,518 ns/iter (+/- 8)
test sigmoid_fast        ... bench:       4,840 ns/iter (+/- 24)
test sigmoid_faster      ... bench:       1,854 ns/iter (+/- 5)
test sin_fast            ... bench:       1,557 ns/iter (+/- 6)
test sin_faster          ... bench:         906 ns/iter (+/- 7)
test sinfull_fast        ... bench:       3,461 ns/iter (+/- 19)
test sinfull_faster      ... bench:       2,682 ns/iter (+/- 11)
test sinh_fast           ... bench:       8,587 ns/iter (+/- 37)
test sinh_faster         ... bench:       2,712 ns/iter (+/- 14)
test tan_fast            ... bench:       2,644 ns/iter (+/- 7)
test tan_faster          ... bench:       1,618 ns/iter (+/- 5)
test tanfull_fast        ... bench:       5,671 ns/iter (+/- 13)
test tanfull_faster      ... bench:       3,747 ns/iter (+/- 13)
test tanh_fast           ... bench:       5,352 ns/iter (+/- 18)
test tanh_faster         ... bench:       2,280 ns/iter (+/- 7)
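
For reference, the percentages quoted for each function are the relative reduction in ns/iter between the two runs. A small sketch of that arithmetic, where the helper name `speedup_percent` is an assumption made for illustration:

```rust
/// Percentage reduction in time per iteration between two benchmark runs.
/// A positive value is a speedup; a negative value is a regression.
/// Illustrative helper, not part of fastapprox-rs.
pub fn speedup_percent(before_ns: f64, after_ns: f64) -> f64 {
    (before_ns - after_ns) / before_ns * 100.0
}
```

With the erf_fast figures, `speedup_percent(16029.0, 6062.0)` rounds to 62, and with the pow_fast figures, `speedup_percent(8651.0, 9579.0)` rounds to -11.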

shssoichiro, Nov 15 '22 09:11

Hi @shssoichiro! First of all, thank you for this contribution. It's amazing that you were able to make these computations even faster without changing the underlying principles.

But I'd suggest putting the new code that uses f32::mul_add into a separate module, to preserve full parity with the original C library in the fast and faster modules. We can call the new module fused (because why not). If you do not have time to implement it yourself, I can take it from here.
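
As a rough illustration of that layout, and only as a sketch: the function `axpb` below is a hypothetical stand-in rather than a routine from the crate, and the real modules would hold the actual approximation functions.

```rust
// Hypothetical layout: `faster` keeps the arithmetic written the same way
// as the C library, while `fused` holds the `mul_add` variant.
pub mod faster {
    /// Placeholder kernel: a * x + b with a separate multiply and add.
    pub fn axpb(a: f32, x: f32, b: f32) -> f32 {
        a * x + b
    }
}

pub mod fused {
    /// Same placeholder kernel expressed with `f32::mul_add`,
    /// so the result is rounded once instead of twice.
    pub fn axpb(a: f32, x: f32, b: f32) -> f32 {
        a.mul_add(x, b)
    }
}
```

Callers who need results matching the C library would keep using `faster::axpb`, and those who prefer the FMA path would opt in through `fused::axpb`.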

Thanks again for opening this PR, and sorry that it took me so long to review it.

loony-bean, Aug 27 '23 19:08