fast-math
Implement fast tanh
Partially addresses #1.
Codecov Report
Merging #6 into master will increase coverage by 0.36%. The diff coverage is 95.69%.
@@ Coverage Diff @@
## master #6 +/- ##
==========================================
+ Coverage 94.35% 94.72% +0.36%
==========================================
Files 5 6 +1
Lines 248 341 +93
==========================================
+ Hits 234 323 +89
- Misses 14 18 +4
Impacted Files | Coverage Δ |
---|---|
src/lib.rs | 0% <ø> (ø) :arrow_up: |
src/tanh.rs | 95.69% <95.69%> (ø) |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 704ae2e...f0f0549. Read the comment docs.
I tried the implementations you suggested:
Current implementation:
```
scalar/tanh/baseline   time: [6.2435 ns 6.2553 ns 6.2696 ns]
scalar/tanh/raw        time: [78.785 ns 78.925 ns 79.074 ns]
scalar/tanh/full       time: [101.70 ns 101.85 ns 102.01 ns]
scalar/tanh/std        time: [362.90 ns 363.99 ns 365.11 ns]
vector/tanh/baseline   time: [4.8044 ns 4.8112 ns 4.8177 ns]
vector/tanh/raw        time: [79.245 ns 79.367 ns 79.482 ns]
vector/tanh/full       time: [99.708 ns 99.943 ns 100.22 ns]
vector/tanh/std        time: [365.35 ns 366.00 ns 366.72 ns]
```
Suggested clipping:
```
scalar/tanh/baseline   time: [6.2427 ns 6.2547 ns 6.2670 ns]
scalar/tanh/raw        time: [81.526 ns 81.854 ns 82.318 ns]
scalar/tanh/full       time: [96.473 ns 96.680 ns 96.911 ns]
scalar/tanh/std        time: [359.96 ns 360.61 ns 361.29 ns]
vector/tanh/baseline   time: [4.8269 ns 4.8392 ns 4.8526 ns]
vector/tanh/raw        time: [86.026 ns 86.170 ns 86.317 ns]
vector/tanh/full       time: [96.642 ns 96.800 ns 96.959 ns]
vector/tanh/std        time: [356.28 ns 357.21 ns 358.32 ns]
```
It looks like there might be a small improvement, but the results are noisy: `tanh/raw` and `tanh/std` shifted even though their code was not changed. I'm not sure the gain is worth introducing the discontinuity.
Suggested clipping with optimized lower-order approximation:
```
scalar/tanh/baseline   time: [6.1683 ns 6.1784 ns 6.1889 ns]
scalar/tanh/raw        time: [37.826 ns 37.905 ns 37.987 ns]
scalar/tanh/full       time: [33.853 ns 33.925 ns 34.004 ns]
scalar/tanh/std        time: [386.83 ns 387.95 ns 389.08 ns]
vector/tanh/baseline   time: [4.8231 ns 4.8306 ns 4.8384 ns]
vector/tanh/raw        time: [9.4154 ns 9.4496 ns 9.4851 ns]
vector/tanh/full       time: [10.176 ns 10.203 ns 10.235 ns]
vector/tanh/std        time: [356.60 ns 357.30 ns 358.03 ns]
```
This seems to result in good performance improvements (54% for `scalar/tanh/raw`, and 90% for the vectorized code).
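The combination can also be written branchlessly, clamping the input instead of branching on it, which keeps the body easy for the compiler to autovectorize and would help explain the large vector wins. This sketch uses the classic order-5 continued-fraction truncation x(27 + x²)/(27 + 9x²); the coefficients actually optimized for the 0.0057 tolerance may differ.

```rust
// Branchless sketch of clipping + low-order approximation (illustrative
// coefficients, not the PR's optimized ones). Clamping with min/max
// instead of branches keeps this straightforward to autovectorize.
#[inline]
fn tanh_lower_order(x: f32) -> f32 {
    // This particular approximant reaches exactly ±1 at x = ±3,
    // so clamping there keeps the function continuous.
    let x = x.clamp(-3.0, 3.0);
    let x2 = x * x;
    x * (27.0 + x2) / (27.0 + 9.0 * x2)
}
```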
`exp`-based implementation:
```
scalar/tanh/baseline   time: [6.2289 ns 6.2519 ns 6.2815 ns]
scalar/tanh/raw        time: [38.523 ns 38.623 ns 38.736 ns]
scalar/tanh/full       time: [38.654 ns 38.774 ns 38.902 ns]
scalar/tanh/std        time: [360.46 ns 361.12 ns 361.82 ns]
vector/tanh/baseline   time: [4.8622 ns 4.8726 ns 4.8830 ns]
vector/tanh/raw        time: [10.374 ns 10.434 ns 10.518 ns]
vector/tanh/full       time: [13.021 ns 13.056 ns 13.093 ns]
vector/tanh/std        time: [363.48 ns 366.64 ns 371.14 ns]
```
This is a bit slower than the truncated continued fraction.
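For reference, the `exp`-based form rests on the identity tanh(x) = 1 − 2/(e^{2x} + 1), so it costs one exponential, one divide, and two adds per element. This sketch calls std's `exp` for illustration; an actual implementation would substitute a fast vectorizable `exp`, so it shows the shape of the computation, not its speed.

```rust
// tanh via a single exponential: tanh(x) = 1 - 2 / (exp(2x) + 1).
// Saturation falls out naturally: exp(2x) overflows to +inf for large x
// (giving 1.0) and underflows to 0 for very negative x (giving -1.0).
fn tanh_via_exp(x: f32) -> f32 {
    1.0 - 2.0 / ((2.0 * x).exp() + 1.0)
}
```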
Should I switch to the implementation optimized for the 0.0057 error tolerance?