fast-math

Implement fast tanh

Open vks opened this issue 6 years ago • 3 comments

Partially addresses #1.

vks avatar Jan 08 '19 12:01 vks

Codecov Report

Merging #6 into master will increase coverage by 0.36%. The diff coverage is 95.69%.


@@            Coverage Diff             @@
##           master       #6      +/-   ##
==========================================
+ Coverage   94.35%   94.72%   +0.36%     
==========================================
  Files           5        6       +1     
  Lines         248      341      +93     
==========================================
+ Hits          234      323      +89     
- Misses         14       18       +4
Impacted Files   Coverage Δ
src/lib.rs       0% <ø> (ø)
src/tanh.rs      95.69% <95.69%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 704ae2e...f0f0549.

codecov-io avatar Jan 13 '19 00:01 codecov-io

I tried the implementations you suggested:

Current implementation:

scalar/tanh/baseline    time:   [6.2435 ns 6.2553 ns 6.2696 ns]                                  
scalar/tanh/raw         time:   [78.785 ns 78.925 ns 79.074 ns]                            
scalar/tanh/full        time:   [101.70 ns 101.85 ns 102.01 ns]                             
scalar/tanh/std         time:   [362.90 ns 363.99 ns 365.11 ns]                            

vector/tanh/baseline    time:   [4.8044 ns 4.8112 ns 4.8177 ns]                                  
vector/tanh/raw         time:   [79.245 ns 79.367 ns 79.482 ns]                            
vector/tanh/full        time:   [99.708 ns 99.943 ns 100.22 ns]                             
vector/tanh/std         time:   [365.35 ns 366.00 ns 366.72 ns]                            

Suggested clipping:

scalar/tanh/baseline    time:   [6.2427 ns 6.2547 ns 6.2670 ns]                                  
scalar/tanh/raw         time:   [81.526 ns 81.854 ns 82.318 ns]                            
scalar/tanh/full        time:   [96.473 ns 96.680 ns 96.911 ns]                             
scalar/tanh/std         time:   [359.96 ns 360.61 ns 361.29 ns]                            

vector/tanh/baseline    time:   [4.8269 ns 4.8392 ns 4.8526 ns]                                  
vector/tanh/raw         time:   [86.026 ns 86.170 ns 86.317 ns]                            
vector/tanh/full        time:   [96.642 ns 96.800 ns 96.959 ns]                             
vector/tanh/std         time:   [356.28 ns 357.21 ns 358.32 ns]

It looks like there might be a small improvement, but the results are noisy: tanh/raw and tanh/std shifted even though their code was not touched. I'm not sure it's worth introducing the discontinuity.
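For reference, the clipping variant can be sketched roughly like this. The cutoff and the rational approximation (the classic truncated continued fraction, not the crate's tuned constants) are illustrative stand-ins; the small jump where the approximation hands over to the clamped value is the discontinuity in question.

```rust
/// Hedged sketch of the "clipping" idea: evaluate a cheap rational
/// approximation near zero and clamp to the saturated values +/-1 beyond
/// a cutoff. Constants here are illustrative, not the crate's.
fn tanh_clipped(x: f32) -> f32 {
    // Illustrative saturation threshold; the approximation below crosses
    // 1 near x ≈ 2.3, so clipping there keeps the jump small (~1.6e-3).
    const CUTOFF: f32 = 2.3;
    if x > CUTOFF {
        1.0
    } else if x < -CUTOFF {
        -1.0
    } else {
        // Truncated continued fraction: tanh(x) ≈ x*(15 + x²)/(15 + 6x²).
        let x2 = x * x;
        x * (15.0 + x2) / (15.0 + 6.0 * x2)
    }
}
```

The branch-free vectorized form would replace the `if` chain with a min/max clamp, but the trade-off (speed vs. the jump at the cutoff) is the same.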

Suggested clipping with an optimized lower-order approximation:

scalar/tanh/baseline    time:   [6.1683 ns 6.1784 ns 6.1889 ns]                                  
scalar/tanh/raw         time:   [37.826 ns 37.905 ns 37.987 ns]                             
scalar/tanh/full        time:   [33.853 ns 33.925 ns 34.004 ns]                              
scalar/tanh/std         time:   [386.83 ns 387.95 ns 389.08 ns]                            

vector/tanh/baseline    time:   [4.8231 ns 4.8306 ns 4.8384 ns]                                  
vector/tanh/raw         time:   [9.4154 ns 9.4496 ns 9.4851 ns]                             
vector/tanh/full        time:   [10.176 ns 10.203 ns 10.235 ns]                              
vector/tanh/std         time:   [356.60 ns 357.30 ns 358.03 ns]                            

This seems to result in good performance improvements (54% for scalar/tanh/raw, and 90% for the vectorized code).
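To illustrate the accuracy/cost trade-off behind the "lower-order approximation", here are two textbook truncations of the continued fraction tanh(x) = x/(1 + x²/(3 + x²/(5 + ...))). These are the standard truncations, not the crate's error-optimized coefficients; the lower-order form saves a couple of multiply-adds at the price of a larger maximum error.

```rust
// Continued fraction truncated after the "5" term:
// tanh(x) ≈ x*(15 + x²)/(15 + 6x²). Cheap, but error grows past |x| ≈ 1.
fn tanh_cf5(x: f32) -> f32 {
    let x2 = x * x;
    x * (15.0 + x2) / (15.0 + 6.0 * x2)
}

// Truncated after the "7" term:
// tanh(x) ≈ x*(105 + 10x²)/(105 + 45x² + x⁴). One more order, noticeably
// more accurate, at the cost of extra multiply-adds.
fn tanh_cf7(x: f32) -> f32 {
    let x2 = x * x;
    x * (105.0 + 10.0 * x2) / (105.0 + 45.0 * x2 + x2 * x2)
}
```

An error-optimized variant would tweak these coefficients to minimize the maximum error over the clipped input range rather than matching the Taylor expansion at zero.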

exp-based implementation:

scalar/tanh/baseline    time:   [6.2289 ns 6.2519 ns 6.2815 ns]                                  
scalar/tanh/raw         time:   [38.523 ns 38.623 ns 38.736 ns]                             
scalar/tanh/full        time:   [38.654 ns 38.774 ns 38.902 ns]                              
scalar/tanh/std         time:   [360.46 ns 361.12 ns 361.82 ns]                            

vector/tanh/baseline    time:   [4.8622 ns 4.8726 ns 4.8830 ns]                                  
vector/tanh/raw         time:   [10.374 ns 10.434 ns 10.518 ns]                             
vector/tanh/full        time:   [13.021 ns 13.056 ns 13.093 ns]                              
vector/tanh/std         time:   [363.48 ns 366.64 ns 371.14 ns]                            

This is a bit slower than the truncated continued fraction.
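The exp-based formulation rewrites tanh in terms of a single exponential, tanh(x) = (e²ˣ − 1)/(e²ˣ + 1) = 1 − 2/(e²ˣ + 1). A minimal sketch, with `std`'s `exp` standing in for whatever fast exp approximation the crate would actually reuse:

```rust
// tanh via one exponential: tanh(x) = 1 - 2/(exp(2x) + 1).
// Saturates correctly at both ends: exp(2x) -> inf gives 1, and
// exp(2x) -> 0 gives -1. std's exp is a stand-in for a fast exp here.
fn tanh_via_exp(x: f32) -> f32 {
    1.0 - 2.0 / ((2.0 * x).exp() + 1.0)
}
```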

vks avatar Mar 06 '19 17:03 vks

Should I switch to the implementation optimized for 0.0057 error tolerance?

vks avatar Mar 06 '19 18:03 vks