laser
Exponential: Dual Xeon Gold 6154 result
cc @laurae2
Warmup: 0.9938 s, result 224 (displayed to avoid compiler optimizing warmup away)
A - tensor shape: [5000000]
Required number of operations: 5.000 millions
Required bytes: 20.000 MB
Arithmetic intensity: 0.250 FLOP/byte
Theoretical peak single-core: 86.400 GFLOP/s
Theoretical peak multi: 172.800 GFLOP/s
a[0]: -9.999997138977051
Baseline <math.h>
Collected 300 samples in 5.625 seconds
Average time: 17.733 ms
Stddev time: 0.054 ms
Min time: 17.698 ms
Max time: 18.148 ms
Perf: 0.282 GEXPOP/s
Display output[0] to make sure it's not optimized away
4.540005829767324e-05
SSE mathfun
Collected 300 samples in 2.094 seconds
Average time: 6.021 ms
Stddev time: 0.043 ms
Min time: 5.976 ms
Max time: 6.544 ms
Perf: 0.830 GEXPOP/s
Display output[0] to make sure it's not optimized away
4.540006193565205e-05
SSE fast_exp_sse (low order polynomial)
Collected 300 samples in 1.211 seconds
Average time: 3.073 ms
Stddev time: 0.062 ms
Min time: 3.019 ms
Max time: 3.734 ms
Perf: 1.627 GEXPOP/s
Display output[0] to make sure it's not optimized away
4.545032061287202e-05
AVX2 fmath
Collected 300 samples in 1.060 seconds
Average time: 2.558 ms
Stddev time: 0.067 ms
Min time: 2.473 ms
Max time: 3.056 ms
Perf: 1.955 GEXPOP/s
Display output[0] to make sure it's not optimized away
4.540006193565205e-05
AVX2 FMA Minimax
Collected 300 samples in 1.042 seconds
Average time: 2.492 ms
Stddev time: 0.076 ms
Min time: 2.383 ms
Max time: 3.050 ms
Perf: 2.006 GEXPOP/s
Display output[0] to make sure it's not optimized away
4.539992369245738e-05
AVX2 mathfun
Collected 300 samples in 1.307 seconds
Average time: 3.382 ms
Stddev time: 0.067 ms
Min time: 3.275 ms
Max time: 3.906 ms
Perf: 1.478 GEXPOP/s
Display output[0] to make sure it's not optimized away
4.540006193565205e-05
AVX+FMA Schraudolph-approx
Collected 300 samples in 0.933 seconds
Average time: 2.128 ms
Stddev time: 0.066 ms
Min time: 2.062 ms
Max time: 2.607 ms
Perf: 2.350 GEXPOP/s
Display output[0] to make sure it's not optimized away
4.625692963600159e-05
Bench SIMD Math Prims
Collected 300 samples in 4.826 seconds
Average time: 15.079 ms
Stddev time: 0.189 ms
Min time: 14.945 ms
Max time: 15.937 ms
Perf: 0.332 GEXPOP/s
Display output[0] to make sure it's not optimized away
4.539986548479646e-05
That's pretty good for exp. For log, it is usually helpful to take advantage of the fact that most CPUs can do the domain/range reduction for you, down to a single factor of two. Specifically, any IEEE float (single, double, probably even half-precision) has an exponent field which tells you the logarithm base 2 up to the mantissa field. So, if you are willing to assume an IEEE float representation, then you only actually need a good, fast approximation of log_2 on [1/2, 1) or [1, 2). (log base e, of course, is just a re-scaled log_2.)
In the past I've done the mantissa part with a mixture of Taylor series near 1 and Padé approximants further away. That got down from about 60 to about 17 cycles on Intel Core-series CPUs from Nehalem to Skylake, respectively, for a single-precision log with roughly 1e-7 relative error.
There may be a faster approach than Taylor/Padé, but it seems very likely the "IEEE exponent field" trick would be applicable to whatever you do, if you're willing to assume IEEE. I also perused your references and didn't see it mentioned. They seem more exp-focused, but log is generally more poorly optimized by various stdlibs/vendors in my experience.
(edit: clarified performance mention in time and accuracy)
Log focus is coming; I haven't gotten to that yet.
Well, when you do get to log, I suggest these blog posts from an eBay engineer as a starting point (that material didn't exist back when I was figuring this out for myself :-) ):
https://www.ebayinc.com/stories/blogs/tech/fast-approximate-logarithms-part-i-the-basics/
https://www.ebayinc.com/stories/blogs/tech/fast-approximate-logarithms-part-ii-rounding-error/
https://www.ebayinc.com/stories/blogs/tech/fast-approximate-logarithms-part-iii-the-formulas/
He never really mentions Padé approximants, though his rational approximations are similar. He argues for a [3/4, 6/4) interval. (As mentioned, any [t, 2t) works: whatever gets you a fast, well-approximated "shifted mantissa".) Given that you already have all this Intel-specific SIMD code, assuming IEEE (even little-endian IEEE) is not much of a stretch.
Particularly with log, there can be unavoidable speed/accuracy trade-offs. So, you may want a Nim call that accepts a target precision (and conceivably even returns an error estimate), defaulting to 23 bits for float32 and 51 bits for float64, and let the call site request a larger error when faster performance is desired and lower accuracy is acceptable in context.