Multithreading issues
I only recently downloaded the library and am not sure what the expected result is, but currently the benchmarks involving multiple threads do not show any improvement for me.
Possibly related: in the following, fn.cpustring() returns None, so it seems threading is not enabled? Calling fn.thread_enable() does not seem to enable it either, though.
import fast_numpy_loops as fn
fn.cpustring()
It appears you have to run fn.initialize() first for fn.cpustring() to return a result.
https://quansight.github.io/numpy-threading-extensions/stable/use.html
Thanks, fn.cpustring() does work once fn.initialize() has been called. Sorry about missing that in the docs.
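For anyone else hitting this, the working sequence is just (the printed string will of course differ per machine):

import fast_numpy_loops as fn

fn.initialize()       # required first; cpustring() returns None otherwise
print(fn.cpustring())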
Regarding performance, I do not expect all functions to benefit from multithreading, but thought perhaps some cases such as exp would. However, if I run asv run -b UFunc_exp using the branch corresponding to #71, I get nearly identical performance regardless of the number of threads:
[ 0.00%] ·· Benchmarking virtualenv-py3.8-numpy
[ 16.67%] ··· Running (bench_ufunc.UFunc_exp.time_ufunc_types--)...
[ 66.67%] ··· bench_ufunc.UFunc_exp.time_ufunc_types ok
[ 66.67%] ··· ========== ============
               nthreads
              ---------- ------------
                  0       27.6±0.2ms
                  2       27.4±0.3ms
                  4       27.5±0.3ms
              ========== ============
Is this consistent with what others are seeing?
cpustring: **CPU: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz AVX2:1 BMI2:1 0x7ffefbbf 0xbfebfbff 0xd39ffffb 0x00000000
I have learned a few more things (at least when testing on my computer):
- If hyperthreading is on (which we have started detecting; see the quick check after this list), we only use every other core, because when two threads run on the same core it slows things down by about 10%.
- If the array fits inside the L2 cache (we can start returning L1/L2/L3 cache sizes in cpuinfo) and the array operation is simple and fast (like adding two floats), there may be no speedup, because the main thread alone can already pull from L2 cache at top speed.
- Some common array operations, such as multiplying int64 values, are not vectorized.
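As an aside, one quick way to check whether hyperthreading is on from Python (this uses psutil, a third-party package that is not part of pnumpy; just an illustration):

import os
import psutil  # third-party; not part of pnumpy

logical = os.cpu_count()                    # logical CPUs (hardware threads)
physical = psutil.cpu_count(logical=False)  # physical cores (may be None)
print("hyperthreading on:", physical is not None and logical > physical)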
Below: a little over 2x speedup (with 3 extra threads); int64 multiply is not vectorized:
In [31]: pn.thread_disable()
In [32]: x=np.arange(100_000, dtype=np.int64)
In [33]: y=x.copy()
In [34]: c=x+y
In [35]: %timeit np.multiply(x,y,out=c)
87.4 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [36]: pn.thread_enable()
In [37]: %timeit np.multiply(x,y,out=c)
35.6 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Now I make arrays that are larger than the L3 cache size: here 100 million floats × 4 bytes per float = 400 MB per array. Below: a little over 2x speedup (with 3 extra threads) -- the cache is blown:
In [38]: x=np.arange(100_000_000, dtype=np.float32)
In [39]: y=x.copy()
In [40]: c=x+y
In [41]: %timeit np.add(x,y,out=c)
40.4 ms ± 82.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [42]: pn.thread_disable()
In [43]: %timeit np.add(x,y,out=c)
85.6 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Below: about a 4x speedup (with 3 extra threads); these arrays fit inside the L3 cache but outside the L2 cache on my computer:
In [47]: pn.thread_enable()
In [48]: x=np.arange(1_000_000, dtype=np.float32); y=x.copy(); c=x+y
In [49]: %timeit np.add(x,y,out=c)
109 µs ± 755 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [50]: pn.thread_disable()
In [51]: %timeit np.add(x,y,out=c)
408 µs ± 4.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Below, with small arrays that are reused (so they stay hot in cache), the speedup shrinks to only about 1.2x:
In [66]: pn.thread_disable()
In [67]: x=np.arange(50_000, dtype=np.float32); y=x.copy(); c=x+y
In [68]: %timeit np.add(x,x,out=c)
14.6 µs ± 139 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [70]: pn.thread_enable()
In [71]: %timeit np.add(x,x,out=c)
11.9 µs ± 66.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
This also demonstrates the problem with simple benchmarks vs. a real application. When an application uses many smaller arrays that together do not fit in the L2 cache, breaking the work into smaller blocks may let each block of arguments fit in the L2 cache of its own CPU, providing a speedup. That strategy would be very complicated to implement in pnumpy as a numpy add-on, and would be easier in a framework like dask.
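To make the blocking idea concrete, here is a minimal sketch for 1-D arrays (illustrative only: the 256 KiB block size is an assumption, and pnumpy does not actually do this):

import numpy as np

def blocked_add(x, y, out, block_bytes=256 * 1024):
    # Process the arrays in cache-sized chunks so each chunk stays in L2.
    step = max(1, block_bytes // x.itemsize)
    for i in range(0, x.size, step):
        np.add(x[i:i + step], y[i:i + step], out=out[i:i + step])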
ping @jack-pappas to see if there is a way to get asv to reflect the performance increase when using multiple threads. Perhaps the benchmark should create an out array up front and then call the ufunc as res = ufunc(..., out=out)? Or is the 1024x1024 2D array simply not the right shape for pnumpy's optimizations? In any case, once we figure out what is going on we should document the targeted use cases for pnumpy.
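Something like the following is what I have in mind (a sketch only; the class and method names mirror the existing benchmark suite, but the details are assumed):

import numpy as np

class UFunc_exp:
    def setup(self):
        # Pre-allocate input and output once, so the timed call measures
        # only the ufunc inner loop and not memory allocation.
        self.x = np.random.random((1024, 1024))
        self.out = np.empty_like(self.x)

    def time_ufunc_types(self):
        np.exp(self.x, out=self.out)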
@mattip There is indeed -- I've just opened PR #107 with some changes to fix how threading is handled in the benchmarks.