pnumpy
Parallel NumPy seamlessly speeds up NumPy for large arrays (64K+ elements) with no change required to existing code.
In NumPy, I notice intrinsics being checked in by [xiegengxin](https://github.com/numpy/numpy/commits?author=xiegengxin). However, it is difficult to tell which routines are best because they are also checked in to this project....
At work, in the BIOS we disable hyperthreading and turn off the power-saving CPU C-states. If we have multiple NUMA nodes, we sometimes run one process on NUMA node...
I tried to make a signature with two inputs (int32, int32) returning a float64 output, and this failed. I notice some inputs, like int16, return float32. On sqrt, some inputs, like int8,...
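One way to see which loop NumPy's type resolution picks for a given input dtype is to probe the ufunc directly. The sketch below is only an illustration using public NumPy calls; the helper name `sqrt_output_dtypes` is hypothetical and is not part of pnumpy.

```python
import numpy as np

def sqrt_output_dtypes(names=("int8", "int16", "int32", "int64")):
    """Return {input dtype name: output dtype name} for np.sqrt.

    Hypothetical helper for illustration; it only calls public NumPy APIs.
    """
    result = {}
    for name in names:
        x = np.array([4], dtype=name)
        result[name] = np.sqrt(x).dtype.name
    return result

if __name__ == "__main__":
    # The registered loops, as type-character signatures such as 'd->d'.
    print(np.sqrt.types)
    # The output dtype NumPy actually selects for each integer input.
    for in_name, out_name in sqrt_output_dtypes().items():
        print(in_name, "->", out_name)
```

Comparing the printed `types` list with the selected output dtypes shows which casts the resolver applies before calling a loop, which may help when deciding what signatures a replacement loop needs to register.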
Need to experiment more with pragmas and compiler targets. We want the loader code to be compiled normally. We want the AVX2 (256-bit instructions) code to be compiled with -avx2...
We can thread ufuncs we do not understand. For a binary_reduce on a large array, we can divide the work up, assigning each work chunk to a thread. Each work...
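The chunking idea can be sketched in pure Python. This is not pnumpy's implementation (which lives in C); it is a minimal illustration that treats the ufunc as a black box and assumes the operation is associative (for example `np.add` or `np.maximum`). The function name `chunked_reduce` and the parameters `n_workers` and `min_chunk` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def chunked_reduce(ufunc, arr, n_workers=4, min_chunk=65536):
    """Reduce a 1-D array with a binary ufunc, one chunk per worker thread.

    Assumes the ufunc is associative; floating-point sums may differ
    from a serial reduce in the last bits because of reordering.
    """
    arr = np.ascontiguousarray(arr).ravel()
    # Small arrays are not worth the threading overhead.
    if n_workers < 2 or arr.size < 2 * min_chunk:
        return ufunc.reduce(arr)
    n_chunks = min(n_workers, arr.size // min_chunk)
    chunks = np.array_split(arr, n_chunks)  # views, no copies
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each thread reduces its own chunk to a single partial result.
        partials = list(pool.map(ufunc.reduce, chunks))
    # Combine the per-chunk partial results with the same ufunc.
    return ufunc.reduce(np.array(partials))

if __name__ == "__main__":
    data = np.arange(1_000_000, dtype=np.int64)
    print(chunked_reduce(np.add, data), data.sum())
```

Threads can help here because NumPy generally releases the GIL inside ufunc loops for non-object dtypes. The final combine step is what requires associativity, so a ufunc that lacks that property would need a different strategy.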
Need to see if we can tell the NumPy loop-matching engine that we can directly compare int64 to uint64. (Maybe it already works; just have not tested it yet.)
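For context, the common type NumPy reports for int64 and uint64 is float64, which cannot represent every 64-bit integer exactly. A direct comparison can avoid that round trip. The sketch below is a pure NumPy illustration of the logic, not a registered ufunc loop; the name `less_int64_uint64` is hypothetical.

```python
import numpy as np

def less_int64_uint64(a, b):
    """Elementwise a < b for int64 a and uint64 b, with no float64 cast."""
    a = np.asarray(a, dtype=np.int64)
    b = np.asarray(b, dtype=np.uint64)
    # Any negative int64 is smaller than every uint64 value.
    negative = a < 0
    # For non-negative a, reinterpreting the bits as uint64 keeps the value,
    # so an unsigned comparison is exact. Negative entries are covered by
    # the mask above, so their reinterpreted value does not matter.
    return negative | (a.view(np.uint64) < b)

if __name__ == "__main__":
    print(np.result_type(np.int64, np.uint64))  # common type of the pair
    a = np.array([-1, 5, 2**62], dtype=np.int64)
    b = np.array([0, 5, 2**63], dtype=np.uint64)
    print(less_int64_uint64(a, b))
```

A native loop would apply the same two checks per element; the open question in the note above is whether the loop matcher can be told to select such a loop for the mixed int64/uint64 signature.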