pnumpy Create 'benchmarking' section of documentation

Per @mattip, create a 'benchmarking' page in the documentation. The page should include the following information:

instructions on how to set up and run the benchmark suite (via asv)
parameterization of the benchmark suite; in other words, what performance aspects are we looking to understand through the benchmarks? -- and how are we parameterizing the benchmarks to gather information to understand these?
wish list -- what things would we like to cover in the benchmark suite which we're currently not?
references / links to external materials: https://github.com/Quansight/numpy-threading-extensions/pull/107#discussion_r544365645_
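
As a starting point for the parameterization item above, here is a minimal sketch of how an asv benchmark class can be parameterized. asv reads the `params` and `param_names` attributes and runs each `time_*` method once per combination, passing the values to `setup` first. The class name, the choice of ufuncs, dtypes, and sizes are illustrative assumptions, not what the pnumpy suite currently uses.

```python
import numpy as np


class UfuncSuite:
    # Hypothetical parameter grid: asv runs every time_* method once per
    # (ufunc, dtype, size) combination.
    params = (["add", "sin", "abs"], ["int32", "float64"], [1_000, 1_000_000])
    param_names = ["ufunc", "dtype", "size"]

    def setup(self, ufunc, dtype, size):
        # Build the inputs outside the timed region.
        self.func = getattr(np, ufunc)
        rng = np.random.default_rng(0)
        self.a = (rng.random(size) * 100).astype(dtype)
        self.b = (rng.random(size) * 100).astype(dtype)
        # Binary ufuncs take two arrays, unary ufuncs take one.
        self.args = (self.a, self.b) if self.func.nin == 2 else (self.a,)

    def time_ufunc(self, ufunc, dtype, size):
        # Only this call is timed by asv.
        self.func(*self.args)
```

Keeping array construction in `setup` means the timing covers only the ufunc call, which is the part threading is expected to affect.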

Dec 17 '20 19:12 jack-pappas

I ran the benchmarks on an Intel machine after running sudo pyperf system tune, but did not see any improvement when activating multiple threads. Here are the machine.json and the compressed .asv/results directory.

{
    "arch": "x86_64",
    "cpu": "Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz",
    "machine": "benchmarker",
    "num_cpu": "8",
    "os": "Linux 4.15.0-74-generic",
    "ram": "65748452",
    "version": 1
}

benchmarker.tar.gz

Dec 18 '20 13:12 mattip

The benchmarks ran for 2 hours on this machine.

Dec 18 '20 13:12 mattip

@jack-pappas @tdimitri: any thoughts why I do not see an improvement?

Jan 06 '21 11:01 mattip

Matti, did you do...

import pnumpy as pn
pn.init()
pn.benchmark()

What numbers does it return? Also, there is now a parallel lexsort and a parallel sort.

Jan 06 '21 15:01 tdimitri

No, I followed the instructions in the benchmarks README:

asv run

Here is my result for pn.benchmark():

>>> pn.benchmark()
1000000 rows,bool,int8,int16,int32,int64,float32,float64,
a==b , 0.99, 1.00, 1.00, 1.15, 1.01, 1.15, 1.02,
a==5 , 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.02,
a+b, 1.01, 1.00, 1.00, 1.06, 1.01, 0.97, 1.00,
a+5, 1.13, 1.00, 1.01, 1.00, 1.07, 1.02, 1.05,
a/5, 1.00, 1.00, 1.00, 0.99, 1.00, 1.00, 1.00,
abs, 1.00, 1.00, 1.00, 0.93, 0.98, 1.00, 1.08,
isnan, 1.00, 1.01, 1.01, 1.00, 1.01, 1.02, 0.99,
sin, 1.00, 0.99, 1.00, 1.00, 1.00, 0.98, 1.00,
log, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00,
sum, 1.00, 1.00, 1.00, 1.00, 1.02, 1.00, 1.02,
min, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00,

Jan 06 '21 15:01 mattip

Ahh, hang on, after pn.init() it gets better:

>>> pn.init()
>>> pn.benchmark()
1000000 rows,bool,int8,int16,int32,int64,float32,float64,
a==b , 6.79, 2.58, 2.59, 3.29, 6.67, 2.45, 6.14,
a==5 , 4.71, 1.81, 1.87, 3.00, 4.69, 1.97, 2.64,
a+b, 9.37, 2.31, 2.46, 3.14, 9.44, 2.89, 9.20,
a+5, 4.12, 2.33, 2.16, 2.75, 4.23, 1.85, 4.78,
a/5, 0.72, 0.86, 0.87, 0.91, 0.70, 4.08, 6.99,
abs, 4.02, 5.83, 6.53, 3.16, 4.00, 9.85,11.18,
isnan, 0.79, 0.70, 0.80, 0.74, 0.80, 1.96, 2.73,
sin, 4.30, 3.88, 3.95, 8.81, 5.32,21.15,60.16,
log, 1.25, 2.13, 2.17, 1.30, 1.58, 6.39, 3.05,
sum, 8.28, 1.01, 1.04, 1.00, 9.61, 6.45, 5.44,
min, 3.65,41.85,41.73,31.00, 3.66, 1.93, 2.64,

Jan 06 '21 15:01 mattip

Why isn't that reflected in the ASV results?

Jan 06 '21 15:01 mattip

I will check with Jack and review his benchmark. I did not work with him on it, and I apologize for any confusion. The benchmarks are hard because we have not yet hooked the "initialization" functions (ones, zeros, arange, etc.). We also have not hooked the copy functions, copy with mask, etc., or the conversion functions. I spent the last 10 hours trying to figure out how to hook the conversion functions by calling PyArray_RegisterCastFunc, but it does not seem to work yet.

Your numbers above look good and are what I expected. The one dip is in integer division: it converts from int to float64 in the main thread, thus invalidating the other cores, which is why I am trying to hook more functions.
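
To make the dip easy to spot, here is a minimal sketch that parses the pn.benchmark() output posted above (the run after pn.init()) and lists every cell below 1.0. It assumes each cell is a speedup ratio, with values above 1.0 meaning the threaded run was faster; the function names are hypothetical helpers, not part of pnumpy.

```python
# Output of pn.benchmark() after pn.init(), copied from the thread.
RAW = """\
1000000 rows,bool,int8,int16,int32,int64,float32,float64,
a==b , 6.79, 2.58, 2.59, 3.29, 6.67, 2.45, 6.14,
a==5 , 4.71, 1.81, 1.87, 3.00, 4.69, 1.97, 2.64,
a+b, 9.37, 2.31, 2.46, 3.14, 9.44, 2.89, 9.20,
a+5, 4.12, 2.33, 2.16, 2.75, 4.23, 1.85, 4.78,
a/5, 0.72, 0.86, 0.87, 0.91, 0.70, 4.08, 6.99,
abs, 4.02, 5.83, 6.53, 3.16, 4.00, 9.85,11.18,
isnan, 0.79, 0.70, 0.80, 0.74, 0.80, 1.96, 2.73,
sin, 4.30, 3.88, 3.95, 8.81, 5.32,21.15,60.16,
log, 1.25, 2.13, 2.17, 1.30, 1.58, 6.39, 3.05,
sum, 8.28, 1.01, 1.04, 1.00, 9.61, 6.45, 5.44,
min, 3.65,41.85,41.73,31.00, 3.66, 1.93, 2.64,
"""


def parse_benchmark(raw):
    """Return {operation: {dtype: value}} from the comma-separated text."""
    lines = [ln for ln in raw.strip().splitlines() if ln.strip()]
    # Drop the empty cell produced by the trailing comma on every line.
    header = [c.strip() for c in lines[0].split(",") if c.strip()]
    dtypes = header[1:]  # header[0] is the "1000000 rows" label
    table = {}
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.split(",") if c.strip()]
        table[cells[0]] = dict(zip(dtypes, (float(c) for c in cells[1:])))
    return table


def slower_cells(table, threshold=1.0):
    """Return {operation: [dtypes]} for every cell below the threshold."""
    result = {}
    for op, row in table.items():
        slow = [dt for dt, value in row.items() if value < threshold]
        if slow:
            result[op] = slow
    return result


table = parse_benchmark(RAW)
print(slower_cells(table))
# {'a/5': ['bool', 'int8', 'int16', 'int32', 'int64'],
#  'isnan': ['bool', 'int8', 'int16', 'int32', 'int64']}
```

On these numbers the cells below 1.0 are a/5 and isnan, in both cases only for the bool and integer dtypes.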

Ideally divide would "convert and divide" on the fly... but we also cannot hook that right now.

On a good note, there is pn.getitem(), which acts like a[b] when a is an array and b is a boolean or fancy index array. It runs in parallel. On another good note, I have reviewed so much low-level NumPy internal code that I understand it better and can at least suggest hooks.
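
For reference, the two a[b] cases described above look like this in plain NumPy. This sketch only shows the indexing semantics that pn.getitem() is said to mirror; it does not call pn.getitem() itself, since its exact signature is not shown in this thread.

```python
import numpy as np

a = np.arange(10) * 10           # [0, 10, 20, ..., 90]

# Case 1: b is a boolean array the same length as a.
mask = a > 50
by_mask = a[mask]                # keeps elements where mask is True

# Case 2: b is an integer ("fancy") index array; repeats are allowed.
fancy = np.array([0, 3, 3, 9])
by_index = a[fancy]

print(by_mask)    # [60 70 80 90]
print(by_index)   # [ 0 30 30 90]
```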

Jan 06 '21 15:01 tdimitri

We're calling pn.initialize() within the ASV benchmarks: https://github.com/Quansight/numpy-threading-extensions/blob/97c60ed86fa105e18e1b5d2373576694863787be/benchmarks/bench_ufunc.py#L19

The current version of pn.initialize() just calls pn.init(): https://github.com/Quansight/numpy-threading-extensions/blob/97c60ed86fa105e18e1b5d2373576694863787be/src/pnumpy/init.py#L56

Jan 06 '21 16:01 jack-pappas

@mattip One thing that could be causing this: I ran the latest benchmark code on Windows, and you're running it on Linux. asv supports running benchmarks in individual subprocesses, and (I'm speculating) it may be doing that by default on Linux but not on Windows, or asv may be defaulting to a different approach on Windows vs. Linux. If that's the case, maybe we need to move the pn.initialize() call to the top of the bench_ufunc.py file, or have pnumpy auto-initialize when imported, or detect when it has been forked (after pn.initialize() has been called) and re-initialize.
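
The "detect a fork and re-initialize" idea could be sketched as below. This is a hypothetical stand-in class, not pnumpy code: it records the PID that performed the initialization and repeats it when the current PID differs. That matters because after a fork only the calling thread exists in the child process, so worker threads started in the parent are gone.

```python
import os


class LazyThreadedRuntime:
    """Hypothetical stand-in for a library that needs per-process setup."""

    def __init__(self):
        self._owner_pid = None   # PID of the process that last initialized
        self.init_count = 0      # number of times setup actually ran

    def _start_workers(self):
        # Placeholder for the real setup work (in pnumpy's case this
        # would be whatever pn.initialize() does).
        self.init_count += 1

    def ensure_initialized(self):
        # A PID mismatch means we were never initialized, or we are in a
        # forked child that inherited the parent's "initialized" state.
        pid = os.getpid()
        if self._owner_pid != pid:
            self._start_workers()
            self._owner_pid = pid
        return self


runtime = LazyThreadedRuntime()
runtime.ensure_initialized()   # first call in this process: runs setup
runtime.ensure_initialized()   # same process: no-op
```

Calling a check like this at the entry of each hooked function, or from the benchmark's setup method, would make the result independent of how asv launches the benchmark process.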

Jan 06 '21 20:01 jack-pappas