NE3: Sum / max reduce operators?
Hey!
It looks like the reduce opcodes ('sum', 0, 'f') and ('max', 0, 'f') aren't present in ne3.ne3compiler.OPTABLE. So an expression like max(x), where x has shape (1024,), isn't available as a reduction; max is only valid as the element-wise max(x, y).
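For comparison, here is the same distinction written out in plain NumPy; this is only an illustration of the two behaviours, not anything NE3-specific:

import numpy as np

x = np.random.random(1024)
y = np.random.random(1024)

# Element-wise: what max(x, y) maps to today; the result keeps x's shape.
elementwise = np.maximum(x, y)   # shape (1024,)

# Reduction: what max(x) would need the ('max', 0, 'f') reduce opcode for;
# the whole array collapses to a single scalar.
reduced = x.max()

print(elementwise.shape, reduced)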
Would love your thoughts; I might be missing something here, but I've had a lot of fun getting this far.
Regards! Ian
Yes, they haven't been implemented yet. This reminds me that I wrote up a bunch of issues in London but didn't post them.
How to handle reductions (e.g. mean, sum, prod, std) in NE3 remains an open question. In NE2 they could only be performed as the last operation, and then they were only single-threaded. Due to these limitations they did not contribute a lot of value to the library, in my opinion.
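For concreteness, a small numexpr 2.x sketch of that restriction (the exact exception and message may vary between versions):

import numpy as np
import numexpr as ne

x = np.arange(12, dtype=np.float64).reshape(3, 4)

# Accepted: the reduction is the outermost (last) operation in the expression.
print(ne.evaluate('sum(x**2, axis=0)'))

# Rejected: the reduction result feeds further arithmetic, so this part of the
# computation has to fall back to NumPy/Python instead.
try:
    ne.evaluate('sum(x**2, axis=0) + 1')
except Exception as err:
    print('rejected:', err)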
For reductions to be of significance to NE3 we should have the ability to perform them at any point (especially because NE3 supports multi-line operations) and thread them efficiently.
Currently the pre-allocation of the output array via broadcasting is fairly fragile. For reductions, a prerequisite would be to verify the broadcasting through the entire program in advance. This likely needs to be done in C inside NumExpr_run via the NumPy API, because Python-level calls to np.broadcast generate new Python objects and are therefore quite expensive (~8 µs per call). The actual C code that np.broadcast uses is relatively simple and could be copied directly.
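As a rough Python-level illustration of that pre-verification pass (the names program, namespace and the helper below are purely illustrative, not NE3 internals; the real thing would live in C inside NumExpr_run to dodge the per-call np.broadcast cost):

import numpy as np

def broadcast_shapes_for_program(program, namespace):
    # Walk the program front-to-back, recording the broadcast output shape of
    # every assignment before any output array is allocated.
    shapes = {}
    for target, operand_names in program:             # e.g. ('t0', ('x', 'y'))
        operands = []
        for name in operand_names:
            if name in namespace:                      # a real input array
                operands.append(namespace[name])
            else:                                      # an earlier temporary
                operands.append(np.broadcast_to(0.0, shapes[name]))
        shapes[target] = np.broadcast(*operands).shape
    return shapes

x = np.ones((1024, 1))
y = np.ones((1, 8))
program = [('t0', ('x', 'y')),    # t0  = f(x, y)
           ('out', ('t0', 'x'))]  # out = g(t0, x)
print(broadcast_shapes_for_program(program, {'x': x, 'y': y}))
# -> {'t0': (1024, 8), 'out': (1024, 8)}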
Yes, they haven't been implemented yet.
Makes sense, thanks!
For reductions to be of significance to NE3 we should have the ability to perform them at any point (especially because NE3 supports multi-line operations) and thread them efficiently.
Agreed. I think threading is one of the performance wins I'm looking for. Multi-line expressions will probably come into play once I'm no longer switching back to native Python for my reduce operations. It's easy to conceptualise how having N cores each calculate the max of 1/N of an array, followed by a max over the length-N array of partial results, would offer a speedup, but based on the experiments below this requires some finesse to avoid tripping over Linux and Python implementation details.
Here are some synthetic benchmarks looking at different ways of taking the max of a 2**30-element array of float64s. The format is (label, output, duration of one run in seconds), as produced by the script below. I disabled dynamic CPU control and got fairly stable measurements.
('single threaded map-reduce', 0.99999999779209781, 0.39780282974243164)
('numexpr', array(0.9999999977920978), 2.0326719284057617)
('joblib insanity', 0.99999999779209781, 0.3058278560638428)
('numpy', 0.99999999779209781, 0.37674617767333984)
Joblib is pretty kludgy here; I was just validating that multicore map-reduce offers some measurable savings. I'm pretty sure it copies my array per child thread.
import time

import joblib
import numexpr
import numpy as np

# 2**30 float64s (~8 GiB) of uniform random data.
xs = np.random.random(2**30).astype('float64')

# Chunk boundaries that split the array into n contiguous slices.
spans = lambda n: np.linspace(0, xs.shape[0], num=n + 1, dtype=int)

# Sequential map-reduce: max of each slice, then max of the partial maxima.
map_reduce = lambda n: np.max([np.max(xs[x:y])
                               for x, y in zip(spans(n), spans(n)[1:])])

# Same two-stage reduction, but with the slice maxima computed by joblib
# worker threads.
divide_and_conquer = lambda n: np.max(
    joblib.Parallel(n_jobs=n, backend='threading')(
        joblib.delayed(np.max)(xs[x:y])
        for x, y in zip(spans(n), spans(n)[1:])))

# Warm up the numexpr expression cache so compilation isn't timed.
numexpr.evaluate('max(xs)')
numexpr.evaluate('max(xs)')

print("starting benchmarks")
start = time.time()
print('single threaded map-reduce', map_reduce(4), time.time() - start)
start = time.time()
print('numexpr', numexpr.evaluate('max(xs)'), time.time() - start)
start = time.time()
print('joblib insanity', divide_and_conquer(4), time.time() - start)
start = time.time()
print('numpy', np.max(xs), time.time() - start)
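For what it's worth, the same two-stage idea can be written with just the standard library; a sketch that assumes NumPy releases the GIL inside its reduction loop, so the slice views (no copies) actually overlap across threads:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def threaded_max(xs, n_workers=4):
    # Contiguous chunk boundaries; slicing produces views, so nothing is
    # copied into the worker threads.
    edges = np.linspace(0, xs.shape[0], num=n_workers + 1, dtype=int)
    chunks = [xs[a:b] for a, b in zip(edges, edges[1:])]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(np.max, chunks))
    # Second stage: reduce the n_workers partial maxima on a single core.
    return np.max(partials)

xs = np.random.random(2**26)
print(threaded_max(xs), np.max(xs))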
It's been suggested in a number of places that we offer the ability to control the accumulator dtype (#236, #271), so this could be offered as a keyword option in addition to axis.
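To make the accumulator-dtype idea concrete, here is the existing NumPy spelling of it; the NE3 keyword shown in the comment is hypothetical, not an existing API:

import numpy as np

x = np.random.random(10**7).astype(np.float32)

# Default accumulator: the reduction runs in the input's float32 precision.
print(np.sum(x))

# Widened accumulator: the kind of control asked for in #236 / #271.
print(np.sum(x, dtype=np.float64))

# A hypothetical NE3 spelling alongside the proposed axis keyword:
#   ne3.evaluate('sum(x, axis=0)', dtype='float64')   # not an existing API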