NE3: Sum / max reduce operators?
Hey!
It looks like the reduce opcodes ('sum', 0, 'f') and ('max', 0, 'f') aren't present in ne3.ne3compiler.OPTABLE. So an expression like max(x), where x has shape (1024,), isn't available as a reduction; max is only valid as the element-wise max(x, y).
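For comparison, here is the same distinction written out in plain NumPy; this is only an illustration of the two behaviours, not anything NE3-specific:

import numpy as np

x = np.random.random(1024)
y = np.random.random(1024)

# Element-wise: what max(x, y) maps to today; the result keeps x's shape.
elementwise = np.maximum(x, y)   # shape (1024,)

# Reduction: what max(x) would need the ('max', 0, 'f') reduce opcode for;
# the whole array collapses to a single scalar.
reduced = x.max()

print(elementwise.shape, reduced)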
Would love your thoughts; I might be missing something here, but I've had a lot of fun getting this far.
Regards! Ian
Yes, they haven't been implemented yet. This reminds me that I wrote up a bunch of issues in London but didn't post them.
How to handle reductions (e.g. mean, sum, prod, std) in NE3 remains an open question. In NE2 they could only be performed as the last operation, and then they were only single-threaded. Due to these limitations they did not contribute a lot of value to the library, in my opinion.
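For concreteness, a small numexpr 2.x sketch of that restriction (the exact exception and message may vary between versions):

import numpy as np
import numexpr as ne

x = np.arange(12, dtype=np.float64).reshape(3, 4)

# Accepted: the reduction is the outermost (last) operation in the expression.
print(ne.evaluate('sum(x**2, axis=0)'))

# Rejected: the reduction result feeds further arithmetic, so this part of the
# computation has to fall back to NumPy/Python instead.
try:
    ne.evaluate('sum(x**2, axis=0) + 1')
except Exception as err:
    print('rejected:', err)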
For reductions to be of significance to NE3 we should have the ability to perform them at any point (especially because NE3 supports multi-line operations) and thread them efficiently.
Currently the pre-allocation of the output array via broadcasting is fairly fragile. For reductions, a prerequisite would be to verify the broadcasting through the entire program in advance. This likely needs to be done in C inside NumExpr_run via the NumPy API, because Python-level calls to np.broadcast generate new Python objects and are therefore quite expensive (~8 µs per call). The actual C code that np.broadcast uses is relatively simple and could be copied directly.
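As a rough Python-level illustration of that pre-verification pass (the names program, namespace and the helper below are purely illustrative, not NE3 internals; the real thing would live in C inside NumExpr_run to dodge the per-call np.broadcast cost):

import numpy as np

def broadcast_shapes_for_program(program, namespace):
    # Walk the program front-to-back, recording the broadcast output shape of
    # every assignment before any output array is allocated.
    shapes = {}
    for target, operand_names in program:             # e.g. ('t0', ('x', 'y'))
        operands = []
        for name in operand_names:
            if name in namespace:                      # a real input array
                operands.append(namespace[name])
            else:                                      # an earlier temporary
                operands.append(np.broadcast_to(0.0, shapes[name]))
        shapes[target] = np.broadcast(*operands).shape
    return shapes

x = np.ones((1024, 1))
y = np.ones((1, 8))
program = [('t0', ('x', 'y')),    # t0  = f(x, y)
           ('out', ('t0', 'x'))]  # out = g(t0, x)
print(broadcast_shapes_for_program(program, {'x': x, 'y': y}))
# -> {'t0': (1024, 8), 'out': (1024, 8)}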
Yes, they haven't been implemented yet.
Makes sense, thanks!
For reductions to be of significance to NE3 we should have the ability to perform them at any point (especially because NE3 supports multi-line operations) and thread them efficiently.
Agreed. I think threading is one of the performance wins I'm looking for. Multi-line expressions will probably come into play once I'm no longer switching back to native Python for my reduce operations. It's easy to conceptualise how having N cores each calculate the max of 1/N of an array, followed by a max over the length-N array of partial results, would offer a speedup, but based on the experiments below this requires some finesse to avoid tripping over Linux and Python implementation details.
Here are some synthetic benchmarks looking at different ways of taking the max of a 2**30-element array of float64s. The format is (label, output, duration of one run in seconds), as produced by the script below. I disabled dynamic CPU control and got fairly stable measurements.
('single threaded map-reduce', 0.99999999779209781, 0.39780282974243164)
('numexpr', array(0.9999999977920978), 2.0326719284057617)
('joblib insanity', 0.99999999779209781, 0.3058278560638428)
('numpy', 0.99999999779209781, 0.37674617767333984)
Joblib is pretty kludgy here; I was just validating that multicore map-reduce offers some measurable savings. I'm pretty sure it copies my array per child thread.
import time

import joblib
import numexpr
import numpy as np

# 2**30 float64s (~8 GiB) of uniform random data.
xs = np.random.random(2**30).astype('float64')

# Chunk boundaries that split the array into n contiguous slices.
spans = lambda n: np.linspace(0, xs.shape[0], num=n + 1, dtype=int)

# Sequential map-reduce: max of each slice, then max of the partial maxima.
map_reduce = lambda n: np.max([np.max(xs[x:y])
                               for x, y in zip(spans(n), spans(n)[1:])])

# Same two-stage reduction, but with the slice maxima computed by joblib
# worker threads.
divide_and_conquer = lambda n: np.max(
    joblib.Parallel(n_jobs=n, backend='threading')(
        joblib.delayed(np.max)(xs[x:y])
        for x, y in zip(spans(n), spans(n)[1:])))

# Warm up the numexpr expression cache so compilation isn't timed.
numexpr.evaluate('max(xs)')
numexpr.evaluate('max(xs)')

print("starting benchmarks")
start = time.time()
print('single threaded map-reduce', map_reduce(4), time.time() - start)
start = time.time()
print('numexpr', numexpr.evaluate('max(xs)'), time.time() - start)
start = time.time()
print('joblib insanity', divide_and_conquer(4), time.time() - start)
start = time.time()
print('numpy', np.max(xs), time.time() - start)
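For what it's worth, the same two-stage idea can be written with just the standard library; a sketch that assumes NumPy releases the GIL inside its reduction loop, so the slice views (no copies) actually overlap across threads:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def threaded_max(xs, n_workers=4):
    # Contiguous chunk boundaries; slicing produces views, so nothing is
    # copied into the worker threads.
    edges = np.linspace(0, xs.shape[0], num=n_workers + 1, dtype=int)
    chunks = [xs[a:b] for a, b in zip(edges, edges[1:])]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(np.max, chunks))
    # Second stage: reduce the n_workers partial maxima on a single core.
    return np.max(partials)

xs = np.random.random(2**26)
print(threaded_max(xs), np.max(xs))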
It's been suggested in a number of places that we offer the ability to control the accumulator dtype (#236, #271), so this could be offered as a keyword option in addition to axis.
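To make the accumulator-dtype idea concrete, here is the existing NumPy spelling of it; the NE3 keyword shown in the comment is hypothetical, not an existing API:

import numpy as np

x = np.random.random(10**7).astype(np.float32)

# Default accumulator: the reduction runs in the input's float32 precision.
print(np.sum(x))

# Widened accumulator: the kind of control asked for in #236 / #271.
print(np.sum(x, dtype=np.float64))

# A hypothetical NE3 spelling alongside the proposed axis keyword:
#   ne3.evaluate('sum(x, axis=0)', dtype='float64')   # not an existing API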