
Use cupy / numba to speed up computation?

Open janpipek opened this issue 7 years ago • 6 comments

Use https://github.com/cupy/cupy as computing backend? Given the API similarity to numpy, it should be trivial.

Investigate if numba can lead to any speed up.
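To illustrate the cupy idea: since cupy mirrors much of the numpy API, a backend could be picked at import time and the same histogramming code run on either. This is only a sketch of that pattern (the `histogram_on_backend` name is made up for illustration, not physt API), and cupy is treated as optional so the code falls back to numpy without a GPU:

```python
import numpy as np

# Hedged sketch: select an array backend once, use it everywhere after.
# Without cupy (or a GPU), xp is simply numpy and everything still works.
try:
    import cupy as xp
except ImportError:
    xp = np

def histogram_on_backend(values, edges):
    """Compute a 1D histogram with whichever array backend is available."""
    hist, _ = xp.histogram(xp.asarray(values), bins=xp.asarray(edges))
    # With cupy, hist lives on the GPU; cupy.asnumpy(hist) would copy it back.
    return hist

values = np.random.normal(size=100_000)
edges = np.linspace(-3.0, 3.0, 101)
counts = histogram_on_backend(values, edges)  # 100 bins
```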

janpipek avatar Oct 03 '18 20:10 janpipek

https://github.com/astrofrog/fast-histogram also looks promising

janpipek avatar Oct 03 '18 20:10 janpipek

By the way, I looked at speed a bit, and I think there's something to be gained! I had 128 one-million-element arrays I needed 2D histograms from, and I had switched to physt to make it simple to rebin and sum them (which it did!). But I found that doing the binning in numpy and then handing the result to physt gave 3x better performance, and redoing it in numba gave a further 8-9x. I hadn't tried fast-histogram before, but it looks pretty similar to the numba version.

Here are the timings. I originally ran this on a Xeon Phi, which has 64 very slow cores and 4x hyperthreading; obviously, none of these solutions uses it very well (to be honest, not much does). I reran it on my laptop too.

import numpy as np
import numba
import math
from physt import h2
import fast_histogram 
from physt.histogram_nd import Histogram2D
bins=(100, 100)
ranges=((-1,1),(-1,1))
bins = np.asarray(bins).astype(np.int64)
ranges = np.asarray(ranges).astype(np.float64)
    
edges = (np.linspace(*ranges[0,:], bins[0]+1),
         np.linspace(*ranges[1,:], bins[1]+1))
# This is 8-9 times faster than np.histogram2d
# An explicit signature could be passed to njit so it gets precompiled
# In theory, this could even be threaded (nogil!)
@numba.njit(nogil=True)
def fast_histogram2d(tracks, bins, ranges):
    H = np.zeros((bins[0], bins[1]), dtype=np.uint64)
    delta = 1/((ranges[:,1] - ranges[:,0]) / bins)
    for t in range(tracks.shape[1]):
        i = math.floor((tracks[0,t] - ranges[0,0]) * delta[0])
        j = math.floor((tracks[1,t] - ranges[1,0]) * delta[1])
        if 0 <= i < bins[0] and 0 <= j < bins[1]:
            H[i,j] += 1
    return H
vals = np.random.normal(size=[2, 1000000]).astype(np.float32)
%%timeit
h2(*vals, bins=edges)
Xeon Phi: 1.14 s ± 556 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
MacBook Pro: 293 ms ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
H, *ledges = np.histogram2d(*vals, bins=bins, range=ranges)
Histogram2D(ledges, H)
Xeon Phi: 437 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
MacBook Pro: 114 ms ± 7.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit

H = fast_histogram.histogram2d(*vals, bins=100, range=((-1,1),(-1,1)))
Histogram2D(edges, H)
Xeon Phi: 53.7 ms ± 88.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
MacBook Pro: 10.4 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit

H = fast_histogram2d(vals, bins=bins, ranges=ranges)
Histogram2D(edges, H)
Xeon Phi: 41.6 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
MacBook Pro: 10.2 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

henryiii avatar Oct 31 '18 20:10 henryiii

I also think it would be helpful to add the numpy-to-physt example to the docs, since there's a clear speed benefit.

henryiii avatar Oct 31 '18 20:10 henryiii
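A minimal version of that numpy-to-physt recipe could look like the following. It follows the snippets earlier in the thread (same bins, ranges, and `Histogram2D(edges, H)` construction); the physt import is guarded so the numpy part stands on its own:

```python
import numpy as np

vals = np.random.normal(size=(2, 100_000)).astype(np.float32)
bins = (100, 100)
ranges = ((-1, 1), (-1, 1))

# Let numpy do the binning (fast), then hand the result to physt,
# which keeps rebinning and summing convenient.
H, xedges, yedges = np.histogram2d(*vals, bins=bins, range=ranges)

try:
    from physt.histogram_nd import Histogram2D
    hist = Histogram2D([xedges, yedges], H)
except ImportError:
    hist = None  # physt not installed; H and the edges are still usable
```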

Extra example: with numba, you can also release the GIL, though for some reason the threading only pays off on my MacBook or a normal Xeon:

from concurrent.futures import ThreadPoolExecutor
from functools import partial
%%timeit
splits = 4
with ThreadPoolExecutor(max_workers=splits) as pool:
    chunk = vals.shape[1] // splits
    chunks = (vals[:,i*chunk:(i+1)*chunk] for i in range(splits))
    f = partial(fast_histogram2d, bins=bins, ranges=ranges)
    results = pool.map(f, chunks)
    results = sum(results)
MacBook Pro: 4.23 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
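The chunk-and-sum pattern above doesn't actually depend on numba, since histogram counts are additive across chunks. Here is a self-contained variant using np.histogram2d per chunk (note that numpy holds the GIL for much of its work, so the real threading speedup comes from nogil-compiled kernels like the numba one above; the `hist_chunk` helper is made up for this sketch):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from functools import partial

vals = np.random.normal(size=(2, 400_000)).astype(np.float32)
bins = (100, 100)
ranges = ((-1, 1), (-1, 1))

def hist_chunk(chunk, bins, ranges):
    # Histogram one slice of the data; counts from slices simply add up.
    H, *_ = np.histogram2d(*chunk, bins=bins, range=ranges)
    return H

splits = 4
chunk = vals.shape[1] // splits
chunks = (vals[:, i * chunk:(i + 1) * chunk] for i in range(splits))
with ThreadPoolExecutor(max_workers=splits) as pool:
    f = partial(hist_chunk, bins=bins, ranges=ranges)
    total = sum(pool.map(f, chunks))  # elementwise sum of the partial counts
```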

henryiii avatar Oct 31 '18 21:10 henryiii

Hi Henry,

thanks a lot for the ideas. I know that physt is a bit slow (especially for cases that are trivial in numpy). In the new design (if it ever comes), I aim to distinguish the cases that require a special approach (various clever binnings) from those that don't (where the computation can be passed to numpy directly). By the way, a very basic kick-off for supporting various backends (and their prioritisation) has started in the branch https://github.com/janpipek/physt/tree/v04-engines/.
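The split described above could be sketched roughly like this. This is purely illustrative (the `histogram` dispatcher is hypothetical, not the actual v04-engines code): a regular binning is detected and delegated to numpy's fast path, while irregular edges fall through to a generic path:

```python
import numpy as np

def histogram(data, binning):
    """Hypothetical engine dispatch: delegate regular binnings to numpy,
    keep a generic path for irregular edges."""
    edges = np.asarray(binning, dtype=float)
    widths = np.diff(edges)
    if np.allclose(widths, widths[0]):
        # Regular binning: numpy's optimized uniform-bin code path
        counts, _ = np.histogram(np.asarray(data), bins=len(edges) - 1,
                                 range=(edges[0], edges[-1]))
    else:
        # Irregular binning: generic searchsorted-based path
        data = np.asarray(data)
        idx = np.searchsorted(edges, data, side="right") - 1
        mask = (idx >= 0) & (idx < len(edges) - 1)
        counts = np.bincount(idx[mask], minlength=len(edges) - 1)
    return counts
```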

I'd like to incorporate numba as well (as an optional dependency - I don't want it as a hard one).

Unfortunately, I won't be able to touch the code until the end of November (or even take part in a discussion about it). I am putting this issue in my internal queue and will get back to it (and to you) immediately afterwards. Sorry for the unavoidable delay :-(

Cheers, Jan

janpipek avatar Nov 01 '18 13:11 janpipek

Here's my post on histogram performance: https://iscinumpy.gitlab.io/post/histogram-speeds-in-python/ (all examples eventually produce a Physt histogram)

henryiii avatar Nov 01 '18 14:11 henryiii