Use cupy / numba to speed up computation?
Use https://github.com/cupy/cupy as computing backend? Given the API similarity to numpy, it should be trivial.
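Since CuPy mirrors the NumPy API, a backend could in principle be swapped in behind the same code path. A minimal sketch of the idea (the `xp` alias and `histogram2d_backend` helper are hypothetical, not part of physt or CuPy), falling back to NumPy when CuPy is not installed:

```python
# Sketch only: treat CuPy as a drop-in NumPy replacement for histogramming.
# `xp` is a common alias for "whichever array module is in use";
# `histogram2d_backend` is a hypothetical helper, not part of physt.
import numpy as np

try:
    import cupy as xp  # GPU backend with a NumPy-compatible API
except ImportError:
    xp = np            # CPU fallback

def histogram2d_backend(x, y, bins=100, ranges=((-1, 1), (-1, 1))):
    # The same call works on both backends thanks to the shared API.
    H, xedges, yedges = xp.histogram2d(x, y, bins=bins, range=ranges)
    return H, xedges, yedges

x = xp.asarray(np.random.normal(size=10000))
y = xp.asarray(np.random.normal(size=10000))
H, xe, ye = histogram2d_backend(x, y)
```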
Investigate if numba can lead to any speed up.
https://github.com/astrofrog/fast-histogram also looks promising
By the way, I looked at speed a bit, and I think there's something to be gained! I had 128 one-million-element arrays I needed 2D histograms from, and I had switched to physt to make it simple to rebin and sum them (which it did!). But I found that computing the histogram with numpy and then handing it to physt gave 3x better performance, and redoing it in numba gave 8-9x more. I hadn't tried fast-histogram, but it looks pretty similar to my numba version.
Here are the timings. I originally ran this on a Xeon Phi, which has 64 very slow cores with 4x hyperthreading; obviously, none of these solutions uses it very well (to be honest, not much does). I reran it on my laptop too.
import numpy as np
import numba
import math
from physt import h2
import fast_histogram
from physt.histogram_nd import Histogram2D
bins=(100, 100)
ranges=((-1,1),(-1,1))
bins = np.asarray(bins).astype(np.int64)
ranges = np.asarray(ranges).astype(np.float64)
edges = (np.linspace(*ranges[0, :], bins[0] + 1),
         np.linspace(*ranges[1, :], bins[1] + 1))
# This is 8-9 times faster than np.histogram2d
# (an explicit signature could be passed to njit so it precompiles)
# In theory, this could even be threaded (nogil!)
@numba.njit(nogil=True)
def fast_histogram2d(tracks, bins, ranges):
    H = np.zeros((bins[0], bins[1]), dtype=np.uint64)
    delta = 1 / ((ranges[:, 1] - ranges[:, 0]) / bins)
    for t in range(tracks.shape[1]):
        i = math.floor((tracks[0, t] - ranges[0, 0]) * delta[0])
        j = math.floor((tracks[1, t] - ranges[1, 0]) * delta[1])
        if 0 <= i < bins[0] and 0 <= j < bins[1]:
            H[i, j] += 1
    return H
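As a sanity check on the binning arithmetic in the kernel above, here is the same index computation vectorized in plain NumPy (`histogram2d_numpy` is a hypothetical helper, not part of physt or fast-histogram); away from bin edges it agrees with np.histogram2d, and it needs no numba install:

```python
# Same index arithmetic as the numba kernel above, vectorized with NumPy.
import numpy as np

def histogram2d_numpy(tracks, bins, ranges):
    delta = 1 / ((ranges[:, 1] - ranges[:, 0]) / bins)
    i = np.floor((tracks[0] - ranges[0, 0]) * delta[0]).astype(np.int64)
    j = np.floor((tracks[1] - ranges[1, 0]) * delta[1]).astype(np.int64)
    mask = (0 <= i) & (i < bins[0]) & (0 <= j) & (j < bins[1])
    H = np.zeros((bins[0], bins[1]), dtype=np.uint64)
    np.add.at(H, (i[mask], j[mask]), 1)  # unbuffered in-place accumulation
    return H

# Same bins/ranges as above; a fresh sample so nothing else is clobbered.
bins = np.asarray((100, 100)).astype(np.int64)
ranges = np.asarray(((-1, 1), (-1, 1))).astype(np.float64)
sample = np.random.normal(size=(2, 100000)).astype(np.float32)
H = histogram2d_numpy(sample, bins, ranges)
```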
vals = np.random.normal(size=[2, 1000000]).astype(np.float32)
%%timeit
h2(*vals, bins=edges)
Xeon Phi: 1.14 s ± 556 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
MacBook Pro: 293 ms ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
H, *ledges = np.histogram2d(*vals, bins=bins, range=ranges)
Histogram2D(ledges, H)
Xeon Phi: 437 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
MacBook Pro: 114 ms ± 7.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
H = fast_histogram.histogram2d(*vals, bins=100, range=((-1,1),(-1,1)))
Histogram2D(edges, H)
Xeon Phi: 53.7 ms ± 88.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
MacBook Pro: 10.4 ms ± 1.8 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
H = fast_histogram2d(vals, bins=bins, ranges=ranges)
Histogram2D(edges, H)
Xeon Phi: 41.6 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
MacBook Pro: 10.2 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
I also think it would be helpful to add the numpy-to-physt example to the docs, since there's a speed benefit.
Extra example: with numba you can also release the GIL, though for some reason the threading only helps on my MacBook or a normal Xeon:
from concurrent.futures import ThreadPoolExecutor
from functools import partial
%%timeit
splits = 4
with ThreadPoolExecutor(max_workers=splits) as pool:
    chunk = vals.shape[1] // splits
    chunks = (vals[:, i * chunk:(i + 1) * chunk] for i in range(splits))
    f = partial(fast_histogram2d, bins=bins, ranges=ranges)
    results = pool.map(f, chunks)
results = sum(results)
MacBook Pro: 4.23 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
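The threaded version relies on histograms being additive over chunks of the data: histogramming each chunk and summing the counts gives the same result as a single pass. That identity can be sanity-checked with plain NumPy, no threads or numba required (variable names here are illustrative only):

```python
# Check that chunked histogramming + summing equals a one-pass histogram.
import numpy as np

data = np.random.normal(size=(2, 100000))
n_bins, bounds = (100, 100), ((-1, 1), (-1, 1))

# One-pass histogram of the full data set.
H_full, *_ = np.histogram2d(*data, bins=n_bins, range=bounds)

# Histogram each chunk separately, then sum the counts.
splits = 4
chunk = data.shape[1] // splits
H_sum = sum(
    np.histogram2d(*data[:, k * chunk:(k + 1) * chunk],
                   bins=n_bins, range=bounds)[0]
    for k in range(splits)
)
```

Since the counts are integers stored exactly in float64, the equality is exact, which is what makes `sum(results)` in the threaded version above safe.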
Hi Henry,
thanks a lot for the ideas. I know that physt is a bit slow (especially for cases that are trivial with numpy). In the new design (if it ever comes), I aim to distinguish cases that require a special approach (various clever binnings) from those that don't (where the computation can be passed to numpy directly). By the way, a very basic kick-off for supporting various backends (and prioritising between them) has started in the branch https://github.com/janpipek/physt/tree/v04-engines/
I'd like to incorporate numba as well, as an optional dependency - I don't want it to be a hard one.
Unfortunately, I won't be able to touch the code (or even join a discussion about it) until the end of November. I am putting this issue in my internal queue and will get back to it (and you) immediately afterwards. Sorry for the unavoidable delay :-(
Cheers, Jan
Here's my post on histogram performance: https://iscinumpy.gitlab.io/post/histogram-speeds-in-python/ (all examples eventually produce a Physt histogram)