
Performance issue with NaN functions

Open jeshi96 opened this issue 1 year ago • 7 comments

Hi,

I've been testing dpnp on CPU with some standard NaN functions (nan_to_num, nansum), and my performance results seem to show that dpnp is quite slow single-threaded compared to NumPy. I was wondering if you had any insight into why this might be the case (e.g., whether there is something specific dpnp does in handling NaNs), and/or if there is a fix for this.

Here are some of the scalability plots (number of threads vs. running time) comparing dpnp and NumPy (and Numba), for nan_to_num and nansum respectively:

[Plot: number of threads vs. running time for nan_to_num (dpnp vs. NumPy vs. Numba)]

[Plot: number of threads vs. running time for nansum (dpnp vs. NumPy vs. Numba)]

While the dpnp scaling looks good, its single-threaded performance in particular is almost an order of magnitude worse than NumPy's. The test environment is an Intel Xeon Platinum 8380 processor with 80 threads. Both tests were run on arrays of 6.4e9 float32 values, taking the median over 10 runs after discarding the first (warm-up) run, so the cache is warm.

Here is the code used to generate the input array for all of these tests:

import numpy as np
from numpy.random import default_rng

N = 80000
rng = default_rng(42)
# 6.4e9 float32 values, uniform in [0, 1000)
array_1 = rng.uniform(0, 1000, size=(N * N,)).astype(np.float32)

# Scatter N // 10 NaNs at random positions
# (note: this uses the global NumPy RNG, not the seeded `rng` above)
N_nan = N // 10
nan_choice = np.random.choice(N * N, N_nan, replace=False)
array_1[nan_choice] = np.nan
array_1 = array_1.reshape((N, N))

For dpnp, I ran array_1 = dpnp.asarray(array_1, device="cpu") before starting the tests (not included in the timing results). The timings measured only array_out = np.nan_to_num(array_1) or array_out = dpnp.nan_to_num(array_1) (and similarly for nansum).
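The methodology above (one discarded warm-up run, then the median of 10 timed runs) can be sketched as follows. This is a hypothetical reconstruction of the harness, not the exact script used; it is shown with NumPy and a small array so it runs quickly, and the dpnp variant would simply swap in dpnp.nan_to_num on the dpnp array.

```python
import time
import numpy as np

def time_median(fn, runs=10):
    """Median wall-clock time of fn over `runs` calls, after one warm-up."""
    fn()  # warm-up run, discarded so the cache is warm
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return float(np.median(times))

# Tiny stand-in for the 6.4e9-element array described above
arr = np.array([1.0, np.nan, 3.0], dtype=np.float32)
elapsed = time_median(lambda: np.nan_to_num(arr))
```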

Any help is much appreciated -- thanks!

Best, Jessica

jeshi96 avatar Oct 04 '24 15:10 jeshi96

@jeshi96 , could you please specify what dpnp version you used? The output of python -c "import dpnp; print(dpnp.__version__)" should be enough.

antonwolfy avatar Oct 08 '24 09:10 antonwolfy

I'm using 0.17.0dev0+14.g8cab1af7f1a, from commit 8cab1af7f1a (built from source). I'm also using dpctl built from source (0.19.0dev0+44.g8c47c65635, from commit 8c47c65635).

jeshi96 avatar Oct 08 '24 11:10 jeshi96

Hi, I just wanted to follow up on this and see if you had any idea of the possible cause. I saw that there was a new release and reran the tests on the latest commits of both dpnp and dpctl, but with the same result. Thanks!

jeshi96 avatar Nov 04 '24 19:11 jeshi96

Hi @jeshi96,

A PR is now open that should improve the performance of nan_to_num on all devices, including CPU, by implementing a dedicated kernel.

Previously, nan_to_num was implemented entirely as a top-level function; the dedicated kernel brings a significant performance gain, at least for this function.
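To illustrate the distinction (this is a hedged sketch, not the actual dpnp source): a "top-level" implementation composes generic elementwise operations, which means several passes over the data and temporary allocations, whereas a dedicated kernel can do the replacement in a single fused pass. A NumPy sketch of the composed form, handling only the NaN case (real nan_to_num also replaces +/-inf):

```python
import numpy as np

def nan_to_num_composed(x):
    # Two separate elementwise passes: build an isnan mask, then select.
    # Each pass materializes a temporary array; a fused kernel would not.
    return np.where(np.isnan(x), np.asarray(0.0, dtype=x.dtype), x)
```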

ndgrigorian avatar Dec 10 '24 19:12 ndgrigorian

Thanks so much for the update! I will take a look!

jeshi96 avatar Dec 20 '24 22:12 jeshi96

https://github.com/IntelPython/dpnp/pull/2228 is now merged into master, bringing some improvement (at least to nan_to_num).

The next step will be to re-use the new nan_to_num in nansum, nanprod, et al.
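The idea of re-using nan_to_num in the reductions can be sketched as follows (an assumption about the approach, not dpnp's actual implementation): replacing NaNs with the reduction's identity element (0 for a sum) and then applying the ordinary reduction gives the nan-aware result. Note this identity only holds for arrays without infinities, since nan_to_num also clamps +/-inf.

```python
import numpy as np

def nansum_via_nan_to_num(x):
    # NaNs contribute 0 to the sum once replaced, so a nan-aware sum
    # reduces to a plain sum over the cleaned array.
    return np.nan_to_num(x, nan=0.0).sum()
```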

ndgrigorian avatar Feb 05 '25 23:02 ndgrigorian

With https://github.com/IntelPython/dpnp/pull/2339 merged, all nan functions besides nanargmin and nanargmax have seen significant performance gains.

ndgrigorian avatar Mar 07 '25 07:03 ndgrigorian

Closing, as this has been implemented in the scope of the 0.18.0 release.

antonwolfy avatar Jun 20 '25 10:06 antonwolfy