dpnp Implement asynchronous `fill` method using `dpctl` kernels

Open ndgrigorian opened this issue 5 months ago • 1 comment

This PR changes the `dpnp_array.fill` method to use `dpctl` kernels, which makes `fill` asynchronous and more efficient: it no longer indexes the array repeatedly or copies a scalar to the device for each element.
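To make the cost difference concrete, here is a toy model in plain Python. It is not the dpnp or dpctl implementation: `FakeDevice`, `set_item`, and `fill_kernel` are hypothetical names invented for this sketch, and the "device" is just a Python list that counts how many scalar transfers it receives.

```python
# Toy model only: FakeDevice is a hypothetical stand-in, not a dpnp or dpctl class.
class FakeDevice:
    """Holds a buffer and counts host-to-device scalar transfers."""

    def __init__(self, size):
        self.buffer = [0j] * size
        self.transfers = 0

    def set_item(self, index, scalar):
        # One indexing call and one scalar copy per element.
        self.transfers += 1
        self.buffer[index] = scalar

    def fill_kernel(self, scalar):
        # One scalar copy; a single "kernel" then writes every element.
        self.transfers += 1
        self.buffer[:] = [scalar] * len(self.buffer)


def fill_per_element(dev, value):
    """Fill by indexing each element in turn."""
    for i in range(len(dev.buffer)):
        dev.set_item(i, value)


def fill_bulk(dev, value):
    """Fill with a single kernel launch."""
    dev.fill_kernel(value)


slow = FakeDevice(10_000)
fill_per_element(slow, complex(10))

fast = FakeDevice(10_000)
fill_bulk(fast, complex(10))

print(slow.transfers, fast.transfers)  # 10000 1
```

Both paths leave the buffer in the same state; only the number of round trips to the device differs, and that is the overhead the kernel-based `fill` is meant to remove.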

On Iris Xe in WSL this gives a significant performance gain: filling a 10,000-element `c8` array drops from about 1.25 s to about 229 μs per call.

Before

In [1]: import dpnp as dnp

In [2]: x_dnp = dnp.empty(10000, dtype="c8")

In [3]: %timeit x_dnp.fill(10)
1.25 s ± 47.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit x_dnp.fill(10)
1.26 s ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After

In [8]: %timeit x_dnp.fill(10); q.wait()
229 μs ± 37.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
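The "After" timing waits on the queue because an asynchronous `fill` returns before the work has finished. The sketch below illustrates that point by analogy only, using the Python standard library: a single worker thread stands in for the device queue, the returned `Future` stands in for the event being waited on, and `fill_async` is a hypothetical helper, not a dpnp or dpctl function.

```python
import time
from concurrent.futures import ThreadPoolExecutor


# Analogy only: fill_async is a hypothetical helper, not a dpnp/dpctl API.
def fill_async(executor, buffer, value):
    """Submit a fill task and return immediately with a Future."""
    def kernel():
        time.sleep(0.05)  # pretend the device needs 50 ms
        buffer[:] = [value] * len(buffer)
    return executor.submit(kernel)


buf = [0] * 1000
with ThreadPoolExecutor(max_workers=1) as queue:
    start = time.perf_counter()
    event = fill_async(queue, buf, 10)
    submitted = time.perf_counter() - start  # submission cost only
    event.result()                           # block until the fill is done
    completed = time.perf_counter() - start  # submission plus the work itself

print(f"submit only: {submitted:.4f} s, submit + wait: {completed:.4f} s")
```

Timing only the submission would understate the cost, so a fair benchmark of an asynchronous call includes the wait.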

[x] Have you provided a meaningful PR description?
[ ] Have you added a test, reproducer or referred to issue with a reproducer?
[x] Have you tested your changes locally for CPU and GPU devices?
[x] Have you made sure that new changes do not introduce compiler warnings?
[x] Have you checked performance impact of proposed changes?
[x] If this PR is a work in progress, are you filing the PR as a draft?

Sep 17 '24 01:09 ndgrigorian
