Implement asynchronous `fill` method using `dpctl` kernels
This PR changes the `dpnp_array.fill` method to use `dpctl` kernels, which makes `fill` asynchronous and more efficient. The new implementation avoids repeatedly indexing the array and copying a scalar to the device for each element.
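To make the cost difference concrete, here is a toy model in plain Python. It is not dpnp or dpctl code: `ToyDevice`, `set_item`, and `fill_kernel` are hypothetical names that only count how many separate submissions each strategy would need, assuming one submission per indexed element versus one for a whole-array kernel.

```python
class ToyDevice:
    """Hypothetical stand-in for a device buffer; counts submissions only."""

    def __init__(self, n):
        self.data = [0] * n
        self.submissions = 0

    def set_item(self, i, value):
        # Per-element path: one index operation plus one scalar copy each time.
        self.submissions += 1
        self.data[i] = value

    def fill_kernel(self, value):
        # Bulk path: a single submission covers the whole buffer.
        self.submissions += 1
        self.data = [value] * len(self.data)


def fill_per_element(buf, value):
    """Fill by indexing every element, one submission per element."""
    for i in range(len(buf.data)):
        buf.set_item(i, value)


def fill_bulk(buf, value):
    """Fill with one whole-buffer submission."""
    buf.fill_kernel(value)


slow = ToyDevice(10000)
fill_per_element(slow, 10)

fast = ToyDevice(10000)
fill_bulk(fast, 10)

print(slow.submissions, fast.submissions)  # 10000 1
```

Both strategies leave the buffer in the same state; only the number of submissions differs, which is the overhead this PR aims to remove.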
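The change shows significant performance gains on Iris Xe in WSL: in the measurements below, a `fill` on a 10,000-element `c8` array goes from about 1.25 s to about 229 μs per call.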
Before
In [1]: import dpnp as dnp
In [2]: x_dnp = dnp.empty(10000, dtype="c8")
In [3]: %timeit x_dnp.fill(10)
1.25 s ± 47.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit x_dnp.fill(10)
1.26 s ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
After
In [8]: %timeit x_dnp.fill(10); q.wait()
229 μs ± 37.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
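Because `fill` is now asynchronous, the "After" timing has to include a wait on the queue; otherwise it would only measure how long it takes to submit the work. The sketch below shows one way to reproduce such a measurement outside IPython. `time_async` and `time_dpnp_fill` are hypothetical helpers, and the sketch assumes that `q` in the session above is the array's queue, obtained here through `x.sycl_queue`.

```python
import time


def time_async(submit, wait, repeats=100):
    """Return mean seconds per call of submit() followed by wait()."""
    # Warm-up call so one-time setup cost is not included in the timing.
    submit()
    wait()
    start = time.perf_counter()
    for _ in range(repeats):
        submit()  # enqueue the asynchronous work
        wait()    # block until the queue has finished it
    return (time.perf_counter() - start) / repeats


def time_dpnp_fill(n=10000, dtype="c8", repeats=100):
    """Time dpnp's fill including synchronization (requires dpnp and a device)."""
    import dpnp  # imported here so the helper above works without dpnp

    x = dpnp.empty(n, dtype=dtype)
    q = x.sycl_queue  # assumption: this is the queue called `q` in the PR
    return time_async(lambda: x.fill(10), q.wait, repeats=repeats)
```

On a machine with dpnp installed and a SYCL device available, `time_dpnp_fill()` returns the mean time per synchronized `fill` call in seconds. Results will vary with hardware and driver.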
- [x] Have you provided a meaningful PR description?
- [ ] Have you added a test, reproducer or referred to issue with a reproducer?
- [x] Have you tested your changes locally for CPU and GPU devices?
- [x] Have you made sure that new changes do not introduce compiler warnings?
- [x] Have you checked performance impact of proposed changes?
- [x] If this PR is a work in progress, are you filing the PR as a draft?