numba-dpex
Allocating memory inside prange fails
The knn workload in dpbench fails since it allocates memory inside prange, as shown below. To successfully execute knn, numba-dpex needs to add support for allocating memory inside prange loops.
for i in nb.prange(test_size):
    queue_neighbors = np.empty(shape=(k, 2))
    ...
How to reproduce:
Follow the instructions to set up dpbench, then run knn: python -c 'import dpbench; dpbench.run_benchmark("knn")'
@adarshyoga Can the allocation (https://github.com/IntelPython/dpbench/blob/64651cc2c364a3a004e5b38caa447a87938ed3bb/dpbench/benchmarks/knn/knn_numba_dpex_p.py#L25) be moved out of the prange?
I am in the process of designing the interface for our prange feature and starting a discussion on what a prange means for numba-dpex. In general, we have to clearly define the scope of what is and is not supported. The goal is also to make loop analysis and optimization possible by limiting the expressiveness of a prange.
Right now a prange is lowered entirely as a kernel, and there is no easy way to allow user functions (such as numpy.empty or dpnp.empty) in a kernel and still be able to deterministically optimize them away.
It is possible to move the allocation out for knn. It will require some changes to the prange loop's body to adjust indexing, which should not be difficult.
But the main implication could be the size of the allocation that has to be made outside the prange. We are allocating queue_neighbors, which is of size k*2, for each work item. If it is moved out, we will be making a total allocation of size test_size*k*2. If test_size is large, this allocation might fail. We will need to test it out.
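The hoisting described above can be sketched with plain NumPy, sequentially, to show the indexing change; the sizes and input arrays below are made-up stand-ins, and the outer loop would be nb.prange in the real kernel:

```python
import numpy as np

# Stand-in problem sizes and inputs (not the real dpbench data).
test_size, k, data_dim = 4, 3, 2
rng = np.random.default_rng(0)
x_train = rng.random((k, data_dim))
x_test = rng.random((test_size, data_dim))
y_train = rng.random(k)

# One (k, 2) slab per work item, allocated once outside the loop.
# Total size is test_size * k * 2, which may be large for big test_size.
queue_neighbors = np.empty(shape=(test_size, k, 2))

for i in range(test_size):  # nb.prange(test_size) in the kernel version
    for j in range(k):
        diff = x_train[j] - x_test[i]
        distance = float(np.dot(diff, diff))
        # Index the hoisted buffer by work item i as well as neighbor j.
        queue_neighbors[i, j, 0] = np.sqrt(distance)
        queue_neighbors[i, j, 1] = y_train[j]
```

Each work item only touches its own queue_neighbors[i] slice, so the loop body stays race-free after hoisting; the cost is the test_size-times-larger allocation noted above.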
Is the queue_neighbors not overwritten inside the prange on every loop iteration? Would we be able to allocate it once outside and reuse it every iteration?
queue_neighbors = np.empty(shape=(k, 2))
for j in range(k):
    x1 = x_train[j]
    x2 = x_test[i]
    distance = 0.0
    for jj in range(data_dim):
        diff = x1[jj] - x2[jj]
        distance += diff * diff
    dist = math.sqrt(distance)
    queue_neighbors[j, 0] = dist
    queue_neighbors[j, 1] = y_train[j]
Every iteration uses a private local copy of queue_neighbors. Just moving the allocation outside the prange will make it shared across all work items and result in a race condition.
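A toy sequential simulation of one interleaving that a parallel schedule could produce (hypothetical values, not the knn code) illustrates the hazard of a single shared buffer:

```python
import numpy as np

k = 2
shared = np.zeros((k, 2))  # one buffer shared by all work items


def fill(buf, dist):
    # Stand-in for the distance-filling loop body of a work item.
    for j in range(k):
        buf[j, 0] = dist


# Possible interleaving under a parallel schedule:
fill(shared, 10.0)  # work item 0 fills the shared buffer
fill(shared, 20.0)  # work item 1 overwrites it before item 0 consumes it
result_for_item_0 = shared[:, 0].copy()  # item 0 now reads item 1's data
```

With a private copy per iteration (or a per-work-item slice of a larger buffer), work item 0 would read back 10.0 rather than 20.0.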