numba-dpex
Allocating memory inside prange fails
The knn workload in dpbench fails since it allocates memory inside prange, as shown below. To successfully execute knn, numba-dpex needs to add support for allocating memory inside prange loops.
for i in nb.prange(test_size):
    queue_neighbors = np.empty(shape=(k, 2))
    ...
How to reproduce:
Follow the instructions to set up dpbench, then run knn: python -c 'import dpbench; dpbench.run_benchmark("knn")'
@adarshyoga Can the allocation (https://github.com/IntelPython/dpbench/blob/64651cc2c364a3a004e5b38caa447a87938ed3bb/dpbench/benchmarks/knn/knn_numba_dpex_p.py#L25) be moved out of the prange?
I am in the process of designing the interface for our prange feature and starting a discussion on what a prange means for numba-dpex. In general, we have to clearly define the scope of what is and is not supported. The goal is also to make loop analysis and optimization possible by limiting the expressiveness of a prange.
Right now a prange is lowered entirely as a kernel, and there is no easy way to allow user functions (such as numpy.empty or dpnp.empty) in a kernel and still be able to deterministically optimize them away.
It is possible to move the allocation out for knn. It will require some changes to the prange loop's body to adjust indexing, which should not be difficult.
But the main implication could be the size of the allocation that has to be made outside the prange. We are allocating queue_neighbors, which is of size k*2, for each work item. If it is moved out, we will be making a total allocation of size test_size*k*2. If test_size is large, this allocation might fail. We will need to test it out.
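The hoisting described above can be sketched with plain NumPy, sequentially, to show the indexing change; the sizes and input arrays below are made-up stand-ins, and the outer loop would be nb.prange in the real kernel:

```python
import numpy as np

# Stand-in problem sizes and inputs (not the real dpbench data).
test_size, k, data_dim = 4, 3, 2
rng = np.random.default_rng(0)
x_train = rng.random((k, data_dim))
x_test = rng.random((test_size, data_dim))
y_train = rng.random(k)

# One (k, 2) slab per work item, allocated once outside the loop.
# Total size is test_size * k * 2, which may be large for big test_size.
queue_neighbors = np.empty(shape=(test_size, k, 2))

for i in range(test_size):  # nb.prange(test_size) in the kernel version
    for j in range(k):
        diff = x_train[j] - x_test[i]
        distance = float(np.dot(diff, diff))
        # Index the hoisted buffer by work item i as well as neighbor j.
        queue_neighbors[i, j, 0] = np.sqrt(distance)
        queue_neighbors[i, j, 1] = y_train[j]
```

Each work item only touches its own queue_neighbors[i] slice, so the loop body stays race-free after hoisting; the cost is the test_size-times-larger allocation noted above.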
Is the queue_neighbors not overwritten inside the prange on every loop iteration? Would we be able to allocate it once outside and reuse it every iteration?
queue_neighbors = np.empty(shape=(k, 2))
for j in range(k):
    x1 = x_train[j]
    x2 = x_test[i]
    distance = 0.0
    for jj in range(data_dim):
        diff = x1[jj] - x2[jj]
        distance += diff * diff
    dist = math.sqrt(distance)
    queue_neighbors[j, 0] = dist
    queue_neighbors[j, 1] = y_train[j]
Every iteration uses a private local copy of queue_neighbors. Just moving the allocation outside the prange will make it shared across all work items and result in a race condition.
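A toy sequential simulation of one interleaving that a parallel schedule could produce (hypothetical values, not the knn code) illustrates the hazard of a single shared buffer:

```python
import numpy as np

k = 2
shared = np.zeros((k, 2))  # one buffer shared by all work items


def fill(buf, dist):
    # Stand-in for the distance-filling loop body of a work item.
    for j in range(k):
        buf[j, 0] = dist


# Possible interleaving under a parallel schedule:
fill(shared, 10.0)  # work item 0 fills the shared buffer
fill(shared, 20.0)  # work item 1 overwrites it before item 0 consumes it
result_for_item_0 = shared[:, 0].copy()  # item 0 now reads item 1's data
```

With a private copy per iteration (or a per-work-item slice of a larger buffer), work item 0 would read back 10.0 rather than 20.0.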