ginkgo
The kernel limitation of gridSize and blockSize
Due to CUDA's block/grid size limits, the maximum of gridDim.x is 2^31 - 1
and the maximum of blockDim.x is 1024.
We usually use these two parameters to compute the row index of a matrix.
If a kernel assigns one or more threads per row (such as ELL) and the matrix has a large number of rows, gridDim.x will exceed the limit.
Possible solutions:
- set a config value max_thread or max_grid: every kernel caps its grid size according to the config and adds a for loop to cover the remaining rows.
- use nwarps like the load-balanced COO and load-balanced CSR kernels.
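The first option could look roughly like this host-side sketch (the name max_grid and the helper functions are hypothetical, not Ginkgo's actual API): the launch clamps the grid to a configured maximum, and the kernel gets a grid-stride loop so each thread may handle several rows.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical config value, not Ginkgo's actual API.
constexpr std::int64_t max_grid = (std::int64_t{1} << 31) - 1;

// Grid size: ceil(num_rows / block_size), clamped to max_grid.
std::int64_t compute_grid(std::int64_t num_rows, std::int64_t block_size)
{
    const auto needed = (num_rows + block_size - 1) / block_size;
    return std::min(needed, max_grid);
}

// Host-side emulation of the grid-stride pattern. The real CUDA kernel
// body would be:
//   for (auto row = blockIdx.x * blockDim.x + threadIdx.x;
//        row < num_rows; row += gridDim.x * blockDim.x) { ... }
void process_rows(std::int64_t num_rows, std::int64_t grid,
                  std::int64_t block, std::vector<int>& visits)
{
    const auto stride = grid * block;
    for (std::int64_t tid = 0; tid < stride; ++tid) {  // every "thread"
        for (std::int64_t row = tid; row < num_rows; row += stride) {
            visits[row]++;  // each row is touched exactly once
        }
    }
}
```

The grid-stride loop costs one extra comparison per thread when the grid is not clamped, which is why it is usually considered cheap enough to enable unconditionally.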
I prefer 1 for the default behavior because it needs less effort and does not require tuning parameters.
Maybe set max_thread to 2^31 - 1 (int32) so that we can compute the tid without a static_cast,
and that many blocks should be enough to cover latency.
An alternative would be to use not just a 1D block index, but a 2D one (with the second dimension only used when we exceed the limit for gridDim.x). However, that would mean we have to rewrite basically all of the dim3 calculations, and the grid-index calculation in every kernel.
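The 2D alternative amounts to splitting a flat block index across (blockIdx.x, blockIdx.y) and reassembling it in the kernel. A rough host-side sketch of that mapping (hypothetical, not existing Ginkgo code):

```cpp
#include <cstdint>
#include <utility>

// gridDim.x limit in CUDA; note gridDim.y is itself limited to 65535.
constexpr std::int64_t max_grid_x = (std::int64_t{1} << 31) - 1;

// Split a flat block index into an (x, y) pair so x stays within the
// gridDim.x limit; this is what the dim3 setup on the host would do.
std::pair<std::int64_t, std::int64_t> to_2d(std::int64_t flat_block)
{
    return {flat_block % max_grid_x, flat_block / max_grid_x};
}

// Inside the kernel the flat index would be reconstructed as
//   flat = blockIdx.y * gridDim.x + blockIdx.x;
// (using the actually launched gridDim.x, assumed here to be max_grid_x).
std::int64_t to_flat(std::int64_t x, std::int64_t y)
{
    return y * max_grid_x + x;
}
```

This illustrates the objection above: every kernel's index computation would need to change from a single multiply-add to this two-level reconstruction.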
The first solution also sounds good to me, but I am not quite sure what you mean by the second solution.
@yhmtsai is this still relevant today, or has it been addressed by the ROCm runtime?