ginkgo
The kernel limitation of gridSize and blockSize
Due to CUDA's block/grid size limits, the maximum of gridDim.x is 2^31 - 1
and the maximum of blockDim.x is 1024.
We usually use these two parameters to compute the row index of a matrix.
If a kernel assigns one or more threads per row (such as ELL) and the matrix has a large number of rows, gridDim.x will exceed the limit.
Possible solutions:
- set a config value max_thread or max_grid: every kernel caps its grid size according to the config and adds a for loop to cover the remaining rows.
- use nwarps like the load-balanced COO and load-balanced CSR kernels.
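The first option could look roughly like this host-side sketch (the name max_grid and the helper functions are hypothetical, not Ginkgo's actual API): the launch clamps the grid to a configured maximum, and the kernel gets a grid-stride loop so each thread may handle several rows.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical config value, not Ginkgo's actual API.
constexpr std::int64_t max_grid = (std::int64_t{1} << 31) - 1;

// Grid size: ceil(num_rows / block_size), clamped to max_grid.
std::int64_t compute_grid(std::int64_t num_rows, std::int64_t block_size)
{
    const auto needed = (num_rows + block_size - 1) / block_size;
    return std::min(needed, max_grid);
}

// Host-side emulation of the grid-stride pattern. The real CUDA kernel
// body would be:
//   for (auto row = blockIdx.x * blockDim.x + threadIdx.x;
//        row < num_rows; row += gridDim.x * blockDim.x) { ... }
void process_rows(std::int64_t num_rows, std::int64_t grid,
                  std::int64_t block, std::vector<int>& visits)
{
    const auto stride = grid * block;
    for (std::int64_t tid = 0; tid < stride; ++tid) {  // every "thread"
        for (std::int64_t row = tid; row < num_rows; row += stride) {
            visits[row]++;  // each row is touched exactly once
        }
    }
}
```

The grid-stride loop costs one extra comparison per thread when the grid is not clamped, which is why it is usually considered cheap enough to enable unconditionally.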
I prefer 1 for the default behavior because it needs less effort and does not require tuning parameters.
Maybe set max_thread to 2^31 - 1 (int32) so that we can compute the tid without a static_cast,
and that many blocks should be enough to cover latency.
An alternative would be to use not just a 1D block index, but a 2D one (with the second dimension only used when we exceed the limit for gridDim.x). However, that would mean we have to rewrite basically all of the dim3 calculations, and the grid-index calculation in every kernel.
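The 2D alternative amounts to splitting a flat block index across (blockIdx.x, blockIdx.y) and reassembling it in the kernel. A rough host-side sketch of that mapping (hypothetical, not existing Ginkgo code):

```cpp
#include <cstdint>
#include <utility>

// gridDim.x limit in CUDA; note gridDim.y is itself limited to 65535.
constexpr std::int64_t max_grid_x = (std::int64_t{1} << 31) - 1;

// Split a flat block index into an (x, y) pair so x stays within the
// gridDim.x limit; this is what the dim3 setup on the host would do.
std::pair<std::int64_t, std::int64_t> to_2d(std::int64_t flat_block)
{
    return {flat_block % max_grid_x, flat_block / max_grid_x};
}

// Inside the kernel the flat index would be reconstructed as
//   flat = blockIdx.y * gridDim.x + blockIdx.x;
// (using the actually launched gridDim.x, assumed here to be max_grid_x).
std::int64_t to_flat(std::int64_t x, std::int64_t y)
{
    return y * max_grid_x + x;
}
```

This illustrates the objection above: every kernel's index computation would need to change from a single multiply-add to this two-level reconstruction.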
The first solution also sounds good to me, but I am not quite sure what you mean by the second solution.
@yhmtsai is this still relevant today, or has it been addressed by the ROCm runtime?