
[CONV] fix naive conv kernel for large tensors

Open · bghimireamd opened this pull request 1 year ago · 5 comments

For larger tensors I was seeing gdims[0] : 38,654,705,664, which exceeds globalWorkSizeX : 4,294,967,295 (the maximum allowed by uint32_t). ~~MaxGridDimX : 2,147,483,647~~

gdims[0] was exceeding ~~MaxGridDimX~~ globalWorkSizeX for the driver command below.

./bin/MIOpenDriver convbfp16 -n 589824 -c 256 -H 4 -w 34 -k 256 -y 1 -x 3 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1

In this PR I reduce gdims[0] so that it stays within MaxGridDimX.
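
A minimal HIP sketch of the general clamp-plus-loop technique, not the actual MIOpen naive-conv kernel; the kernel body, block size, and the 2^31 - 1 cap are illustrative assumptions:

```cpp
// Sketch only: cap gridDim.x at the device limit and cover the remaining
// elements with a grid-stride loop inside the kernel.
#include <hip/hip_runtime.h>
#include <algorithm>
#include <cstdint>

__global__ void strided_elementwise(const float* in, float* out, uint64_t n)
{
    // Each thread starts at its global id and strides by the total number of
    // launched threads, so any n is covered even when the grid is clamped.
    const uint64_t stride = static_cast<uint64_t>(gridDim.x) * blockDim.x;
    for(uint64_t i = static_cast<uint64_t>(blockIdx.x) * blockDim.x + threadIdx.x; i < n;
        i += stride)
        out[i] = in[i];
}

void launch(const float* in, float* out, uint64_t n, hipStream_t stream)
{
    constexpr uint32_t block_size = 256;
    const uint64_t blocks_needed  = (n + block_size - 1) / block_size;
    // Assumed cap of 2^31 - 1 (MaxGridDimX in the discussion above); the
    // grid-stride loop picks up whatever the clamped grid cannot reach in one pass.
    const uint32_t grid_size =
        static_cast<uint32_t>(std::min<uint64_t>(blocks_needed, 2147483647ULL));
    hipLaunchKernelGGL(strided_elementwise, dim3(grid_size), dim3(block_size), 0, stream,
                       in, out, n);
}
```

The real kernel indexes N/K/H/W rather than doing a flat copy, but the idea is the same: clamp the launched grid to the device limit and let the in-kernel loop cover the elements that a single pass of the clamped grid cannot reach.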

bghimireamd · Dec 12 '24

> can we add a unit test for this?

It's barely possible, at least not without affecting current development and CI, and at the very least the test should be added as a special "huge tensor" test - a specifically tailored case for specific machines.

  1. we may have memory problems on CI machines - not every GPU can hold such huge tensors (probably not an issue for MI200+ cards)
  2. we will have a problem with tensor initialization on the host - the problem probably does not exist for convolutions, since Artem replaced the CPU-based tensor init with a GPU-based one (and probably only for some particular datatypes), but all the other algorithms do not use GPU initialization
  3. we will have problems with verification - even if we initialize and compute everything on the GPU, we still have to copy the results and the reference data back to the CPU and run slow CPU-based verification
  4. almost the same as 3 - since it's a naive algorithm, we use the CPU for its verification, and CPU convolution is quite slow for huge tensors

CAHEK7 · Dec 13 '24

We do have https://github.com/ROCm/MIOpen/blob/develop/test/gpu_reference_kernel.cpp

bghimireamd · Dec 13 '24

Yes, that's a naive, single-threaded, ultra-slow CPU verification of the naive GPU algorithm. That test is not about "huge" tensors; it has exactly the problems I described.

CAHEK7 · Dec 13 '24

Yes, we do need to do the slow CPU run. I can make the test a nightly run.

bghimireamd · Dec 16 '24

I'm not sure that we do. It depends on how we treat the reference data.

For example, when two algorithms agree during a manual run, we can assume that they produce the same data. Following that logic, we can take the following steps:

  1. manually run naive GPU vs. naive CPU and make sure that both produce the same output
  2. add the same case, but using the naive GPU and a non-naive GPU algorithm - if the naive GPU produced the correct result in step 1 and the naive and non-naive GPU results match, everything is fine.

The only way this can break is if we break the naive and non-naive implementations simultaneously and in exactly the same way - in that case both algorithms produce the same wrong result and the test still passes. In all other cases, the test will indicate that either the naive or the non-naive version is broken, and that's enough to start checking everything manually.
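
As a self-contained sketch of the comparison step for that GPU-vs-GPU consensus check (the two vectors stand in for the naive and non-naive GPU outputs copied back to the host; the tolerances are illustrative, not MIOpen's actual thresholds):

```cpp
// Sketch of a tolerance-based cross-check between two result buffers.
#include <cmath>
#include <cstdio>
#include <vector>

bool all_close(const std::vector<float>& a, const std::vector<float>& b,
               float rtol = 1e-3f, float atol = 1e-5f)
{
    if(a.size() != b.size())
        return false;
    for(size_t i = 0; i < a.size(); ++i)
    {
        const float diff = std::fabs(a[i] - b[i]);
        if(diff > atol + rtol * std::fabs(b[i]))
        {
            std::printf("mismatch at %zu: %f vs %f\n", i, a[i], b[i]);
            return false;
        }
    }
    return true;
}

int main()
{
    // Placeholder data; in the real test these would come from the two GPU runs.
    std::vector<float> naive_gpu(16, 1.0f);
    std::vector<float> fast_gpu(16, 1.0f);
    std::printf("consensus: %s\n", all_close(naive_gpu, fast_gpu) ? "yes" : "no");
    return 0;
}
```

If the two GPU results disagree beyond tolerance, at least one of the implementations has regressed, which is exactly the signal needed to start a manual investigation.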

CAHEK7 · Dec 17 '24

This still needs to be fixed but is now stale so closing.

BradPepersAMD · Jul 24 '25