[BUG] Launch conv kernel with offline ptx failed as invalid argument
I am trying to load offline-compiled PTX (built from the same CUDA source file) at runtime and launch the kernel with cuLaunchKernel, but examples/16_ampere_tensorop_conv2dfprop fails with driver error code 1.
> CUDA Driver error at .../cutlass/examples/16_ampere_tensorop_conv2dfprop/../../include/cutlass/conv/device/implicit_gemm_convolution.h:285 code=1(invalid argument) "cuLaunchKernel(kernel, grid.x, grid.y, grid.z, block.x, block.y, block.z, smem_size, stream, args, 0)"
> terminate called after throwing an instance of 'std::runtime_error'
> what(): CUDA driver failed.
The same launch call works if the kernel handle is obtained with cudaGetFuncBySymbol. I also tried examples/00_basic_gemm, where both methods run without errors.
Environment: CUDA 11.6 on Ubuntu 20, RTX 3090
Steps to reproduce:
- Apply the attached patch to latest code
- change LAUNCH_MODE to 0
- cd examples/16_ampere_tensorop_conv2dfprop/
- make
- cp tmp/*.ptx kernel.ptx
- change LAUNCH_MODE back to 2
- make clean && make
- ./case
It will show the error message as above.
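For readers without the patch, here is a minimal sketch of the driver-API path that these steps appear to exercise: load a PTX file, look up the kernel by its mangled name, and print a few function attributes. The patch itself is not shown in this thread, so this is an assumption about its general shape rather than a copy of it. The file path and kernel name come from the command line and are placeholders; the kernel is not launched because its arguments are not known here.

```cuda
// Sketch (assumption, not the attached patch): inspect a kernel loaded from PTX.
// Build with something like: nvcc -o inspect inspect.cpp -lcuda
#include <cuda.h>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <string>

// Abort with a readable message when a driver call fails.
#define CU_CHECK(call)                                                   \
  do {                                                                   \
    CUresult rc_ = (call);                                               \
    if (rc_ != CUDA_SUCCESS) {                                           \
      const char *msg_ = nullptr;                                        \
      cuGetErrorString(rc_, &msg_);                                      \
      std::fprintf(stderr, "%s failed: %s\n", #call, msg_ ? msg_ : "?"); \
      std::exit(1);                                                      \
    }                                                                    \
  } while (0)

int main(int argc, char **argv) {
  if (argc < 3) {
    std::fprintf(stderr, "usage: %s <kernel.ptx> <mangled kernel name>\n", argv[0]);
    return 1;
  }

  // Read the whole PTX file; c_str() supplies the terminating NUL that
  // cuModuleLoadData expects for a PTX image.
  std::ifstream in(argv[1]);
  if (!in) {
    std::fprintf(stderr, "cannot open %s\n", argv[1]);
    return 1;
  }
  std::stringstream buffer;
  buffer << in.rdbuf();
  const std::string image = buffer.str();

  CU_CHECK(cuInit(0));
  CUdevice dev;
  CU_CHECK(cuDeviceGet(&dev, 0));
  CUcontext ctx;
  CU_CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
  CU_CHECK(cuCtxSetCurrent(ctx));

  // JIT-compile the PTX and fetch the kernel handle by name.
  CUmodule mod;
  CU_CHECK(cuModuleLoadData(&mod, image.c_str()));
  CUfunction fn;
  CU_CHECK(cuModuleGetFunction(&fn, mod, argv[2]));

  // Attributes that commonly matter when a launch is rejected.
  int static_smem = 0, max_dynamic_smem = 0, max_threads = 0;
  CU_CHECK(cuFuncGetAttribute(&static_smem, CU_FUNC_ATTRIBUTE_SHARED_SIZE_BYTES, fn));
  CU_CHECK(cuFuncGetAttribute(&max_dynamic_smem,
                              CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, fn));
  CU_CHECK(cuFuncGetAttribute(&max_threads, CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, fn));
  std::printf("static smem: %d B, max dynamic smem: %d B, max threads/block: %d\n",
              static_smem, max_dynamic_smem, max_threads);

  CU_CHECK(cuModuleUnload(mod));
  CU_CHECK(cuDevicePrimaryCtxRelease(dev));
  return 0;
}
```

Comparing the printed limits against the smem_size and block dimensions passed to cuLaunchKernel is a cheap first check when the launch returns invalid argument.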
Sorry, I don't know. I checked your code and I cannot find anything obvious. Maybe you can change this https://github.com/NVIDIA/cutlass/blob/master/examples/16_ampere_tensorop_conv2dfprop/ampere_tensorop_conv2dfprop.cu#L167 to kAnalytic first. kOptimized uses complex params to compute things in the host before sending to the device. Maybe that part is wrong.
I know JIT-compiling cutlass conv is doable because many users use cutlass this way.
Thanks for the quick reply.
I tried kAnalytic and got the same invalid argument error.
In fact, the launch works with the same CUDA driver API call and the same parameters if I use the alternative way (cudaGetFuncBySymbol) to retrieve the kernel handle (LAUNCH_MODE=1 in the patch).
It seems the fatbin built into the executable might differ from the offline-compiled one. I also tried the following ways of obtaining an offline ptx/cubin/fatbin, and none of them works:
- nvcc compile time saved cubin
- nvcc compile time saved fatbin
- cuobjdump ptx from the executable
- cuobjdump sass from the executable
It probably needs help from the NVIDIA driver or compiler team to find the root cause.
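One difference between the two ways of getting a kernel handle is worth ruling out, although nothing in this thread confirms it as the cause. A kernel that requests 48 KB or more of dynamic shared memory has to be opted in through the MAX_DYNAMIC_SHARED_SIZE_BYTES function attribute, and a launch above the default limit without that opt-in is rejected as an invalid argument. An attribute set on the runtime's built-in kernel does not carry over to a separate CUfunction obtained from cuModuleGetFunction. The sketch below applies the opt-in to whichever handle is actually launched; `launch_with_smem_opt_in` is a hypothetical helper, not part of CUTLASS.

```cuda
// Sketch (assumption): set the dynamic shared memory opt-in on the same
// CUfunction handle that is passed to cuLaunchKernel.
#include <cuda.h>

// Hypothetical helper, not a CUTLASS or CUDA API.
inline CUresult launch_with_smem_opt_in(CUfunction kernel,
                                        unsigned grid_x, unsigned grid_y, unsigned grid_z,
                                        unsigned block_x, unsigned block_y, unsigned block_z,
                                        unsigned smem_bytes, CUstream stream, void **args) {
  // 48 KB is the default per-block dynamic shared memory limit.
  const unsigned kDefaultLimit = 48u << 10;
  if (smem_bytes >= kDefaultLimit) {
    // The attribute is stored per function handle, so it must be set on
    // the handle used for this launch.
    CUresult rc = cuFuncSetAttribute(kernel,
                                     CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
                                     static_cast<int>(smem_bytes));
    if (rc != CUDA_SUCCESS) return rc;
  }
  return cuLaunchKernel(kernel, grid_x, grid_y, grid_z,
                        block_x, block_y, block_z,
                        smem_bytes, stream, args, nullptr);
}
```

If the conv kernel needs less than 48 KB of dynamic shared memory, this check does not apply and the cause lies elsewhere.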
@mnicely May I know any updates for the bug? Thanks.
@mnicely @hwu36 any updates after 2 weeks? Thanks.
Sorry, I haven't looked into your case again. I know nvrtc + conv can work since many users make it work.
We are working on enabling cutlass conv with Python. You may take a look once it is available. See https://github.com/NVIDIA/cutlass/issues/474
@hwu36 I am enabling nvrtc with cutlass conv and ran into this issue. I suspected the runtime-compiled PTX might not be as expected, so I compared it with the PTX kept by nvcc; they were almost the same. Both PTX files produce the launch error described at the beginning. The gemm case works with either PTX generated by the same flow.
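For reference, the nvrtc side of such a flow can be sketched as follows. This is not a CUTLASS conv example: it compiles a small templated kernel (`scale`, a made-up stand-in) to PTX and asks nvrtc for the mangled name of one instantiation, which is the name cuModuleGetFunction would need. The `compute_86` target matches the RTX 3090 mentioned above. A real CUTLASS build would also need include-path options, which are left out here.

```cuda
// Sketch (assumption): compile a templated kernel to PTX with nvrtc and
// recover its mangled name. Build with something like:
//   nvcc -o jit jit.cpp -lnvrtc
#include <nvrtc.h>
#include <cstdio>
#include <string>
#include <vector>

// Made-up kernel used only for illustration.
static const char *kSource = R"(
template <typename T>
__global__ void scale(T *data, T factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}
)";

int main() {
  nvrtcProgram prog;
  if (nvrtcCreateProgram(&prog, kSource, "scale.cu", 0, nullptr, nullptr) != NVRTC_SUCCESS) {
    return 1;
  }

  // Registering the name expression before compilation instantiates the
  // template and lets nvrtc report the mangled name afterwards.
  const char *name_expr = "scale<float>";
  nvrtcAddNameExpression(prog, name_expr);

  const char *opts[] = {"--gpu-architecture=compute_86", "--std=c++11"};
  nvrtcResult rc = nvrtcCompileProgram(prog, 2, opts);

  // Print the compiler log whether or not compilation succeeded.
  size_t log_size = 0;
  nvrtcGetProgramLogSize(prog, &log_size);
  if (log_size > 1) {
    std::string log(log_size, '\0');
    nvrtcGetProgramLog(prog, &log[0]);
    std::fprintf(stderr, "%s\n", log.c_str());
  }
  if (rc != NVRTC_SUCCESS) {
    nvrtcDestroyProgram(&prog);
    return 1;
  }

  // The lowered name is owned by the program, so use it before destroying it.
  const char *lowered = nullptr;
  if (nvrtcGetLoweredName(prog, name_expr, &lowered) != NVRTC_SUCCESS) {
    nvrtcDestroyProgram(&prog);
    return 1;
  }

  size_t ptx_size = 0;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::vector<char> ptx(ptx_size);
  nvrtcGetPTX(prog, ptx.data());

  std::printf("mangled name: %s\nPTX size: %zu bytes\n", lowered, ptx_size);
  nvrtcDestroyProgram(&prog);
  return 0;
}
```

The PTX buffer can then be passed to cuModuleLoadData and the mangled name to cuModuleGetFunction.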
Any example for nvrtc + cutlass conv? waiting for #474?
waiting for https://github.com/NVIDIA/cutlass/issues/474?
Yes, if you still cannot figure it out.
pycutlass, introduced in CUTLASS 2.10, supports building conv kernels with nvrtc.