[HIP][bug] hipGraphAddKernelNode incompatible with hipFunction_t from hipModuleGetFunction
Hi everyone. Just tried to build a graph in combination with using the Module API. When trying to create a kernel node with a kernel from a module, hipGraphAddKernelNode returns hipErrorInvalidDeviceFunction. This issue can be reproduced by the following code:
kernels.cu:
#include <hip_runtime.h>
extern "C" __global__ void foo()
{
}
hipGraphTest.cpp:
#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>
#include <iostream>
#include <stdexcept>
namespace
{
void succeed(hipError_t e)
{
if(e != hipSuccess)
throw std::runtime_error(hipGetErrorName(e));
}
}
int main(int argc, char** argv)
{
try
{
hipModule_t hip_module;
succeed(hipModuleLoad(&hip_module, "kernels.hipfb"));
hipFunction_t foo;
succeed(hipModuleGetFunction(&foo, hip_module, "foo"));
hipGraph_t graph;
succeed(hipGraphCreate(&graph, 0));
hipKernelNodeParams node_params = { 0 };
node_params.func = (void*)foo;
node_params.gridDim = dim3(1, 1, 1);
node_params.blockDim = dim3(256, 1, 1);
node_params.sharedMemBytes = 0;
node_params.kernelParams = nullptr;
node_params.extra = nullptr;
hipGraphNode_t node;
try
{
succeed(hipGraphAddKernelNode(&node, graph, nullptr, 0, &node_params));
}
catch(std::runtime_error const& e)
{
std::cout << "hipGraphAddKernelNode failed with " << e.what() << std::endl;
}
succeed(hipGraphDestroy(graph));
succeed(hipModuleUnload(hip_module));
}
catch(...)
{
std::cout << "Something else broke" << std::endl;
}
return 0;
}
The code is built and run using:
hipcc kernels.cu --genco -o kernels.hipfb
hipcc hipGraphTest.cpp -o hipGraphTest
./hipGraphTest
Output:
hipGraphAddKernelNode failed with hipErrorInvalidDeviceFunction
hipconfig output:
HIP version : 4.4.21432-f9dccde4
== hipconfig
HIP_PATH : /opt/rocm-4.5.0/hip
ROCM_PATH : /opt/rocm-4.5.0
HIP_COMPILER : clang
HIP_PLATFORM : amd
HIP_RUNTIME : rocclr
CPP_CONFIG : -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-4.5.0/hip/include -I/opt/rocm-4.5.0/llvm/bin/../lib/clang/13.0.0 -I/opt/rocm-4.5.0/hsa/include
== hip-clang
HSA_PATH : /opt/rocm-4.5.0/hsa
HIP_CLANG_PATH : /opt/rocm-4.5.0/llvm/bin
AMD clang version 13.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-4.5.0 21422 e2489b0d7ede612d6586c61728db321047833ed8)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-4.5.0/llvm/bin
LLVM (http://llvm.org/):
LLVM version 13.0.0git
Optimized build with assertions.
Default target: x86_64-unknown-linux-gnu
Host CPU: znver1
Registered Targets:
amdgcn - AMD GCN GPUs
r600 - AMD GPUs HD2XXX-HD6XXX
x86 - 32-bit X86: Pentium-Pro and above
x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags : -std=c++11 -isystem "/opt/rocm-4.5.0/llvm/lib/clang/13.0.0/include/.." -isystem /opt/rocm-4.5.0/hsa/include -isystem "/opt/rocm-4.5.0/hip/include" -O3
hip-clang-ldflags : --driver-mode=g++ -L"/opt/rocm-4.5.0/hip/lib" -O3 -lgcc_s -lgcc -lpthread -lm -lrt
=== Environment Variables
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
== Linux Kernel
Hostname : G5-5505
Linux G5-5505 5.11.0-40-generic #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal
Does that mean that the module API is currently not compatible with the graph API? If so, (0) can someone please confirm that? (1) are there plans to support that?
PS.: Creating the graph via stream captures does not work in combination with the module API either, as it simply does not capture the kernel launches. But I'm pretty sure this is related, so I'm not gonna open another issue on that.
(0) can someone please confirm that?
We have observed similar behaviour when porting a CUDA code which uses graphs in combination with functions loaded from modules to HIP. In CUDA a function pointer obtained from cuModuleGetFunction can be added as a node in a graph whereas in HIP a function pointer obtained from hipModuleGetFunction can not be.
Just reconfirming this issue is still present in ROCm 5.
The issue is from our ISV partner. It has been already confirmed that it is a bug. The bug is critical for their software product GPU AUDIO which is based on ROCm software stack.
Colleagues, please start looking into it as it is of P1 priority now.
@mangupta, if your team has already been faced with the bug and the team knows the workaround, please give your advice.
The issue is from our ISV partner. It has been already confirmed that it is a bug. The bug is critical for their software product GPU AUDIO which is based on ROCm software stack.
Colleagues, please start looking into it as it is of P1 priority now.
@mangupta, if your team has already been faced with the bug and the team knows the workaround, please give your advice.
Just checking in to see if there has been any progress in this direction?
Appears to be fixed in 5.2. Other issues with the graph API remain, but this specific problem is now resolved.
For me hipGraphAddKernelNode still fails with 5.2. Although the error changed to:
hipGraphAddKernelNode failed with hipErrorInvalidValue
hipconfig output:
HIP version : 5.2.21151-afdc89f8
== hipconfig
HIP_PATH : /opt/rocm-5.2.0
ROCM_PATH : /opt/rocm-5.2.0
HIP_COMPILER : clang
HIP_PLATFORM : amd
HIP_RUNTIME : rocclr
CPP_CONFIG : -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-5.2.0/include -I/opt/rocm-5.2.0/llvm/bin/../lib/clang/14.0.0 -I/opt/rocm-5.2.0/hsa/include
== hip-clang
HSA_PATH : /opt/rocm-5.2.0/hsa
HIP_CLANG_PATH : /opt/rocm-5.2.0/llvm/bin
AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.0 22204 50d6d5d5b608d2abd6af44314abc6ad20036af3b)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.2.0/llvm/bin
AMD LLVM version 14.0.0git
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: znver1
Registered Targets:
amdgcn - AMD GCN GPUs
r600 - AMD GPUs HD2XXX-HD6XXX
x86 - 32-bit X86: Pentium-Pro and above
x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags : -std=c++11 -isystem "/opt/rocm-5.2.0/llvm/lib/clang/14.0.0/include/.." -isystem /opt/rocm-5.2.0/hsa/include -isystem "/opt/rocm-5.2.0/include" -O3
hip-clang-ldflags : -L"/opt/rocm-5.2.0/lib" -O3 -lgcc_s -lgcc -lpthread -lm -lrt
=== Environment Variables
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
== Linux Kernel
Hostname : G5-5505
Linux G5-5505 5.11.0-40-generic #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal
I have started seeing this issue again with 5.4.0.
I can reproduce the hipErrorInvalidValue error with 5.6 as well. It looks like it comes from a check in hipGraphAddKernelNode for kernelParams and extra arguments not be set to nullptr at the same time:
https://github.com/ROCm-Developer-Tools/clr/blob/develop/hipamd/src/hip_graph.cpp#L78 i.e For this particular example, changing to a non void arg kernel and set the kernelParams should allow the test to go through.
However, I am not sure what is the reason for the check above and why HIP returns an error in this case, CUDA does not have this constraint. Will investigate more.
I have created an internal PR to fix this, kernelParams and extra arguments should be allowed to be both set to nullptr if the kernel does not expect any arguments like in this example. Once the change is reviewed the change should appear in the external clr repo https://github.com/ROCm-Developer-Tools/clr. Will keep this issue open until that is done.
@iassiour Has this issue been fixed? If so, can we close this ticket? Thanks!
The specific error appears to have been fixed although graph support is still buggy (see PyFR #312)