HIP icon indicating copy to clipboard operation
HIP copied to clipboard

[HIP][bug] hipGraphAddKernelNode incompatible with hipFunction_t from hipModuleGetFunction

Open mlakardaniel opened this issue 4 years ago • 9 comments

Hi everyone. Just tried to build a graph in combination with using the Module API. When trying to create a kernel node with a kernel from a module, hipGraphAddKernelNode returns hipErrorInvalidDeviceFunction. This issue can be reproduced by the following code:

kernels.cu:

#include <hip_runtime.h>
extern "C" __global__ void foo()
{
}

hipGraphTest.cpp:

#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>

#include <iostream>
#include <stdexcept>

namespace
{
	void succeed(hipError_t e)
	{
		if(e != hipSuccess)
			throw std::runtime_error(hipGetErrorName(e));
	}
}
int main(int argc, char** argv)
{
	try
	{
		hipModule_t hip_module;
		succeed(hipModuleLoad(&hip_module, "kernels.hipfb"));

		hipFunction_t foo;
		succeed(hipModuleGetFunction(&foo, hip_module, "foo"));

		hipGraph_t graph;
		succeed(hipGraphCreate(&graph, 0));
		
		hipKernelNodeParams node_params = { 0 };
		node_params.func = (void*)foo;
		node_params.gridDim = dim3(1, 1, 1);
		node_params.blockDim = dim3(256, 1, 1);
		node_params.sharedMemBytes = 0;
		node_params.kernelParams = nullptr;
		node_params.extra = nullptr;
		
		hipGraphNode_t node;
		try
		{
			succeed(hipGraphAddKernelNode(&node, graph, nullptr, 0, &node_params));
		}
		catch(std::runtime_error const& e)
		{
			std::cout << "hipGraphAddKernelNode failed with " << e.what() << std::endl;
		}
		
		succeed(hipGraphDestroy(graph));
		succeed(hipModuleUnload(hip_module));
	}
	catch(...)
	{
		std::cout << "Something else broke" << std::endl;
	}
	
	return 0;
}

The code is built and run using:

hipcc kernels.cu --genco -o kernels.hipfb
hipcc hipGraphTest.cpp -o hipGraphTest
./hipGraphTest

Output:

hipGraphAddKernelNode failed with hipErrorInvalidDeviceFunction

hipconfig output:

HIP version  : 4.4.21432-f9dccde4

== hipconfig
HIP_PATH     : /opt/rocm-4.5.0/hip
ROCM_PATH    : /opt/rocm-4.5.0
HIP_COMPILER : clang
HIP_PLATFORM : amd
HIP_RUNTIME  : rocclr
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-4.5.0/hip/include -I/opt/rocm-4.5.0/llvm/bin/../lib/clang/13.0.0 -I/opt/rocm-4.5.0/hsa/include

== hip-clang
HSA_PATH         : /opt/rocm-4.5.0/hsa
HIP_CLANG_PATH   : /opt/rocm-4.5.0/llvm/bin
AMD clang version 13.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-4.5.0 21422 e2489b0d7ede612d6586c61728db321047833ed8)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-4.5.0/llvm/bin
LLVM (http://llvm.org/):
  LLVM version 13.0.0git
  Optimized build with assertions.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: znver1

  Registered Targets:
    amdgcn - AMD GCN GPUs
    r600   - AMD GPUs HD2XXX-HD6XXX
    x86    - 32-bit X86: Pentium-Pro and above
    x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags :  -std=c++11 -isystem "/opt/rocm-4.5.0/llvm/lib/clang/13.0.0/include/.." -isystem /opt/rocm-4.5.0/hsa/include -isystem "/opt/rocm-4.5.0/hip/include" -O3
hip-clang-ldflags  : --driver-mode=g++ -L"/opt/rocm-4.5.0/hip/lib" -O3 -lgcc_s -lgcc -lpthread -lm -lrt

=== Environment Variables
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

== Linux Kernel
Hostname     : G5-5505
Linux G5-5505 5.11.0-40-generic #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.3 LTS
Release:	20.04
Codename:	focal

Does that mean that the module API is currently not compatible with the graph API? If so, (0) can someone please confirm that? (1) are there plans to support that?

PS.: Creating the graph via stream captures does not work in combination with the module API either, as it simply does not capture the kernel launches. But I'm pretty sure this is related, so I'm not gonna open another issue on that.

mlakardaniel avatar Dec 29 '21 14:12 mlakardaniel

(0) can someone please confirm that?

We have observed similar behaviour when porting a CUDA code which uses graphs in combination with functions loaded from modules to HIP. In CUDA a function pointer obtained from cuModuleGetFunction can be added as a node in a graph whereas in HIP a function pointer obtained from hipModuleGetFunction can not be.

FreddieWitherden avatar Jan 20 '22 13:01 FreddieWitherden

Just reconfirming this issue is still present in ROCm 5.

FreddieWitherden avatar Feb 13 '22 20:02 FreddieWitherden

The issue is from our ISV partner. It has been already confirmed that it is a bug. The bug is critical for their software product GPU AUDIO which is based on ROCm software stack.

Colleagues, please start looking into it as it is of P1 priority now.

@mangupta, if your team has already been faced with the bug and the team knows the workaround, please give your advice.

emankov avatar Mar 17 '22 13:03 emankov

The issue is from our ISV partner. It has been already confirmed that it is a bug. The bug is critical for their software product GPU AUDIO which is based on ROCm software stack.

Colleagues, please start looking into it as it is of P1 priority now.

@mangupta, if your team has already been faced with the bug and the team knows the workaround, please give your advice.

Just checking in to see if there has been any progress in this direction?

FreddieWitherden avatar Apr 26 '22 23:04 FreddieWitherden

Appears to be fixed in 5.2. Other issues with the graph API remain, but this specific problem is now resolved.

FreddieWitherden avatar Jul 01 '22 20:07 FreddieWitherden

For me hipGraphAddKernelNode still fails with 5.2. Although the error changed to:

hipGraphAddKernelNode failed with hipErrorInvalidValue

hipconfig output:

HIP version  : 5.2.21151-afdc89f8

== hipconfig
HIP_PATH     : /opt/rocm-5.2.0
ROCM_PATH    : /opt/rocm-5.2.0
HIP_COMPILER : clang
HIP_PLATFORM : amd
HIP_RUNTIME  : rocclr
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__= -D__HIP_PLATFORM_AMD__= -I/opt/rocm-5.2.0/include -I/opt/rocm-5.2.0/llvm/bin/../lib/clang/14.0.0 -I/opt/rocm-5.2.0/hsa/include

== hip-clang
HSA_PATH         : /opt/rocm-5.2.0/hsa
HIP_CLANG_PATH   : /opt/rocm-5.2.0/llvm/bin
AMD clang version 14.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.2.0 22204 50d6d5d5b608d2abd6af44314abc6ad20036af3b)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.2.0/llvm/bin
AMD LLVM version 14.0.0git
  Optimized build.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: znver1

  Registered Targets:
    amdgcn - AMD GCN GPUs
    r600   - AMD GPUs HD2XXX-HD6XXX
    x86    - 32-bit X86: Pentium-Pro and above
    x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags :  -std=c++11 -isystem "/opt/rocm-5.2.0/llvm/lib/clang/14.0.0/include/.." -isystem /opt/rocm-5.2.0/hsa/include -isystem "/opt/rocm-5.2.0/include" -O3
hip-clang-ldflags  :  -L"/opt/rocm-5.2.0/lib" -O3 -lgcc_s -lgcc -lpthread -lm -lrt

=== Environment Variables
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

== Linux Kernel
Hostname     : G5-5505
Linux G5-5505 5.11.0-40-generic #44~20.04.2-Ubuntu SMP Tue Oct 26 18:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.4 LTS
Release:	20.04
Codename:	focal

mlakardaniel avatar Jul 14 '22 07:07 mlakardaniel

I have started seeing this issue again with 5.4.0.

FreddieWitherden avatar Jan 17 '23 14:01 FreddieWitherden

I can reproduce the hipErrorInvalidValue error with 5.6 as well. It looks like it comes from a check in hipGraphAddKernelNode for kernelParams and extra arguments not be set to nullptr at the same time:

https://github.com/ROCm-Developer-Tools/clr/blob/develop/hipamd/src/hip_graph.cpp#L78 i.e For this particular example, changing to a non void arg kernel and set the kernelParams should allow the test to go through.

However, I am not sure what is the reason for the check above and why HIP returns an error in this case, CUDA does not have this constraint. Will investigate more.

iassiour avatar Aug 30 '23 23:08 iassiour

I have created an internal PR to fix this, kernelParams and extra arguments should be allowed to be both set to nullptr if the kernel does not expect any arguments like in this example. Once the change is reviewed the change should appear in the external clr repo https://github.com/ROCm-Developer-Tools/clr. Will keep this issue open until that is done.

iassiour avatar Sep 04 '23 09:09 iassiour

@iassiour Has this issue been fixed? If so, can we close this ticket? Thanks!

ppanchad-amd avatar Apr 03 '24 14:04 ppanchad-amd

The specific error appears to have been fixed although graph support is still buggy (see PyFR #312)

FreddieWitherden avatar Apr 03 '24 22:04 FreddieWitherden