HIP
"Undefined __global__ function" in PyTorch
I'm experimenting with building PyTorch against ROCm, which in turn was built with ebuilds from here on Gentoo. I realize that this is a completely unsupported configuration, but could you please suggest where to look for the inconsistency in the build that causes the following problem? Pretty much every function I tried in torch.nn.functional fails with RuntimeError: Undefined __global__ function. A minimal reproducing example is just:
import torch
import torch.nn.functional as F
x = torch.tensor([1, 2], device="cuda")
v = F.relu(x)
I've found similar error output in a rocFFT issue, but can't really tell whether it's connected to rocFFT or not.
Thanks in advance for any help or information!
This would be better filed in ROCmSoftwarePlatform/pytorch.
Can you describe exactly how you built PyTorch? It is indeed unsupported on Gentoo, and the PyTorch build infrastructure is rather complex, so I suspect the issue is there.
@iotamudelta thanks for your attention and suggestions! Yes, I'll file a question there as well with the reference here.
As for the build, I pretty much replicate the AMD build process from the .jenkins directory, but compile the C++ part first with CMake and then build the Python part on top of it, following this ebuild. So I'm just trying to approach the problem, or at least find a way to compile the kernel manually.
I also tend to think the problem is in my PyTorch build rather than in the builds of the ROCm components since, for example, this test example works fine.
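One way to narrow this down further is to split the repro into separate steps and see which one raises. The following is only a minimal sketch: the helper name run_steps is made up for illustration, it uses float values rather than the integers from the original repro, and it assumes that on a ROCm build the "cuda" device string maps to the HIP device.

```python
def run_steps(device="cuda"):
    """Run the repro step by step and return a list of (step, outcome) pairs.

    run_steps is a hypothetical helper, not part of PyTorch.
    """
    try:
        import torch
        import torch.nn.functional as F
    except ImportError:
        return [("import torch", "skipped: torch is not installed")]

    state = {}

    def create_tensor():
        # Float values are an assumption here; the original repro used integers.
        state["x"] = torch.tensor([1.0, 2.0], device=device)

    def apply_relu():
        # This is the call that fails in the report above.
        state["v"] = F.relu(state["x"])

    def copy_to_host():
        # Copying back forces the result to be materialized on the CPU.
        state["v"].cpu()

    results = []
    steps = [
        ("create tensor", create_tensor),
        ("F.relu", apply_relu),
        ("copy to host", copy_to_host),
    ]
    for name, step in steps:
        try:
            step()
            results.append((name, "ok"))
        except Exception as exc:  # stop at the first failing step
            results.append((name, "failed: {}".format(exc)))
            break
    return results


if __name__ == "__main__":
    for device in ("cpu", "cuda"):
        for name, outcome in run_steps(device):
            print(device, name, outcome)
```

If tensor creation succeeds on "cuda" but F.relu fails, that points at the compiled kernels rather than at device setup.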
Anyway, thank you very much for your help.
OK, that's a bit hard for me to map to the way we do things. So some general questions. You seem to supply USE_ROCM=1 to the cmake parts, and that's correct. Do you also make sure to run hipification beforehand? Do you also supply USE_ROCM=1 to setup.py?
Yes, for the ROCm build I both set USE_ROCM=1 and run the tools/amd_build/build_amd.py script to do the hipification. It builds fine, but I'm not sure about passing USE_ROCM=1 to setup.py afterwards; I will double-check to be sure. The build itself finishes successfully; the problem occurs only at runtime.
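A quick sanity check for the setup.py part is to ask the installed torch what it thinks it was built with. This is only a sketch: the helper name describe_torch_build is made up for illustration, and it assumes that torch.version.hip is None (or absent on older versions) when the Python part was not built with ROCm.

```python
def describe_torch_build():
    """Collect what the installed torch reports about its own build.

    describe_torch_build is a hypothetical helper, not part of PyTorch.
    """
    try:
        import torch
    except ImportError:
        return {"installed": False}

    return {
        "installed": True,
        "torch_version": torch.__version__,
        # Assumed to be None or missing when the build is not a ROCm build.
        "hip_version": getattr(torch.version, "hip", None),
        # Set on CUDA builds; shown only for comparison.
        "cuda_version": getattr(torch.version, "cuda", None),
        # On ROCm builds the torch.cuda namespace is backed by HIP.
        "gpu_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(key, "=", value)
```

If hip_version comes back empty even though the C++ part was configured with USE_ROCM=1, the two halves of the build probably disagree about the backend.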
@aclex, do you still see the issue with the latest ROCm?
I can't confirm it for 5.0; unfortunately, I haven't built PyTorch against it yet. Feel free to close this for now, and I'll reopen it if the problem is still there.
Closing. Please re-open if it occurs with the latest ROCm 6.0.2 (HIP 6.0.32831).