"Undefined __global__ function" in PyTorch

Open aclex opened this issue 5 years ago • 6 comments

I'm experimenting with building PyTorch against ROCm, which in turn was built with ebuilds from here on Gentoo. I realize that this is a completely unsupported configuration, but could you please suggest where to look for the inconsistency in the build that causes the following problem? Every call to pretty much any function I tried in torch.nn.functional fails with RuntimeError: Undefined __global__ function. The minimal reproducing code is just:

import torch
import torch.nn.functional as F

x = torch.tensor([1, 2], device="cuda")
v = F.relu(x)

I've found some similar error output in a rocFFT issue, but I can't really tell whether it's connected to rocFFT or not.

Thanks in advance for any help or information!

aclex avatar Jan 28 '20 22:01 aclex

This would be better filed in ROCmSoftwarePlatform/pytorch.

Can you describe exactly how you built PyTorch? It is indeed unsupported on Gentoo, and the PyTorch build infrastructure is rather complex, so I suspect the issue is there.

iotamudelta avatar Jan 30 '20 23:01 iotamudelta

@iotamudelta thanks for your attention and suggestions! Yes, I'll file a question there as well, with a reference to this issue.

As for the build, I pretty much replicate the AMD build process from the .jenkins directory, but compile the C++ part first with CMake and then the Python part on top of it, following this ebuild. So I'm just trying to find a way to approach the problem, or at least a way to compile the kernel manually.

I also tend to think the problem is in my PyTorch build rather than in the builds of the ROCm components since, for example, this test example works fine.

Anyway, thank you very much for your help.

aclex avatar Jan 30 '20 23:01 aclex

OK, that's a bit hard for me to map to the way we do things, so some general questions. You seem to pass USE_ROCM=1 to the CMake part, which is correct. Do you also make sure to run hipification beforehand? Do you pass USE_ROCM=1 to setup.py?
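
One way to check from Python whether the ROCm setting made it into the installed package is to look at what the build reports about itself. The following is only a minimal sketch: `describe_torch_build` is a hypothetical helper name, and it assumes a PyTorch version that exposes `torch.version.hip` (the `getattr` fallback covers versions that do not). It does not diagnose the error itself; it only shows which backend the build claims to use.

```python
import importlib.util


def describe_torch_build():
    """Return a dict describing the installed PyTorch build, or None if torch is absent.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    if importlib.util.find_spec("torch") is None:
        return None

    import torch

    info = {
        "torch_version": torch.__version__,
        # Expected to be a version string on ROCm builds and None otherwise
        # (assumption: the attribute may be missing on older versions).
        "hip_version": getattr(torch.version, "hip", None),
        "cuda_version": getattr(torch.version, "cuda", None),
        # ROCm builds reuse the torch.cuda namespace, as in the reproducer above.
        "gpu_available": torch.cuda.is_available(),
    }
    if info["gpu_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    print(describe_torch_build())
```

If `hip_version` comes back as `None` on a build that was meant to target ROCm, that would suggest the Python part was not built with USE_ROCM=1.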

iotamudelta avatar Jan 31 '20 00:01 iotamudelta

Yes, for the ROCm build I both set USE_ROCM=1 and run the tools/amd_build/build_amd.py script to do the hipification. It builds fine, but I'm not sure about passing USE_ROCM=1 to setup.py afterwards; I will double-check to be sure. The build itself finishes successfully; the problem appears only at runtime.
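
The two steps discussed here (hipification first, then a Python build that sees USE_ROCM=1) can be written down as a small script so that the environment variable is not forgotten. This is a rough sketch under assumptions, not the official build procedure: `rocm_build_steps` and `run_steps` are hypothetical helpers, the `develop` subcommand of setup.py is an assumption (the thread does not say which one is used), and the commands are expected to run from the root of a PyTorch source checkout.

```python
import os
import subprocess
import sys


def rocm_build_steps():
    """Return (argv, extra_env) pairs for the two steps discussed in the thread."""
    return [
        # Step 1: the hipification script named in the thread.
        ([sys.executable, "tools/amd_build/build_amd.py"], {}),
        # Step 2: the Python build with USE_ROCM=1 in its environment.
        # The "develop" subcommand is an assumption made for illustration.
        ([sys.executable, "setup.py", "develop"], {"USE_ROCM": "1"}),
    ]


def run_steps(source_dir, dry_run=True):
    """Print each step and, unless dry_run is set, execute it in source_dir."""
    for argv, extra_env in rocm_build_steps():
        env = {**os.environ, **extra_env}
        print("would run:" if dry_run else "running:", " ".join(argv), extra_env)
        if not dry_run:
            subprocess.run(argv, cwd=source_dir, env=env, check=True)


if __name__ == "__main__":
    # Dry run by default, so nothing is built until dry_run=False is passed.
    run_steps(".", dry_run=True)
```

Keeping the environment override next to the command makes it easy to see whether setup.py actually received USE_ROCM=1.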

aclex avatar Jan 31 '20 06:01 aclex

@aclex, do you still see the issue with the latest ROCm?

SarbojitAMD avatar Jun 29 '22 09:06 SarbojitAMD

I can't confirm it for 5.0; unfortunately, I haven't built PyTorch against it yet. Feel free to close this for now, and I'll reopen it if the problem is still there.

aclex avatar Jun 29 '22 15:06 aclex

Closing. Please reopen if it occurs with the latest ROCm 6.0.2 (HIP 6.0.32831).

ppanchad-amd avatar Mar 18 '24 19:03 ppanchad-amd