"RuntimeError: HIP error: invalid device function" when running "mnist" on 7900XTX

Open SuGotLand opened this issue 8 months ago • 0 comments

Context

  • PyTorch version: 2.6.0+rocm6.2.4
  • Operating System and version: Ubuntu 24.04.2 LTS x86_64

Your Environment

  • Installed using source? [yes/no]: no
  • Are you planning to deploy it using docker container? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU
  • Which example are you using: mnist
  • Link to code or data to repro [if any]: mnist

Expected Behavior

Train Epoch: 1 [0/60000 (0%)]	Loss: 2.326473
Train Epoch: 1 [640/60000 (1%)]	Loss: 1.377825
Train Epoch: 1 [1280/60000 (2%)]	Loss: 0.828890
Train Epoch: 1 [1920/60000 (3%)]	Loss: 0.623807
Train Epoch: 1 [2560/60000 (4%)]	Loss: 0.447925
Train Epoch: 1 [3200/60000 (5%)]	Loss: 0.293224
Train Epoch: 1 [3840/60000 (6%)]	Loss: 0.163648
Train Epoch: 1 [4480/60000 (7%)]	Loss: 0.633399
Train Epoch: 1 [5120/60000 (9%)]	Loss: 0.226126
Train Epoch: 1 [5760/60000 (10%)]	Loss: 0.226796
...

Current Behavior

Traceback (most recent call last):
  File "/home/USER/Desktop/PYTHON Document/examples/mnist/main.py", line 147, in <module>
    main()
  File "/home/USER/Desktop/PYTHON Document/examples/mnist/main.py", line 138, in main
    train(args, model, device, train_loader, optimizer, epoch)
  File "/home/USER/Desktop/PYTHON Document/examples/mnist/main.py", line 45, in train
    output = model(data)
             ^^^^^^^^^^^
  File "/home/USER/Desktop/PYTHON Document/PhyRevE/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/USER/Desktop/PYTHON Document/PhyRevE/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/USER/Desktop/PYTHON Document/examples/mnist/main.py", line 25, in forward
    x = self.conv1(x)
        ^^^^^^^^^^^^^
  File "/home/USER/Desktop/PYTHON Document/PhyRevE/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/USER/Desktop/PYTHON Document/PhyRevE/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/USER/Desktop/PYTHON Document/PhyRevE/.venv/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 554, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/USER/Desktop/PYTHON Document/PhyRevE/.venv/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 549, in _conv_forward
    return F.conv2d(
           ^^^^^^^^^
RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

Possible Solution

export HIP_VISIBLE_DEVICES=1
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export PYTORCH_ROCM_ARCH="gfx1100"

I tried setting these environment variables, but the error persists.
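
One way to narrow this down is to compare the architecture each visible GPU reports with the architectures the installed wheel was built for, since "invalid device function" usually means that no compiled kernel matches the device. The sketch below is an illustration, not part of the original report: base_arch, arch_supported and report are hypothetical helper names, and it assumes a ROCm build of PyTorch in which torch.cuda.get_arch_list() and the gcnArchName device property are available.

```python
import os


def base_arch(gcn_arch_name: str) -> str:
    """Drop feature suffixes, e.g. 'gfx90a:sramecc+:xnack-' -> 'gfx90a'."""
    return gcn_arch_name.split(":")[0]


def arch_supported(gcn_arch_name: str, compiled_archs) -> bool:
    """True if the device's base architecture is in the wheel's arch list."""
    return base_arch(gcn_arch_name) in set(compiled_archs)


def report() -> None:
    """Print the relevant environment variables and per-device architecture info."""
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
        return

    for var in ("HIP_VISIBLE_DEVICES", "HSA_OVERRIDE_GFX_VERSION", "PYTORCH_ROCM_ARCH"):
        print(f"{var} = {os.environ.get(var)}")
    print("torch", torch.__version__, "| hip", getattr(torch.version, "hip", None))

    if not torch.cuda.is_available():
        print("No GPU is visible to PyTorch")
        return

    compiled = torch.cuda.get_arch_list()  # architectures baked into the wheel
    print("compiled for:", compiled)
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        # gcnArchName is assumed to exist on ROCm builds; fall back if it does not.
        arch = getattr(props, "gcnArchName", None)
        if arch is None:
            status = "architecture unknown"
        elif arch_supported(arch, compiled):
            status = "in compiled arch list"
        else:
            status = "NOT in compiled arch list"
        print(f"device {i}: {props.name} ({arch}) -> {status}")


if __name__ == "__main__":
    report()
```

If the 7900 XTX turns out to be device 0 in this listing, HIP_VISIBLE_DEVICES=1 would select a different device (for example an integrated GPU), which could explain the error on its own. As far as I know, PYTORCH_ROCM_ARCH only matters when building PyTorch from source, so it is unlikely to change the behavior of a pip wheel.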

Steps to Reproduce

  1. Install the latest PyTorch with pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4
  2. Clone examples and cd into the directory.
  3. Run python3 mnist/main.py
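
The traceback fails at the first convolution, so a smaller reproduction that runs a single Conv2d forward pass on the GPU may be enough to trigger the same error without the dataset download. This is a minimal sketch under assumptions: run_conv_check is a hypothetical helper name, and the layer shape (1 input channel, 32 output channels, 3x3 kernel) and the 28x28 input are chosen to resemble the first layer of the mnist example rather than copied from it.

```python
def run_conv_check(device_index: int = 0):
    """Run one Conv2d forward pass on the given GPU and return the output shape.

    Returns None if torch is missing or no GPU is visible to PyTorch.
    """
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
        return None

    if not torch.cuda.is_available():
        print("No GPU is visible to PyTorch")
        return None

    device = torch.device(f"cuda:{device_index}")
    conv = torch.nn.Conv2d(1, 32, 3, 1).to(device)
    x = torch.randn(1, 1, 28, 28, device=device)

    y = conv(x)
    # Force any asynchronously reported kernel error to surface here.
    torch.cuda.synchronize()
    return tuple(y.shape)


if __name__ == "__main__":
    print(run_conv_check())
```

Saved under a hypothetical name such as conv_check.py, this can be run as AMD_SERIALIZE_KERNEL=3 python3 conv_check.py, following the hint in the error message, so that the failing call is reported at the correct line.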

Failure Logs [if any]

Output of AMD_LOG_LEVEL=3 python main.py: AMD_LOG.log

SuGotLand, Feb 18 '25 14:02