Is ROCm 5.x ever planning to include gfx90c GPUs?
Hi, the official PyTorch and TensorFlow Docker images are available only for gfx900 (Vega10-type GPUs: MI25, Vega 56, Vega 64), gfx906 (Vega20-type GPUs: MI50, MI60), gfx908 (MI100), gfx90a (MI200), and gfx1030 (Navi21).
When is gfx90c support expected? Thanks
Hi, @shridharkini6!
Thanks for your request. Since I am not an employee at AMD, I have no insight into what is planned there internally. However, at least some amount of library coverage seems to be a prerequisite for extending the Docker images to this class of GPUs, which are integrated into the CPU (an "APU" in AMD's lingo). Yet I do not see any support for gfx90c as a TARGET in any of the public libraries. See my pull request for an attempt at a complete overview of the state of library support. PyTorch uses RCCL and MIOpen to run on ROCm, and so does TensorFlow. MIOpen in turn uses rocBLAS as its backend. For the available TARGETs, see the CMakeLists.txt of rocBLAS and the CMakeLists.txt of RCCL, respectively. As you can see, there is no support for gfx90c, and in fact for no other APU.
This aligns with what can be gathered from public sources, namely that AMD is focusing on the products which the hyperscalers and supercomputer customers are currently buying. I personally think this is fair enough, as those customers seem to be rather feature-sensitive. Starting from those high-profile customers, consider the following leaky pipe of support:
- Enterprise ("Instinct"-branded products intended for hyperscalers and supercomputer customers, usually sold in servers or racks)
- Professional ("Radeon PRO"-branded products intended for CAD and similar use cases, usually sold in workstations)
- Desktop ("Radeon"-branded products intended for demanding users like gamers and video editors, sold as dGPU components or pre-built systems)
- APUs ("Ryzen with Radeon Graphics"-branded products intended for lighter workloads like office PCs and thin/light laptops)
Things might change a bit with the Ryzen 7000 line of desktop processors, which is announced to include a chiplet-ish GPU in the IO die. Such an arrangement does not currently fit into this leaky support pipe, but I would also not hold my breath for any kind of revolution. My bet would be on support gradually improving, as it has (not without setbacks) in the past.
I do not think it is AMD's top priority to support an APU when even Navi 22 and Navi 23 are not supported. Also, AMD pulled the plug on supporting APUs a long time ago. So quite frankly, to answer your question, I think it is... never.
@ffleader1 That is not a very clever move from AMD, because they have nothing positioned against Nvidia's Jetson type of hardware. So we buy Nvidia APUs even though they are not very FOSS-friendly.
Here is a workaround to run PyTorch on gfx90c: build PyTorch for gfx900 and override gfx90c to gfx900 at runtime.
Build PyTorch
$ git clone https://github.com/pytorch/pytorch.git
$ cd pytorch
$ git submodule update --init --recursive
$ sudo pip3 install -r requirements.txt
$ sudo pip3 install enum34 numpy pyyaml setuptools typing cffi future hypothesis typing_extensions
$ sudo python3 tools/amd_build/build_amd.py
$ sudo PYTORCH_ROCM_ARCH=gfx900 USE_ROCM=1 MAX_JOBS=4 python3 setup.py install
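After the build finishes, one quick sanity check (my addition, not part of the original recipe) is to confirm the installed wheel was built against ROCm/HIP at all; `torch.version.hip` is a version string on ROCm builds and `None` on CPU or CUDA builds:

```python
import importlib.util
from typing import Optional

def hip_version() -> Optional[str]:
    """Return the HIP/ROCm version string PyTorch was built against,
    or None for a CPU/CUDA build (or when torch is not installed)."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    return torch.version.hip  # version string on a ROCm build, None otherwise

print("HIP version:", hip_version())
```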
Run an example
$ git clone https://github.com/pytorch/examples.git
$ cd examples/mnist
$ sudo pip3 install -r requirements.txt
$ sudo HSA_OVERRIDE_GFX_VERSION=9.0.0 python3 main.py
...
Train Epoch: 14 [51200/60000 (85%)] Loss: 0.027863
Train Epoch: 14 [51840/60000 (86%)] Loss: 0.017484
Train Epoch: 14 [52480/60000 (87%)] Loss: 0.021983
Train Epoch: 14 [53120/60000 (88%)] Loss: 0.003217
Train Epoch: 14 [53760/60000 (90%)] Loss: 0.011038
Train Epoch: 14 [54400/60000 (91%)] Loss: 0.007962
Train Epoch: 14 [55040/60000 (92%)] Loss: 0.018526
Train Epoch: 14 [55680/60000 (93%)] Loss: 0.001039
Train Epoch: 14 [56320/60000 (94%)] Loss: 0.017513
Train Epoch: 14 [56960/60000 (95%)] Loss: 0.028949
Train Epoch: 14 [57600/60000 (96%)] Loss: 0.028286
Train Epoch: 14 [58240/60000 (97%)] Loss: 0.064388
Train Epoch: 14 [58880/60000 (98%)] Loss: 0.002042
Train Epoch: 14 [59520/60000 (99%)] Loss: 0.002829
Test set: Average loss: 0.0280, Accuracy: 9921/10000 (99%)
Notes:
1. Disable some power features for gfx90c: sudo modprobe amdgpu ppfeaturemask=0xfff73fff
2. ROCm: https://docs.amd.com/bundle/ROCm-Downloads-Guide-v5.0/page/ROCm_Installation.html
3. PyTorch branch: master, commit: 815532d40c25e81d8c09b3c36403016bea394aee
You can also use the PyTorch Docker image on gfx90c. Just run it like this. @shridharkini6
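Instead of prefixing each command with HSA_OVERRIDE_GFX_VERSION=9.0.0, the override can also be set from inside Python, as long as it happens before torch is imported (a sketch of mine; the key point is that the variable must be in the process environment before the ROCm runtime initializes):

```python
import os

# HSA_OVERRIDE_GFX_VERSION must be set before torch (and thus the ROCm
# runtime) is imported; setting it after the import has no effect.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "9.0.0")

# On a machine with a ROCm build of PyTorch installed, you could now do:
# import torch
# torch.cuda.is_available()
print(os.environ["HSA_OVERRIDE_GFX_VERSION"])
```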
$ git clone https://github.com/pytorch/examples.git
$ cd examples/mnist
$ pip3 install -r requirements.txt
$ HSA_OVERRIDE_GFX_VERSION=9.0.0 python3 main.py
Note: Your video memory should be at least 2GB.
Maybe you have not tried it, but do you at least think your method will work with unsupported GPUs, like gfx1031 for example?
You may try; run it like this.
$ HSA_OVERRIDE_GFX_VERSION=10.3.0 python3 main.py
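As an aside, these override strings follow the gfx target naming used by LLVM/ROCm: the last two characters of the target name are the minor version and stepping as hex digits, and the leading digits are the decimal major version. A hypothetical helper of mine (not part of ROCm) to compute the dotted form:

```python
def gfx_to_override(target: str) -> str:
    """Convert an LLVM gfx target name (e.g. 'gfx900', 'gfx1030') into
    the dotted form that HSA_OVERRIDE_GFX_VERSION expects.

    The last two characters are minor and stepping (hex digits);
    everything before them is the decimal major version.
    """
    body = target.removeprefix("gfx")
    major, minor, step = body[:-2], body[-2], body[-1]
    return f"{int(major)}.{int(minor, 16)}.{int(step, 16)}"

print(gfx_to_override("gfx1030"))  # -> 10.3.0
print(gfx_to_override("gfx900"))   # -> 9.0.0
```

Note that the workaround in this thread deliberately overrides gfx90c (which would be 9.0.12) to gfx900's 9.0.0: you pass the version of the supported target you want to masquerade as, not your own.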
Wait, I am a bit confused. Maybe I am missing something, but your example is about running a PyTorch example, right? But how do you get ROCm to install on gfx90c or gfx1031 in the first place? Thank you.
1. Docker with PyTorch and ROCm installed: https://docs.amd.com/bundle/AMD-Deep-Learning-Guide-v5.1.3/page/Deep_Learning_Frameworks.html
2. ROCm installation guide: https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.0/page/Overview_of_ROCm_Installation_Methods.html
I have not tried Docker, but for ROCm I am pretty sure the install will only be successful if your GPU is supported. I.e. the ROCm installation will not work on a gfx1031 or lower.
@xfyucg I followed your methods; it looks to me like training is using only the CPU, not the GPU.
import torch
t = torch.tensor([5, 5, 5], dtype=torch.int64, device='cuda')
throws an error like
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 216, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
Thanks
Run it like this; that works well on my Cezanne platform.
lang@lang-test:~/Videos/pytorch$ HSA_OVERRIDE_GFX_VERSION=9.0.0 python3
Python 3.8.10 (default, Mar 15 2022, 12:22:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> t = torch.tensor([5, 5, 5], dtype=torch.int64, device='cuda')
>>>
Tried this as well... ended up with the same error.
@shridharkini6
Can you put the output of $ rocminfo here?
ROCk module is loaded
HSA System Attributes
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
Agent 1
Name: AMD Ryzen 7 4700U with Radeon Graphics
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 4700U with Radeon Graphics
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2000
BDFID: 0
Internal Node ID: 0
Compute Unit: 8
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 7612028(0x74267c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 7612028(0x74267c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 7612028(0x74267c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 2
Name: gfx90c
Uuid: GPU-XX
Marketing Name:
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 1024(0x400) KB
Chip ID: 5686(0x1636)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1600
BDFID: 1024
Internal Node ID: 1
Compute Unit: 7
SIMDs per CU: 4
Shader Engines: 1
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 524288(0x80000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx90c:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
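If you only need the GPU target out of such a dump, the relevant line is `Name: gfx90c` under the GPU agent. A small illustrative parser of mine (not an official ROCm tool) that pulls the gfx targets out of rocminfo output:

```python
import re

def gpu_targets(rocminfo_text: str) -> list:
    """Extract gfx targets from rocminfo output.

    GPU agents report their target as a 'Name: gfxNNN' line; CPU agents
    report their marketing name instead, so they do not match."""
    return re.findall(r"^\s*Name:\s*(gfx\w+)\s*$", rocminfo_text, re.MULTILINE)

sample = """
  Name:                    AMD Ryzen 7 4700U with Radeon Graphics
  Name:                    gfx90c
"""
print(gpu_targets(sample))  # -> ['gfx90c']
```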
@shridharkini6 Are you using Docker? If yes, try to start your container like this.
sudo docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
I have tried the same... used the rocm/pytorch:latest-base Docker image.
According to https://docs.amd.com/bundle/AMD-Deep-Learning-Guide-v5.1.3/page/Deep_Learning_Frameworks.html, Option 3 ("Install PyTorch Using PyTorch ROCm Base Docker Image"):
docker pull rocm/pytorch:latest-base
NOTE: This will download the base container, which does not contain PyTorch.
So please use rocm/pytorch:latest instead:
docker pull rocm/pytorch:latest
docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
sudo modprobe amdgpu ppfeaturemask=0xfff73fff
HSA_OVERRIDE_GFX_VERSION=9.0.0 python3
His hardware is not supported, and neither is yours, I think. APUs in general do not work. Docker won't change unsatisfied hardware prerequisites.
No, gfx90c uses the same ISA as gfx900. So for gfx90c, just override it to gfx900; that actually works. He used rocm/pytorch:latest-base, so he had to build PyTorch for ROCm himself.
@xfyucg I have followed all the procedures you suggested, i.e. used rocm/pytorch:latest-base and compiled PyTorch from source, but I get the same error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 216, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
Maybe there are some environment issues; that is hard to debug. It's error-prone to build PyTorch yourself. Why not use rocm/pytorch:latest? It's simpler and also the recommended way.
@xfyucg Yes, I tried with rocm/pytorch:latest also; it throws similar errors. I suspect it could be an issue with the base libraries, as @Bengt mentioned.
No. If you install and start the Docker container (rocm/pytorch:latest) correctly, you will get an error like the following.
root@0f962c3a9d38:/var/lib/jenkins# python3
Python 3.7.13 (default, Mar 29 2022, 02:18:16)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
Aborted (core dumped)
root@0f962c3a9d38:~#
After overriding gfx90c to gfx900:
root@0f962c3a9d38:/var/lib/jenkins# HSA_OVERRIDE_GFX_VERSION=9.0.0 python3
Python 3.7.13 (default, Mar 29 2022, 02:18:16)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>>
Make sure the amdgpu kernel-mode driver is installed. If you use a generic kernel on Ubuntu 20.04, install it as follows.
sudo apt-get update
wget https://repo.radeon.com/amdgpu-install/22.10.3/ubuntu/focal/amdgpu-install_22.10.3.50103-1_all.deb
sudo apt-get install ./amdgpu-install_22.10.3.50103-1_all.deb
amdgpu-install --usecase=dkms
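Before digging further, it may also help to confirm that the driver actually created the device nodes which the docker run flags above (--device=/dev/kfd --device=/dev/dri) pass through. A minimal sketch of mine that only checks for their presence:

```python
import os

def rocm_device_nodes_present() -> bool:
    """Check for the device nodes passed into the container via
    --device=/dev/kfd --device=/dev/dri; if they are absent on the
    host, the amdgpu kernel-mode driver is most likely not loaded."""
    return os.path.exists("/dev/kfd") and os.path.isdir("/dev/dri")

print("ROCm device nodes present:", rocm_device_nodes_present())
```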
Try updating your system's kernel to a version newer than 6.0 and run the commands with the following environment variable set:
HSA_OVERRIDE_GFX_VERSION=9.0.0
You can use export HSA_OVERRIDE_GFX_VERSION=9.0.0 in the shell you are running the commands in to propagate the environment variable to child processes. That is what allowed the rocm/pytorch container to not crash on import or when doing simple tensor operations like torch.tensor([[1,2],[3,4]]).to(torch.device('cuda')).
I tested this on NixOS, branch 22.11, kernel 6.0.13, and the latest rocm/pytorch container with a Ryzen 5600G.
CC @hongxiayang
@shridharkini6 Hi, is your issue resolved on the latest ROCm? If so, can we close this ticket?
Is this still applicable to the latest ROCm?
@shridharkini6 Unfortunately your APU (gfx90c) is not currently supported in the latest ROCm. Thanks!