ROCm
7900 XTX Refuses to Run tensorflow-rocm Toy Example
Issue Type
Bug
Tensorflow Version
Tensorflow-rocm v2.11.0-3797-gfe65ef3bbcf 2.11.0
rocm Version
5.4.1
Custom Code
Yes
OS Platform and Distribution
Archlinux: Kernel 6.1.1
Python version
3.10
GPU model and memory
7900 XTX 24GB
Current Behaviour?
I am not entirely sure whether this is an upstream (ROCm) issue or one with tensorflow-rocm specifically, so I am reporting it to both repos. A toy example refuses to run and dumps core. I would have expected it to train successfully.
Standalone code to reproduce the issue
import tensorflow as tf
import numpy as np

features = np.random.randn(10000, 25)
targets = np.random.randn(10000)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1)
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.MeanSquaredError())
model.fit(x=features, y=targets)
Relevant log output
[jaap@Jaap-Desktop code]$ pipenv run python testNN.py
2022-12-24 11:18:37.178811: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
python: /build/hsa-rocr/src/ROCR-Runtime-rocm-5.4.1/src/core/runtime/amd_gpu_agent.cpp:339: void rocr::AMD::GpuAgent::AssembleShader(const char*, AssembleTarget, void*&, size_t&) const: Assertion `code_buf != NULL && "Code buffer allocation failed"' failed.
It's probably a packaging issue for Arch. Try opencl-amd and opencl-amd-dev from the AUR and see if it makes a difference.
P.S. damn, that GPU must be a beast 💯
Unfortunately that doesn't seem to work. First it tries to remove the conflicting packages:
:: opencl-amd and rocm-opencl-runtime are in conflict. Remove rocm-opencl-runtime? [y/N] y
:: opencl-amd and hip-runtime-amd are in conflict (hip). Remove hip-runtime-amd? [y/N] y
However, answering Y to both questions still results in a failure to install:
error: failed to commit transaction (conflicting files)
opencl-amd: /opt/rocm exists in filesystem
Errors occurred, no packages were upgraded.
-> exit status 1
Are you sure these packages are even required, though? From what I understand, tensorflow-rocm does NOT use OpenCL at all. As a matter of fact, I upgraded from a 6900 XT, which was able to run tensorflow-rocm just fine with the exact same packages I currently have installed.
The package name is just that for historical reasons; it has nothing to do with OpenCL. The reason you get these conflict errors is that the package isn't handling the conflicts properly yet. That's something I will try to fix soon, but it's not there yet. So for now you have to manually remove any rocm-arch packages yourself if you want to try opencl-amd.
P.S. I don't want to spam the ROCm issue tracker with Arch packaging comments, so if you are still interested in trying it, feel free to comment on the AUR page and we can continue the discussion there.
I just uninstalled all previous rocm packages and went with opencl-amd + opencl-amd-dev, but that just makes the example run on the CPU rather than the GPU. So unfortunately it does not fix the issue at hand. Any ideas? :)
I guess it's because your GPU is not supported yet in ROCm. I ran your example with my 5700 XT and it's working fine (although it didn't complete in 10 minutes and I had to cancel it). Maybe you can try HSA_OVERRIDE_GFX_VERSION=10.3.0 python sample.py or something similar.
That just makes it crash with an out-of-memory error, which is bogus for such a small example on a card with 24 GB of memory:
[jaap@Jaap-Desktop code]$ HSA_OVERRIDE_GFX_VERSION=10.3.0 pipenv run python testNN.py
2022-12-25 20:49:47.446031: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-25 20:49:48.428818: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-25 20:49:48.466946: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-25 20:49:48.466999: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-25 20:49:48.467209: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-25 20:49:48.468937: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-25 20:49:48.469011: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-25 20:49:48.469044: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-25 20:49:48.469138: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-25 20:49:48.469176: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-25 20:49:48.469209: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-25 20:49:48.469229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 24060 MB memory: -> device: 0, name: AMD Radeon Graphics, pci bus id: 0000:2d:00.0
2022-12-25 20:49:48.492206: E tensorflow/compiler/xla/stream_executor/rocm/rocm_driver.cc:573] could not allocate ROCM stream for device 0: HIP_ERROR_OutOfMemory
2022-12-25 20:49:48.492218: I tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:791] failed to allocate stream; live stream count: 1
2022-12-25 20:49:48.492221: E tensorflow/compiler/xla/stream_executor/stream.cc:297] failed to allocate stream during initialization
2022-12-25 20:49:48.512792: E tensorflow/compiler/xla/stream_executor/rocm/rocm_driver.cc:573] could not allocate ROCM stream for device 0: HIP_ERROR_OutOfMemory
2022-12-25 20:49:48.512801: I tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:791] failed to allocate stream; live stream count: 1
2022-12-25 20:49:48.512804: E tensorflow/compiler/xla/stream_executor/stream.cc:297] failed to allocate stream during initialization
2022-12-25 20:49:48.512811: I tensorflow/compiler/xla/stream_executor/stream.cc:1038] [stream=0x55edc5775fb0,impl=0x55edc5775770] did not wait for [stream=0x55edc5794720,impl=0x55edc5775e20]
2022-12-25 20:49:48.512815: I tensorflow/compiler/xla/stream_executor/stream.cc:1038] [stream=0x55edc5794720,impl=0x55edc5775e20] did not wait for [stream=0x55edc5775fb0,impl=0x55edc5775770]
2022-12-25 20:49:48.533248: E tensorflow/compiler/xla/stream_executor/rocm/rocm_driver.cc:573] could not allocate ROCM stream for device 0: HIP_ERROR_OutOfMemory
2022-12-25 20:49:48.533265: I tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:791] failed to allocate stream; live stream count: 1
2022-12-25 20:49:48.533270: E tensorflow/compiler/xla/stream_executor/stream.cc:297] failed to allocate stream during initialization
2022-12-25 20:49:48.553530: E tensorflow/compiler/xla/stream_executor/rocm/rocm_driver.cc:573] could not allocate ROCM stream for device 0: HIP_ERROR_OutOfMemory
2022-12-25 20:49:48.553539: I tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:791] failed to allocate stream; live stream count: 1
2022-12-25 20:49:48.553543: E tensorflow/compiler/xla/stream_executor/stream.cc:297] failed to allocate stream during initialization
2022-12-25 20:49:48.573939: E tensorflow/compiler/xla/stream_executor/rocm/rocm_driver.cc:573] could not allocate ROCM stream for device 0: HIP_ERROR_OutOfMemory
2022-12-25 20:49:48.573949: I tensorflow/compiler/xla/stream_executor/stream_executor_pimpl.cc:791] failed to allocate stream; live stream count: 1
2022-12-25 20:49:48.573953: E tensorflow/compiler/xla/stream_executor/stream.cc:297] failed to allocate stream during initialization
2022-12-25 20:49:48.582885: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-25 20:49:48.668485: I tensorflow/compiler/xla/stream_executor/stream.cc:1038] [stream=0x55edc57947d0,impl=0x55edc5794920] did not wait for [stream=0x55edc5775fb0,impl=0x55edc5775770]
The 7900 XTX with ROCm would be awesome!!!! @Mushoz did you get it working yet? I have the same use case.
The problem also occurs with the 7900 XT, also with the Arch Linux rocm packages from the AUR. Is there anything that can be done to make it run?
Edit: I reproduced the same output with samples/0_Intro/bit_extract from https://github.com/ROCm-Developer-Tools/HIP.git as an easier minimal example.
So does this mean the problem only exists on Arch Linux, and not on Ubuntu or Debian?
Installing opencl-amd and opencl-amd-dev seems to work for me.
@Mushoz Did you install llvm version >= 15? (Arch still ships 14.)
You can also have a look at: https://www.phoronix.com/review/rx7900xt-rx7900xtx-linux https://www.reddit.com/r/linux_gaming/comments/zk0462/amd_radeon_rx_7900_xtx_rx_7900_xt_linux_support/
There it states what is needed:
- llvm >= 15
- new mesa version compiled against llvm >= 15
- the firmware needed to be added manually, but I think it is now already included (at least in arch)
@jannesklee I am running llvm-minimal-git. Everything is working as it should game-wise. It's just that rocm is broken. Are you able to run the example in my first post just fine? And you are certain it's running on the GPU and not the CPU? Could you run the following python script and show the output?
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))
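(On a working setup this should print one PhysicalDevice entry with device_type='GPU'; an empty list means TensorFlow only sees the CPU.)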
I got the same error when testing the minimal example shown above (and other samples), and it vanished when I used the other packages. When I check usage with nvtop, it shows that the dedicated graphics card is in use.
Maybe the llvm-minimal-git version is not enough. At https://aur.archlinux.org/pkgbase/llvm-git, Lone_Wolf states that llvm-minimal-git focuses on providing the stuff needed for AUR mesa-git and doesn't support cross-compiling or any bindings for external stuff like OCaml & Python.
Unfortunately I am currently unable to install tensorflow because I get compilation errors, but that is something else, I guess. I am trying to make it run, so far without success.
@jannesklee no need to compile tensorflow. You can install tensorflow-rocm via pip, or via pipenv if you want to keep it contained within its own virtual environment. Would you mind running the script I mentioned earlier?
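For reference, installing it into a pipenv-managed project should just be something like the following (assuming a working pip/pipenv setup on your side):
pipenv install tensorflow-rocm
or, with plain pip inside a virtualenv:
pip install tensorflow-rocm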
My output is below. I do not completely understand it, to be honest.
python samply.py
2022-12-29 19:22:12.450100: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-29 19:22:12.510354: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2022-12-29 19:22:13.488612: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-29 19:22:13.488649: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-29 19:22:13.521375: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-29 19:22:13.521408: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-29 19:22:13.521421: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-29 19:22:13.521438: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1990] Ignoring visible gpu device (device: 0, name: AMD Radeon Graphics, pci bus id: 0000:16:00.0) with AMDGPU version : gfx1100. The supported AMDGPU versions are gfx1030, gfx900, gfx906, gfx908, gfx90a.
2022-12-29 19:22:13.521448: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:843] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-29 19:22:13.521454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1990] Ignoring visible gpu device (device: 1, name: AMD Radeon Graphics, pci bus id: 0000:38:00.0) with AMDGPU version : gfx1036. The supported AMDGPU versions are gfx1030, gfx900, gfx906, gfx908, gfx90a.
2022-12-29 19:22:13.521638: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-29 19:22:13.531205: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.531933: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.532580: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.534466: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.534728: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.535002: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.537283: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.538129: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.540706: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.541347: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.546382: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.546865: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.551241: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.551819: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.552166: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.552624: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.555412: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.555920: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.556342: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.556773: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.557366: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.558349: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.558775: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.559037: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.562307: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.565317: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.567121: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.579977: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.580617: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.581283: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.581875: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.583109: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.583446: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.583800: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.584342: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.584655: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.683227: E tensorflow/core/framework/node_def_util.cc:675] NodeDef mentions attribute grad_a which is not in the op definition: Op<name=_MklMatMul; signature=a:T, b:T -> product:T; attr=transpose_a:bool,default=false; attr=transpose_b:bool,default=false; attr=T:type,allowed=[DT_BFLOAT16, DT_FLOAT]> This may be expected if your graph generating binary is newer than this binary. Unknown attributes will be ignored. NodeDef: {{node gradient_tape/sequential/dense/MatMul/MatMul}}
2022-12-29 19:22:13.684957: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
190/313 [=================>............] - ETA: 0s - loss: 2.4990 2022-12-29 19:22:13.768593: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.768931: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2022-12-29 19:22:13.769222: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
313/313 [==============================] - 0s 257us/step - loss: 2.1766
Support for this GPU is not enabled on ROCm 5.4.1. Please await the 5.5.0 release announcement to check for support.
When can we expect a release of 5.5.0? Is there any date scheduled?
@jannesklee I have the same output. Unfortunately it specifically states that it is ignoring the GPU because it is unsupported.
@saadrahim when can we expect 5.5.0 to release? CUDA is so much easier in this regard; it just works. For ROCm to be able to compete with CUDA, it really has to step up in terms of communication, so that users can rely on ROCm the way they can on CUDA.
I'm a bit surprised that you're having trouble with ROCm 5.4.1 on the 7900 XTX, as that architecture is gfx1100 and most of the AMD-provided binaries for ROCm 5.4.1 contain gfx1100 code objects. It's not listed as officially supported in the GPU support table for ROCm 5.4, but I would have expected it would mostly work anyway. Is this problem specific to Tensorflow? e.g., do other libraries packaged by Arch work? A quick check might be to build and run Arch's test.cpp for rocrand.
I guess it's because your GPU is not supported yet in ROCm. I ran your example with my 5700 XT and it's working fine (although it didn't complete in 10 minutes and I had to cancel it). Maybe you can try HSA_OVERRIDE_GFX_VERSION=10.3.0 python sample.py or something similar.
When you set HSA_OVERRIDE_GFX_VERSION=10.3.0, you're telling libhsakmt to pretend that your GPU is Navi 21 (gfx1030). To my knowledge, that works just fine for all the RDNA 2 GPUs, since they all use the same instruction set.
The RDNA 1 instruction sets are similar enough to the RDNA 2 instruction set that sometimes you can successfully run code that was compiled for RDNA 2 on an RDNA 1 GPU (as you are doing with your 5700 XT), however, this is not guaranteed to work. The instruction sets are not identical and if the code you're running happens to use an RDNA 2 instruction that worked differently in RDNA 1 (or doesn't exist at all in RDNA 1), then your program may not function correctly.
Similarly, the RDNA 3 instruction sets are different from the RDNA 2 instruction set. If you try to run code compiled for RDNA 2 on an RDNA 3 GPU using HSA_OVERRIDE_GFX_VERSION, the result may not work correctly.
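If you want to double-check which ISA your card actually reports before overriding anything, something like the following should show the gfx target (assuming the rocminfo utility is installed):
rocminfo | grep gfx
On the 7900 XTX this lists gfx1100, which is why a gfx1030 override is a mismatch at the ISA level.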
My assumption is also that it is a problem on the TensorFlow side. I tested the samples from https://github.com/ROCm-Developer-Tools/HIP above.
Example bit_extract:
make
./bit_extract
gives me
info: running on device #0 AMD Radeon Graphics
info: allocate host mem ( 7.63 MB)
info: allocate device mem ( 7.63 MB)
info: copy Host2Device
info: launch 'bit_extract_kernel'
info: copy Device2Host
info: check result
PASSED!
I can also see some activity with nvtop, but unfortunately I do not know exactly how to give more details here.
Regarding your example, I unfortunately get a core dump when running ./test.sh:
In file included from test.cpp:1:
In file included from /opt/rocm-5.4.1/include/hiprand/hiprand.hpp:35:
In file included from /opt/rocm-5.4.1/include/hiprand/hiprand_kernel.h:54:
In file included from /opt/rocm-5.4.1/include/hiprand/hiprand_kernel_hcc.h:37:
In file included from /opt/rocm-5.4.1/include/rocrand/rocrand_kernel.h:28:
/opt/rocm-5.4.1/include/rocrand/rocrand_common.h:74:6: warning: "Disabled inline asm, because the build target does not support it." [-W#warnings]
#warning "Disabled inline asm, because the build target does not support it."
^
1 warning generated when compiling for gfx1036.
In file included from test.cpp:1:
In file included from /opt/rocm-5.4.1/include/hiprand/hiprand.hpp:35:
In file included from /opt/rocm-5.4.1/include/hiprand/hiprand_kernel.h:54:
In file included from /opt/rocm-5.4.1/include/hiprand/hiprand_kernel_hcc.h:37:
In file included from /opt/rocm-5.4.1/include/rocrand/rocrand_kernel.h:28:
/opt/rocm-5.4.1/include/rocrand/rocrand_common.h:74:6: warning: "Disabled inline asm, because the build target does not support it." [-W#warnings]
#warning "Disabled inline asm, because the build target does not support it."
^
1 warning generated when compiling for gfx1100.
./test.sh: line 5: 7225 Segmentation fault (core dumped) "$OUT"/test
@jannesklee I am not so sure. @saadrahim specifically stated that ROCm 5.5.0 is required for these cards to run tensorflow. I am also not surprised you are able to run that HIP example; there is some preliminary support for the 7900 series, given that Blender can also use the HIP backend just fine: https://www.phoronix.com/review/rx7900-blender-opencl
That has me thinking though. It would be interesting to see if pytorch-rocm is able to run. I can see that there are docker images available, and some tags are using rocm 5.4.1. That would take packaging issues AND tensorflow out of the equation, and would allow us to see if these cards are able to do any machine learning with the current rocm stack. I might try this out tonight.
Docker images in case you want to give it a shot: https://hub.docker.com/r/rocm/pytorch/tags
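Pulling and starting one of those images would presumably look roughly like this (the tag is a placeholder; pick one of the rocm 5.4.1-based tags from the page above, and the device/group flags follow the usual ROCm container setup):
docker pull rocm/pytorch:<tag>
docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/pytorch:<tag>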
@jannesklee did it work?
@Mushoz pytorch-rocm doesn't appear to work, either. Can't find the GPU at all by default and segfaults with HSA_OVERRIDE_GFX_VERSION set.
@wsippel Ah, I just replied to you on the AUR but only just now realized you are active here as well. A week ago changes for RDNA3 were merged for MIOpen: https://github.com/ROCmSoftwarePlatform/MIOpen/commits/develop
See the 11th of January. Do you reckon we could get it to work by compiling MIOpen from source?
@wsippel @Mushoz I can confirm that, with some effort, a build of pytorch 1.13.1 against an AMD RX 7900 XTX with ROCm 5.4.2 works and is functional for my use case of running models.
Rough outline for the build: use an Ubuntu (20.04/22.04) Docker image, as AMD provides ROCm repos for it, and install all required deps without the kernel module. See https://github.com/ROCmSoftwarePlatform/MIOpen/blob/develop/Dockerfile#L67: basically edit 5.3 to 5.4.2 and run all commands up to line 67. I also adapted the amdgpu install command to amdgpu-install -y --usecase=graphics,rocm,lrt,hip,hiplibsdk --no-dkms as some libs were missing for the torch build.
Maybe you can build tensorflow via the instructions from https://www.tensorflow.org/install/source, adapting the build command to (in a venv): TF_NEED_ROCM=1 python configure.py
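For completeness, the ROCm flavour of that source build boils down to something like the following (an untested sketch on my side, assuming bazel and the ROCm toolchain are already installed):
TF_NEED_ROCM=1 python configure.py
bazel build --config=rocm //tensorflow/tools/pip_package:build_pip_package
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow*.whl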
@Kardi5 Would you mind sharing the final dockerfile that you used? I would love to try and replicate that for Tensorflow. Please leave in all the pytorch specific things as well. I will try to do something similar for Tensorflow.
@Mushoz Sure, but I don't have a complete one myself right now. It was more of an interactive trial and error until all builds worked out. I hope to create a complete dockerfile tonight/tomorrow based on the notes I took.
This issue also affects Gentoo when installing ROCm via Portage. Installing dev-libs/rocm-opencl-runtime, which currently defaults to the older 5.3.3, will cause clinfo to raise the OP's error:
clinfo: /var/tmp/portage/dev-libs/rocr-runtime-5.3.3/work/ROCR-Runtime-rocm-5.3.3/src/core/runtime/amd_gpu_agent.cpp:339: void rocr::AMD::GpuAgent::AssembleShader(const char
*, rocr::AMD::GpuAgent::AssembleTarget, void *&, size_t &) const: Assertion `code_buf != NULL && "Code buffer allocation failed"' failed.
Aborted (core dumped)
I'm rather certain that this particular error is not related to TensorFlow or MIOpen, as I was able to repro the error above with only a basic installation of the ROCm OpenCL runtime and friends.
The changes from ROCR 5.4.1 to 5.4.2 have not been downstreamed to GitHub yet, making it tricky to reproduce the workaround @Kardi5 proposed for other distros. I guess I'll try with 5.4.1 for now.
@Mushoz So far I could only create a rough draft of a complete Dockerfile. Maybe you will find it useful nonetheless.
The current main problem is that my compilation of Magma shows a lot of errored calls to ROCm, since during docker build I cannot attach any devices the way I can during docker run.
Over at https://github.com/pytorch/pytorch/blob/master/.circleci/docker/ubuntu-rocm/Dockerfile there is a more complete example, even though it is much more complex. Their Magma build script (https://github.com/pytorch/pytorch/blob/master/.circleci/docker/common/install_rocm_magma.sh) might be the solution to my troubles, but I did not have time to look through it in more detail.
There might still be errors besides the Magma build after the line WORKDIR /build/magma/build
Draft Torch + Torchvision Dockerfile
FROM ubuntu:22.04
### START SECTION AMD ROCm install
# based on https://github.com/ROCmSoftwarePlatform/MIOpen/blob/develop/Dockerfile
ARG DEBIAN_FRONTEND=noninteractive
ARG USE_MLIR="OFF"
# Support multiarch
RUN dpkg --add-architecture i386
# Install preliminary dependencies
RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
apt-utils \
ca-certificates \
curl \
libnuma-dev \
gnupg2 \
wget
#Add gpg keys
ENV APT_KEY_DONT_WARN_ON_DANGEROUS_USAGE=DontWarn
RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 9386B48A1A693C5C && \
wget -q -O - https://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -
# Check the AMD repo for exact package name
RUN wget https://repo.radeon.com/amdgpu-install/5.4.2/ubuntu/jammy/amdgpu-install_5.4.50402-1_all.deb --no-check-certificate
RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
./amdgpu-install_5.4.50402-1_all.deb
# Add rocm repository
# Note: The ROCm version with $USE_MLIR should keep in sync with default ROCm version
# unless MLIR library is incompatible with current ROCm.
RUN export ROCM_APT_VER=5.4.2;\
echo $ROCM_APT_VER &&\
sh -c 'echo deb [arch=amd64 trusted=yes] http://repo.radeon.com/rocm/apt/$ROCM_APT_VER/ ubuntu main > /etc/apt/sources.list.d/rocm.list'
RUN sh -c "echo deb http://mirrors.kernel.org/ubuntu jammy main universe | tee -a /etc/apt/sources.list"
RUN amdgpu-install -y --usecase=rocm,graphics,rocmdev,rocmdevtools,lrt,hip,hiplibsdk,mllib,mlsdk --no-dkms
# Install dependencies
RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
build-essential \
cmake \
clang-format-12 \
doxygen \
gdb \
git \
lcov \
libncurses5-dev \
llvm-amdgpu \
miopengemm \
pkg-config \
python3-dev \
python3-pip \
python3-venv \
rocblas \
rpm \
software-properties-common
# Setup ubsan environment to printstacktrace
ENV UBSAN_OPTIONS=print_stacktrace=1
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
### START SECTION install Magma (torch dep) and PyTorch deps
# For Magma
RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
libmkl-core libmkl-def libmkl-dev libmkl-full-dev libmkl-intel-thread libmkl-gnu-thread gfortran
# For PyTorch
RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt install -y --no-install-recommends --allow-unauthenticated \
build-essential ca-certificates ccache cmake curl git libjpeg-dev libpng-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
### START SECTION Magma and Torch build
RUN useradd -m -G video -U --shell /bin/bash roc && \
mkdir /build && \
chown roc:roc /build
USER roc
WORKDIR /build
# Download latest Magma version: http://icl.utk.edu/projectsfiles/magma/downloads/
# Install steps found here: https://salsa.debian.org/science-team/magma/-/tree/master/
RUN wget -qnc "https://icl.utk.edu/projectsfiles/magma/downloads/magma-2.7.0.tar.gz" -O "magma.tar.gz" && \
tar -xzf magma.tar.gz && \
rm magma.tar.gz && \
mv magma* magma && \
mkdir magma/build
WORKDIR /build/magma/build
# ERRORS START HERE, RUN THE REST OF THIS INTERACTIVELY
# You may want to adopt gfx1100 to something else: https://llvm.org/docs/AMDGPUUsage.html#processors search gfx11
RUN cmake -DMAGMA_ENABLE_HIP=ON -DCMAKE_CXX_COMPILER=hipcc -DGPU_TARGET='gfx1100' .. && \
make -j $(nproc)
USER root
RUN make install
USER roc
WORKDIR /build
# Clone pytorch at build time so the following steps have something to build
RUN git clone -j 4 --recursive https://github.com/pytorch/pytorch --depth 1 --branch v1.13.1
# Build of Torch based on: https://github.com/pytorch/pytorch/blob/master/Dockerfile
# Miniconda is experimental here, maybe use Anaconda if run interactively
RUN curl -fsSL -v -o ~/miniconda.sh -O "https://repo.anaconda.com/miniconda/Miniconda3-py39_22.11.1-1-Linux-x86_64.sh" && \
chmod +x ~/miniconda.sh && \
~/miniconda.sh -b -p /opt/conda && \
rm ~/miniconda.sh && \
/opt/conda/bin/conda install -y python=3.9 cmake conda-build pyyaml numpy ipython && \
/opt/conda/bin/python -mpip install -r /build/pytorch/requirements.txt && \
/opt/conda/bin/conda install -y ninja cffi dataclasses && \
/opt/conda/bin/conda install -y mkl mkl-include && \
/opt/conda/bin/conda clean -ya
# Put the conda python/pip on PATH for the remaining build steps (mirrors the upstream pytorch Dockerfile)
ENV PATH=/opt/conda/bin:$PATH
WORKDIR /build/pytorch
RUN python tools/amd_build/build_amd.py
# Export the CMake prefix path, then run the ROCm build of pytorch
RUN --mount=type=cache,target=/opt/ccache \
export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../",/usr/local/magma/ && \
PYTORCH_ROCM_ARCH=gfx1100 USE_MAGMA=1 USE_ROCM=1 USE_NVCC=0 USE_CUDA=0 python setup.py install
# Test build of Torch
# Should print: True Radeon RX 7900 XTX
RUN python3 -c 'import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(torch.cuda.current_device()))'
# Torchvision build
WORKDIR /build
RUN git clone --recursive https://github.com/pytorch/vision --depth 1 --branch v0.14.1
WORKDIR /build/vision
RUN python setup.py install
WORKDIR /build
RUN rm -rf pytorch && rm -rf vision
Build with docker build . -t rocmbuild:1
Run interactively with:
docker run -d --network=host --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 8G rocmbuild:1 sleep 400000
(hacky, but it works; you may also want to mount some volumes)
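Once the container is up, you can attach to it and repeat the smoke test from the Dockerfile, for example (the container ID is whatever docker ps reports):
docker exec -it <container-id> python3 -c 'import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(torch.cuda.current_device()))'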
Can confirm that with HSA_OVERRIDE_GFX_VERSION=10.3.0 the issue seems to go away on Gentoo when unmasking the currently still pre-experimental Clang/LLVM 16 toolchain and adjusting the 5.3.3 ebuilds to the following package versions:
rocr-runtime-5.4.1 # 5.4.2 not yet available.
roct-thunk-interface-5.4.2
rocm-opencl-runtime-5.4.2
rocm-comgr-5.4.2
rocm-device-libs-5.4.2
So this issue should originate from one of these libraries.
The downside is that the Gentoo Clang 16 toolchain is not able to build Mesa due to an RTTI flag mismatch, so current usability may be limited. That's either a Gentoo or Mesa bug, though.
@Kardi5 I can confirm this works. Thank you!
There are some mysterious bugs though (e.g. NaN tensors randomly popping up); I'll look further into them when I have time.