xla: XLA_TARGET=rocm compilation failed with "crosstool_wrapper_driver_is_not_gcc failed"

Hi, I've been trying to get GPU support running, but compilation keeps failing. I started from the setup described in a related issue and added the dependencies like this:

# mix.exs
      {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
      {:exla, github: "elixir-nx/nx", sparse: "exla", override: true}

I made sure to install the dependencies mentioned in this comment (adjusted for Arch Linux):

$ sudo pacman -S miopen-hip hipfft rocrand \
    hipsparse hipsolver rccl hip-runtime-amd \
    rocfft roctracer hipblas rocm-device-libs rocsolver rocblas

Then I tried to compile it with:

$ XLA_BUILD=true XLA_TARGET=rocm mix compile

Compilation logs


==> xla
Compiling 2 files (.ex)
Generated xla app
mkdir -p /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb &&
cd /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb &&
git init &&
git remote add origin https://github.com/openxla/xla.git &&
git fetch --depth 1 origin 771e38178340cbaaef8ff20f44da5407c15092cb &&
git checkout FETCH_HEAD &&
rm /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelversion
Initialized empty Git repository in /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.git/
From https://github.com/openxla/xla
 * branch            771e38178340cbaaef8ff20f44da5407c15092cb -> FETCH_HEAD
Note: switching to 'FETCH_HEAD'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:
git switch -c <new-branch-name>
Or undo this operation with:
git switch -
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at 771e381 [XLA:GPU] Check tensor_float_32_execution_enabled() in Triton codegen too
rm -f /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/xla/extension &&
ln -s "/hdd/programming/elixir/fusemega/deps/xla/extension" /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/xla/extension &&
cd /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb &&
bazel build --define "framework_shared_object=false" -c opt   --config=rocm --action_env=HIP_PLATFORM=hcc --action_env=TF_ROCM_AMDGPU_TARGETS="gfx900,gfx906,gfx908,gfx90a,gfx1030" //xla/extension:xla_extension &&
mkdir -p /home/awlex/.cache/xla/0.6.0/cache/build/ &&
cp -f /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/bazel-bin/xla/extension/xla_extension.tar.gz /home/awlex/.cache/xla/0.6.0/cache/build/xla_extension-x86_64-linux-gnu-rocm.tar.gz
Starting local Bazel server and connecting to it...
INFO: Reading 'startup' options from /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc: --windows_enable_symlinks
INFO: Options provided by the client:
Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'build' from /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc:
Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc:
'build' options: --define framework_shared_object=true --define tsl_protobuf_header_only=true --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --features=-force_no_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2 --define=no_aws_support=true --define=no_hdfs_support=true --experimental_cc_shared_library --experimental_link_static_libraries_once=false --incompatible_enforce_config_setting_visibility
INFO: Found applicable config definition build:short_logs in file /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:rocm in file /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc: --crosstool_top=@local_config_rocm//crosstool:toolchain --define=using_rocm_hipcc=true --define=tensorflow_mkldnn_contraction_kernel=0 --repo_env TF_NEED_ROCM=1 --config=no_tfrt
INFO: Found applicable config definition build:no_tfrt in file /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc: --deleted_packages=tensorflow/compiler/mlir/tfrt,tensorflow/compiler/mlir/tfrt/benchmarks,tensorflow/compiler/mlir/tfrt/ir,tensorflow/compiler/mlir/tfrt/ir/mlrt,tensorflow/compiler/mlir/tfrt/jit/python_binding,tensorflow/compiler/mlir/tfrt/jit/transforms,tensorflow/compiler/mlir/tfrt/python_tests,tensorflow/compiler/mlir/tfrt/tests,tensorflow/compiler/mlir/tfrt/tests/ifrt,tensorflow/compiler/mlir/tfrt/tests/mlrt,tensorflow/compiler/mlir/tfrt/tests/ir,tensorflow/compiler/mlir/tfrt/tests/analysis,tensorflow/compiler/mlir/tfrt/tests/jit,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_tfrt,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_jitrt,tensorflow/compiler/mlir/tfrt/tests/tf_to_corert,tensorflow/compiler/mlir/tfrt/tests/tf_to_tfrt_data,tensorflow/compiler/mlir/tfrt/tests/saved_model,tensorflow/compiler/mlir/tfrt/transforms/lhlo_gpu_to_tfrt_gpu,tensorflow/compiler/mlir/tfrt/transforms/mlrt,tensorflow/core/runtime_fallback,tensorflow/core/runtime_fallback/conversion,tensorflow/core/runtime_fallback/kernel,tensorflow/core/runtime_fallback/opdefs,tensorflow/core/runtime_fallback/runtime,tensorflow/core/runtime_fallback/util,tensorflow/core/runtime_fallback/test,tensorflow/core/runtime_fallback/test/gpu,tensorflow/core/runtime_fallback/test/saved_model,tensorflow/core/runtime_fallback/test/testdata,tensorflow/core/tfrt/stubs,tensorflow/core/tfrt/tfrt_session,tensorflow/core/tfrt/mlrt,tensorflow/core/tfrt/mlrt/attribute,tensorflow/core/tfrt/mlrt/kernel,tensorflow/core/tfrt/mlrt/bytecode,tensorflow/core/tfrt/mlrt/interpreter,tensorflow/compiler/mlir/tfrt/translate/mlrt,tensorflow/compiler/mlir/tfrt/translate/mlrt/testdata,tensorflow/core/tfrt/gpu,tensorflow/core/tfrt/run_handler_thread_pool,tensorflow/core/tfrt/runtime,tensorflow/core/tfrt/saved_model,tensorflow/core/tfrt/graph_executor,tensorflow/core/tfrt/saved_model/tests,tensorflow/core/tfrt/tpu,tensorflow/core/tfrt/utils,tensorflow/core/tfrt/utils/debug,tensorflow/core/tfrt/saved_model/python,tensorflow/core/tfrt/graph_executor/python,tensorflow/core/tfrt/saved_model/utils
INFO: Found applicable config definition build:linux in file /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc: --host_copt=-w --copt=-Wno-all --copt=-Wno-extra --copt=-Wno-deprecated --copt=-Wno-deprecated-declarations --copt=-Wno-ignored-attributes --copt=-Wno-array-bounds --copt=-Wunused-result --copt=-Werror=unused-result --copt=-Wswitch --copt=-Werror=switch --copt=-Wno-error=unused-but-set-variable --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=dynamic_kernels --experimental_guard_against_concurrent_changes
INFO: Found applicable config definition build:dynamic_kernels in file /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
Loading:
Loading: 0 packages loaded
DEBUG: /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/third_party/repo.bzl:132:14:
Warning: skipping import of repository 'llvm-raw' because it already exists.
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
Loading: 0 packages loaded
currently loading: xla/extension
Analyzing: target //xla/extension:xla_extension (1 packages loaded, 0 targets configured)
Analyzing: target //xla/extension:xla_extension (36 packages loaded, 14 targets configured)
Analyzing: target //xla/extension:xla_extension (36 packages loaded, 14 targets configured)
Analyzing: target //xla/extension:xla_extension (179 packages loaded, 13765 targets configured)
INFO: Analyzed target //xla/extension:xla_extension (182 packages loaded, 16076 targets configured).
INFO: Found 1 target...
[0 / 196] [Prepa] Writing script xla/extension/xla_extension_headers.genrule_script.sh
[44 / 4,961] Compiling src/google/protobuf/compiler/cpp/file.cc; 3s local ... (12 actions, 11 running)
[71 / 4,961] Compiling src/google/protobuf/compiler/cpp/field.cc; 1s local ... (12 actions, 11 running)
[97 / 4,961] Compiling src/google/protobuf/compiler/command_line_interface.cc; 4s local ... (12 actions, 11 running)
[133 / 4,961] Compiling src/google/protobuf/util/internal/protostream_objectsource.cc; 2s local ... (12 actions, 11 running)
[173 / 4,961] Compiling src/google/protobuf/descriptor.cc; 8s local ... (12 actions, 11 running)
[256 / 5,172] Compiling llvm/lib/TableGen/TGParser.cpp [for host]; 3s local ... (12 actions, 11 running)
[339 / 5,172] Compiling llvm/lib/Support/KnownBits.cpp [for host]; 2s local ... (12 actions, 11 running)
[448 / 5,400] Compiling llvm/lib/Support/VirtualFileSystem.cpp; 4s local ... (12 actions, 11 running)
[542 / 5,400] Compiling llvm/lib/Support/Caching.cpp; 1s local ... (12 actions, 11 running)
[897 / 6,732] Compiling mlir/tools/mlir-tblgen/OpDocGen.cpp; 4s local ... (12 actions, 11 running)
[958 / 6,732] Compiling llvm/utils/TableGen/GlobalISelCombinerEmitter.cpp [for host]; 12s local ... (12 actions, 11 running)
[1,130 / 6,732] Compiling mlir/tools/mlir-tblgen/OpDefinitionsGen.cpp [for host]; 10s local ... (12 actions, 11 running)
[1,362 / 6,732] Generating code from table: lib/Target/AMDGPU/AMDGPU.td @llvm-project//llvm:AMDGPUCommonTableGen__gen_register_bank_genrule; 22s local ... (12 actions, 11 running)
[1,813 / 6,732] Generating code from table: lib/Target/AMDGPU/AMDGPU.td @llvm-project//llvm:AMDGPUCommonTableGen__gen_asm_matcher_genrule; 16s local ... (12 actions, 11 running)
[2,160 / 6,907] Compiling xla/hlo/utils/hlo_sharding_util.cc; 5s local ... (12 actions running)
[2,285 / 6,907] Compiling xla/service/hlo_rematerialization.cc; 12s local ... (12 actions running)
[2,446 / 6,907] Compiling llvm/lib/IR/AutoUpgrade.cpp; 7s local ... (12 actions running)
[2,624 / 6,907] Compiling xla/service/gpu/cub_sort_kernel.cu.cc; 14s local ... (12 actions, 11 running)
[2,758 / 6,907] Compiling xla/service/gpu/cub_sort_kernel.cu.cc; 12s local ... (12 actions, 11 running)
[2,942 / 6,907] Compiling src/cpu/x64/gemm/f32/jit_avx2_f32_copy_an_kern_autogen.cpp; 8s local ... (12 actions, 11 running)
[3,150 / 6,907] Compiling src/cpu/x64/jit_uni_resampling_kernel.cpp; 13s local ... (12 actions, 11 running)
[3,366 / 6,907] Compiling src/cpu/x64/jit_brgemm_conv.cpp; 45s local ... (12 actions, 11 running)
[3,642 / 6,908] Compiling llvm/lib/Passes/PassBuilder.cpp; 45s local ... (12 actions, 11 running)
[3,929 / 6,908] Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp; 56s local ... (12 actions, 11 running)
[4,340 / 6,908] Compiling stablehlo/dialect/StablehloOps.cpp; 41s local ... (12 actions, 11 running)
[4,628 / 6,908] Compiling mlir/lib/Dialect/SPIRV/IR/SPIRVOpDefinition.cpp; 33s local ... (12 actions, 11 running)
[4,938 / 6,908] Compiling llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp; 13s local ... (12 actions, 11 running)
[5,317 / 6,908] Compiling llvm/lib/Target/X86/X86ISelLowering.cpp; 22s local ... (12 actions, 11 running)
[5,761 / 6,908] Compiling mlir/lib/Dialect/Linalg/IR/LinalgDialect.cpp; 33s local ... (12 actions, 11 running)
[6,253 / 6,908] Compiling xla/mlir_hlo/mhlo/IR/hlo_ops.cc; 67s local ... (12 actions, 11 running)
[6,854 / 6,909] Compiling xla/service/gpu/runtime/fused_attention.cc; 64s local ... (12 actions, 11 running)
ERROR: /home/awlex/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/xla/service/gpu/BUILD:257:11: Compiling xla/service/gpu/ir_emitter_unnested.cc failed: (Exit 1): crosstool_wrapper_driver_is_not_gcc failed: error executing command external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer ... (remaining 356 arguments skipped)
/home/awlex/.cache/bazel/_bazel_awlex/74b6e6c2abb213e1ba59aee5534c65a2/execroot/xla/external/local_config_rocm/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc:23: DeprecationWarning: 'pipes' is deprecated and slated for removal in Python 3.13
import pipes
In file included from ./xla/shape_util.h:36,
from ./xla/index_util.h:25,
from ./xla/literal.h:41,
from ./xla/hlo/ir/dfs_hlo_visitor.h:26,
from ./xla/hlo/ir/hlo_computation.h:32,
from ./xla/service/gpu/ir_emitter_unnested.h:31,
from xla/service/gpu/ir_emitter_unnested.cc:16:
external/com_google_absl/absl/log/check.h:57: warning: "CHECK" redefined
57 | #define CHECK(condition) ABSL_LOG_INTERNAL_CHECK_IMPL((condition), #condition)
|
In file included from external/tsl/tsl/platform/logging.h:26,
from external/tsl/tsl/platform/status.h:34,
from ./xla/status.h:19,
from ./xla/statusor.h:18,
from ./xla/hlo/ir/hlo_opcode.h:24,
from ./xla/hlo/ir/dfs_hlo_visitor.h:25:
external/tsl/tsl/platform/default/logging.h:308: note: this is the location of the previous definition
308 | #define CHECK(condition)              

|
external/com_google_absl/absl/log/check.h:65: warning: "QCHECK" redefined
65 | #define QCHECK(condition) ABSL_LOG_INTERNAL_QCHECK_IMPL((condition), #condition)
|
external/tsl/tsl/platform/default/logging.h:542: note: this is the location of the previous definition
542 | #define QCHECK(condition) CHECK(condition)
|
external/com_google_absl/absl/log/check.h:88: warning: "DCHECK" redefined
88 | #define DCHECK(condition) ABSL_LOG_INTERNAL_DCHECK_IMPL((condition), #condition)
|
external/tsl/tsl/platform/default/logging.h:521: note: this is the location of the previous definition
521 | #define DCHECK(condition) 

|
external/com_google_absl/absl/log/check.h:116: warning: "CHECK_EQ" redefined
116 | #define CHECK_EQ(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:499: note: this is the location of the previous definition
499 | #define CHECK_EQ(val1, val2) CHECK_OP(Check_EQ, ==, val1, val2)
|
external/com_google_absl/absl/log/check.h:118: warning: "CHECK_NE" redefined
118 | #define CHECK_NE(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:500: note: this is the location of the previous definition
500 | #define CHECK_NE(val1, val2) CHECK_OP(Check_NE, !=, val1, val2)
|
external/com_google_absl/absl/log/check.h:120: warning: "CHECK_LE" redefined
120 | #define CHECK_LE(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:501: note: this is the location of the previous definition
501 | #define CHECK_LE(val1, val2) CHECK_OP(Check_LE, <=, val1, val2)
|
external/com_google_absl/absl/log/check.h:122: warning: "CHECK_LT" redefined
122 | #define CHECK_LT(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:502: note: this is the location of the previous definition
502 | #define CHECK_LT(val1, val2) CHECK_OP(Check_LT, <, val1, val2)
|
external/com_google_absl/absl/log/check.h:124: warning: "CHECK_GE" redefined
124 | #define CHECK_GE(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:503: note: this is the location of the previous definition
503 | #define CHECK_GE(val1, val2) CHECK_OP(Check_GE, >=, val1, val2)
|
external/com_google_absl/absl/log/check.h:126: warning: "CHECK_GT" redefined
126 | #define CHECK_GT(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:504: note: this is the location of the previous definition
504 | #define CHECK_GT(val1, val2) CHECK_OP(Check_GT, >, val1, val2)
|
external/com_google_absl/absl/log/check.h:128: warning: "QCHECK_EQ" redefined
128 | #define QCHECK_EQ(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:543: note: this is the location of the previous definition
543 | #define QCHECK_EQ(x, y) CHECK_EQ(x, y)
|
external/com_google_absl/absl/log/check.h:130: warning: "QCHECK_NE" redefined
130 | #define QCHECK_NE(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:544: note: this is the location of the previous definition
544 | #define QCHECK_NE(x, y) CHECK_NE(x, y)
|
external/com_google_absl/absl/log/check.h:132: warning: "QCHECK_LE" redefined
132 | #define QCHECK_LE(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:545: note: this is the location of the previous definition
545 | #define QCHECK_LE(x, y) CHECK_LE(x, y)
|
external/com_google_absl/absl/log/check.h:134: warning: "QCHECK_LT" redefined
134 | #define QCHECK_LT(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:546: note: this is the location of the previous definition
546 | #define QCHECK_LT(x, y) CHECK_LT(x, y)
|
external/com_google_absl/absl/log/check.h:136: warning: "QCHECK_GE" redefined
136 | #define QCHECK_GE(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:547: note: this is the location of the previous definition
547 | #define QCHECK_GE(x, y) CHECK_GE(x, y)
|
external/com_google_absl/absl/log/check.h:138: warning: "QCHECK_GT" redefined
138 | #define QCHECK_GT(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:548: note: this is the location of the previous definition
548 | #define QCHECK_GT(x, y) CHECK_GT(x, y)
|
external/com_google_absl/absl/log/check.h:140: warning: "DCHECK_EQ" redefined
140 | #define DCHECK_EQ(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:531: note: this is the location of the previous definition
531 | #define DCHECK_EQ(x, y) _TF_DCHECK_NOP(x, y)
|
external/com_google_absl/absl/log/check.h:142: warning: "DCHECK_NE" redefined
142 | #define DCHECK_NE(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:532: note: this is the location of the previous definition
532 | #define DCHECK_NE(x, y) _TF_DCHECK_NOP(x, y)
|
external/com_google_absl/absl/log/check.h:144: warning: "DCHECK_LE" redefined
144 | #define DCHECK_LE(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:533: note: this is the location of the previous definition
533 | #define DCHECK_LE(x, y) _TF_DCHECK_NOP(x, y)
|
external/com_google_absl/absl/log/check.h:146: warning: "DCHECK_LT" redefined
146 | #define DCHECK_LT(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:534: note: this is the location of the previous definition
534 | #define DCHECK_LT(x, y) _TF_DCHECK_NOP(x, y)
|
external/com_google_absl/absl/log/check.h:148: warning: "DCHECK_GE" redefined
148 | #define DCHECK_GE(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:535: note: this is the location of the previous definition
535 | #define DCHECK_GE(x, y) _TF_DCHECK_NOP(x, y)
|
external/com_google_absl/absl/log/check.h:150: warning: "DCHECK_GT" redefined
150 | #define DCHECK_GT(val1, val2) 

|
external/tsl/tsl/platform/default/logging.h:536: note: this is the location of the previous definition
536 | #define DCHECK_GT(x, y) _TF_DCHECK_NOP(x, y)
|
xla/service/gpu/ir_emitter_unnested.cc: In member function ‘tsl::Status xla::gpu::IrEmitterUnnested::EmitCubDeviceRadixSort(mlir::Operation*)’:
xla/service/gpu/ir_emitter_unnested.cc:1472:33: error: ‘CubSortThunk’ was not declared in this scope
1472 |   auto thunk = std::make_unique<CubSortThunk>(
|                                 ^~~~~~~~~~~~
xla/service/gpu/ir_emitter_unnested.cc:1472:46: error: no matching function for call to ‘make_unique< >(xla::gpu::Thunk::ThunkInfo, xla::PrimitiveType, std::optional<xla::PrimitiveType>, std::vector<xla::BufferAllocation::Slice>&, std::vector<xla::BufferAllocation::Slice>&, xla::BufferAllocation::Slice&, bool)’
1472 |   auto thunk = std::make_unique<CubSortThunk>(
|                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
1473 |       Thunk::ThunkInfo::WithProfileAnnotation(op),
|       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1474 |       GetShape(op->getOperand(0)).element_type(),
|       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1475 |       radix_sort_op.getInputs().size() == 2
|       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1476 |           ? std::optional(GetShape(op->getOperand(1)).element_type())
|           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1477 |           : std::nullopt,
|           ~~~~~~~~~~~~~~~
1478 |       operands, results, scratch, radix_sort_op.getDescending());
|       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/memory:78,
from ./xla/service/gpu/ir_emitter_unnested.h:21:
/usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/bits/unique_ptr.h:1069:5: note: candidate: ‘template<class _Tp, class ... _Args> std::__detail::__unique_ptr_t<_Tp> std::make_unique(_Args&& ...)’
1069 |     make_unique(_Args&&... __args)
|     ^~~~~~~~~~~
/usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/bits/unique_ptr.h:1069:5: note:   template argument deduction/substitution failed:
xla/service/gpu/ir_emitter_unnested.cc:1472:46: error: template argument 1 is invalid
1472 |   auto thunk = std::make_unique<CubSortThunk>(
|                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
1473 |       Thunk::ThunkInfo::WithProfileAnnotation(op),
|       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1474 |       GetShape(op->getOperand(0)).element_type(),
|       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1475 |       radix_sort_op.getInputs().size() == 2
|       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1476 |           ? std::optional(GetShape(op->getOperand(1)).element_type())
|           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1477 |           : std::nullopt,
|           ~~~~~~~~~~~~~~~
1478 |       operands, results, scratch, radix_sort_op.getDescending());
|       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/bits/unique_ptr.h:1084:5: note: candidate: ‘template<class _Tp> std::__detail::__unique_ptr_array_t<_Tp> std::make_unique(size_t)’
1084 |     make_unique(size_t __num)
|     ^~~~~~~~~~~
/usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/bits/unique_ptr.h:1084:5: note:   candidate expects 1 argument, 7 provided
/usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/bits/unique_ptr.h:1094:5: note: candidate: ‘template<class _Tp, class ... _Args> std::__detail::__invalid_make_unique_t<_Tp> std::make_unique(_Args&& ...)’ (deleted)
1094 |     make_unique(_Args&&...) = delete;
|     ^~~~~~~~~~~
/usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/../../../../include/c++/13.2.1/bits/unique_ptr.h:1094:5: note:   template argument deduction/substitution failed:
xla/service/gpu/ir_emitter_unnested.cc:1472:46: error: template argument 1 is invalid
1472 |   auto thunk = std::make_unique<CubSortThunk>(
|                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
1473 |       Thunk::ThunkInfo::WithProfileAnnotation(op),
|       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1474 |       GetShape(op->getOperand(0)).element_type(),
|       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1475 |       radix_sort_op.getInputs().size() == 2
|       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1476 |           ? std::optional(GetShape(op->getOperand(1)).element_type())
|           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1477 |           : std::nullopt,
|           ~~~~~~~~~~~~~~~
1478 |       operands, results, scratch, radix_sort_op.getDescending());
|       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/include/unistd.h:226,
from external/com_google_absl/absl/base/internal/thread_identity.h:27,
from external/com_google_absl/absl/synchronization/mutex.h:70,
from external/com_google_absl/absl/strings/internal/cordz_info.h:31,
from external/com_google_absl/absl/strings/cord.h:91,
from external/com_google_absl/absl/container/internal/hash_function_defaults.h:56,
from external/com_google_absl/absl/container/flat_hash_map.h:41,
from ./xla/hlo/ir/hlo_computation.h:26:
external/tsl/tsl/concurrency/async_value.h: In instantiation of ‘static void tsl::internal::ConcreteAsyncValue<T>::VerifyOffsets() [with T = tsl::DummyValueForErrorAsyncValue]’:
external/tsl/tsl/concurrency/async_value.h:536:18:   required from ‘tsl::internal::ConcreteAsyncValue<T>::ConcreteAsyncValue(absl::lts_20230802::Status) [with T = tsl::DummyValueForErrorAsyncValue]’
external/tsl/tsl/concurrency/async_value.h:727:30:   required from here
external/tsl/tsl/concurrency/async_value.h:702:28: warning: ‘offsetof’ within non-standard-layout type ‘tsl::internal::ConcreteAsyncValue<tsl::DummyValueForErrorAsyncValue>’ is conditionally-supported [-Winvalid-offsetof]
702 |     static_assert(offsetof(ConcreteAsyncValue<T>, data_store.data) ==
|                            ^
external/tsl/tsl/concurrency/async_value.h:706:28: warning: ‘offsetof’ within non-standard-layout type ‘tsl::internal::ConcreteAsyncValue<tsl::DummyValueForErrorAsyncValue>’ is conditionally-supported [-Winvalid-offsetof]
706 |     static_assert(offsetof(ConcreteAsyncValue<T>, data_store.error) ==
|                            ^
Target //xla/extension:xla_extension failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 2466.368s, Critical Path: 158.87s
INFO: 6899 processes: 469 internal, 6430 local.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
make: *** [Makefile:26: /home/awlex/.cache/xla/0.6.0/cache/build/xla_extension-x86_64-linux-gnu-rocm.tar.gz] Error 1
could not compile dependency :xla, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile xla --force", update it with "mix deps.update xla" or clean it with "mix deps.clean xla"
==> fusemega
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".

Dec 27 '23 22:12 Awlexus

Hey @Awlexus, this could be an issue with the build environment. To be sure, you can alternatively use the Docker scripts (./build.sh rocm), then use XLA_ARCHIVE_URL=file:///path/to/build.tar.gz accordingly.

In case your GPU uses gfx1100 (7900 XTX), you may need to use a more recent XLA revision as per https://github.com/elixir-nx/xla/issues/63#issuecomment-1844195261 (either by setting OPENXLA_GIT_REV with mix compile or changing the Makefile directly in case of the Docker build).
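
As a rough sketch of the prebuilt-archive route: the helper below turns a local archive path into a file:// URL suitable for XLA_ARCHIVE_URL. The function name xla_archive_url and the archive filename are illustrative assumptions; only the XLA_ARCHIVE_URL, XLA_BUILD, XLA_TARGET and OPENXLA_GIT_REV variables come from this thread.

```shell
# Hypothetical helper: build a file:// URL from a local archive path.
# Absolute paths are used as-is; relative paths are anchored at $PWD.
xla_archive_url() {
  case "$1" in
    /*) printf 'file://%s\n' "$1" ;;
    *)  printf 'file://%s/%s\n' "$PWD" "$1" ;;
  esac
}

# Usage, commented out because it needs an Elixir project and a built archive
# (the filename is a placeholder):
# XLA_ARCHIVE_URL="$(xla_archive_url ./xla_extension-x86_64-linux-gnu-rocm.tar.gz)" mix compile
#
# Building from source against a different XLA commit instead
# (<rev> is a placeholder for a commit hash):
# XLA_BUILD=true XLA_TARGET=rocm OPENXLA_GIT_REV=<rev> mix compile
```

The two commented commands are alternatives: the first reuses an archive produced elsewhere (for example by the Docker script), the second rebuilds locally.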

Dec 28 '23 06:12 jonatanklosko

Thanks @jonatanklosko, I was able to compile it by using a more recent XLA git ref, but I could not get it to run on the GPU. I tried again using the Docker script to build it (which took a long time) and hit the same error. It was able to allocate the memory, but the program was stopped by the operating system soon afterwards. I'm not sure where exactly this error comes from.

Error log


2023-12-28 23:43:05.394087: E xla/stream_executor/plugin_registry.cc:90] Invalid plugin kind specified: DNN
[info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[info] XLA service 0x7fa4c018dc30 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:
[info]   StreamExecutor device (0): AMD Radeon RX 6900 XT, AMDGPU ISA version: gfx1030
[info] Using BFC allocator.
[info] XLA backend allocating 15446782771 bytes on device 0 for BFCAllocator.
fish: Job 1, 'iex -S mix phx.server $argv' terminated by signal SIGSEGV (Address boundary error)

Dec 28 '23 22:12 Awlexus

Hmm, do you do any Nx stuff on boot? Does the error happen every time? I assume it doesn't happen if you use CPU only? You can also try ELIXIR_ERL_OPTIONS="+sssdio 128 +sssdcpu 128", though it rather helps with segfaults.

Dec 29 '23 05:12 jonatanklosko

Sorry for the late reply, I was away for a bit.

I'm not sure what changed since then, but now I'm getting a different error message. I had already written out a reply before I noticed the change, so I've added it at the end in case it's helpful.

I first ran into the error message (RuntimeError) bitcode module not found at ./opencl.bc, which I was able to resolve by setting ROCM_PATH=/opt/rocm. (Mentioning this in case someone else runs into it.)

Now I'm running into the following error, soon after which the process is killed with SIGABRT:

2023-12-31 18:56:44.607676: E xla/stream_executor/plugin_registry.cc:90] Invalid plugin kind specified: DNN
[info] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[info] XLA service 0x7fe7ac1707a0 initialized for platform ROCM (this does not guarantee that XLA will be used). Devices:
[info]   StreamExecutor device (0): AMD Radeon RX 6900 XT, AMDGPU ISA version: gfx1030
[info] Using BFC allocator.
[info] XLA backend allocating 15446782771 bytes on device 0 for BFCAllocator.

...

beam.smp: /usr/src/debug/hip-runtime-amd/clr-rocm-5.7.1/hipamd/src/hip_code_object.cpp:762: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion `err == hipSuccess' failed.

Old Reply

do you do any Nx stuff on boot?

I've added a serving of openai/whisper to my application's supervision tree, but that should be all:

    {:ok, model_info} = Bumblebee.load_model({:hf, @whisper_model})
    {:ok, featurizer} = Bumblebee.load_featurizer({:hf, @whisper_model})
    {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, @whisper_model})
    {:ok, generation_config} = Bumblebee.load_generation_config({:hf, @whisper_model})
    generation_config = Bumblebee.configure(generation_config, max_new_tokens: 100)

    serving =
      Bumblebee.Audio.speech_to_text_whisper(
        model_info,
        featurizer,
        tokenizer,
        generation_config,
        compile: [batch_size: 4],
        chunk_num_seconds: 30,
        stream: true,
        defn_options: [compiler: EXLA]
      )

Does the error happen every time? I assume it doesn't happen if you use CPU only?

Yes, it happens every time, before the serving is able to complete a single run

Dec 31 '23 18:12 Awlexus

Hmm, this looks like /opt/rocm is likely a symlink to a more specific version like /opt/rocm-5.7.1, let's set ROCM_PATH to that just to be sure. Otherwise maybe there's a certain ROCM HIP package missing in the environment?
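
One way to check where /opt/rocm really points, as a minimal sketch: the function name resolve_rocm_path is hypothetical, and /opt/rocm is the default only because it is the path mentioned here.

```shell
# Hypothetical helper: print the physical directory behind a ROCm install
# root, following any symlinks along the way.
resolve_rocm_path() {
  rocm_root="${1:-/opt/rocm}"
  if [ ! -d "$rocm_root" ]; then
    echo "not a directory: $rocm_root" >&2
    return 1
  fi
  # cd -P and pwd -P resolve symlinks to the physical path
  (cd -P -- "$rocm_root" && pwd -P)
}

# Example, commented out so the snippet has no side effects:
# export ROCM_PATH="$(resolve_rocm_path /opt/rocm)"
```

If the printed path equals the one passed in, no symlink is involved and ROCM_PATH can stay as it is.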

Jan 02 '24 09:01 jonatanklosko

I'm running Arch Linux and rely on the packages provided there, so I'm not sure what I could be missing. I have installed every package that pops up when I search for rocm, but just to be sure I've provided a list of the installed packages below.

Hmm, this looks like /opt/rocm is likely a symlink to a more specific version

/opt/rocm really just links to the packages installed on my system.

$ ls -lah /opt
drwxr-xr-x 34 root root 4.0K Dec 31 18:53  rocm/

Installed rocm packages

comgr 5.7.1-1 Compiler support library for ROCm LLVM
hip-runtime-amd 5.7.1-1 Heterogeneous Interface for Portability ROCm
hipblas 5.7.1-1 ROCm BLAS marshalling library
hsa-rocr 5.7.1-1 HSA Runtime API and runtime for ROCm
magma-hip 2.7.2-2 Matrix Algebra on GPU and Multicore Architectures (with ROCm/HIP)
python-pytorch-rocm 2.1.2-1 Tensors and Dynamic neural networks in Python with strong GPU acceleration (with ROCm)
python-torchvision-rocm 0.16.2-1 Datasets, transforms, and models specific to computer vision (with ROCM support)
rccl 5.7.1-1 ROCm Communication Collectives Library
rocalution 5.7.1-1 Next generation library for iterative sparse solvers for ROCm platform
rocblas 5.7.1-1 Next generation BLAS implementation for ROCm platform
rocfft 5.7.1-1 Next generation FFT implementation for ROCm
rocm-clang-ocl 5.7.1-1 OpenCL compilation with clang compiler
rocm-cmake 5.7.1-1 CMake modules for common build tasks needed for the ROCm software stack
rocm-core 5.7.1-1 AMD ROCm core package (version files)
rocm-device-libs 5.7.1-1 ROCm Device Libraries
rocm-hip-libraries 5.7.1-2 Develop certain applications using HIP and libraries for AMD platforms
rocm-hip-runtime 5.7.1-2 Packages to run HIP applications on the AMD platform
rocm-hip-sdk 5.7.1-2 Develop applications using HIP and libraries for AMD platforms
rocm-language-runtime 5.7.1-2 ROCm runtime
rocm-llvm 5.7.1-1 Radeon Open Compute - LLVM toolchain (llvm, clang, lld)
rocm-ml-libraries 5.7.1-2 Packages for key Machine Learning libraries
rocm-ml-sdk 5.7.1-2 develop and run Machine Learning applications optimized for AMD platforms
rocm-opencl-runtime 5.7.1-1 OpenCL implementation for AMD
rocm-opencl-sdk 5.7.1-2 Develop OpenCL-based applications for AMD platforms
rocm-smi-lib 5.7.1-1 ROCm System Management Interface Library
rocminfo 5.7.1-1 ROCm Application for Reporting System Info
rocrand 5.7.1-1 Pseudo-random and quasi-random number generator on ROCm
rocsolver 5.7.1-1 Subset of LAPACK functionality on the ROCm platform
rocsparse 5.7.1-1 BLAS for sparse computation on top of ROCm
rocthrust 5.7.1-1 Port of the Thrust parallel algorithm library atop HIP/ROCm
roctracer 5.7.1-1 ROCm tracer library for performance tracing

Jan 04 '24 13:01 Awlexus

I see. It must be something environment related, given that others managed to run it with that revision, but I don't have any more guesses right now.

One alternative would be running everything inside Docker, though that's not exactly convenient. Or you could try building with the latest openxla revision to see if it's something fixed upstream, but note that this usually requires some adjustments in the build file and/or in exla (depending on how much the XLA APIs changed).

Jan 05 '24 04:01 jonatanklosko

We just had a new release, see https://github.com/elixir-nx/xla/issues/82#issuecomment-2124230058. You can try it with ROCm 6.0, and if there are issues, leave a comment on #82 :)

May 22 '24 08:05 jonatanklosko
