
Issue building exla from source

Open liamkillingback opened this issue 7 months ago • 5 comments

Hi, I'm just playing around in Livebook with Axon, Nx, EXLA and the like.

I've got an RTX 5060 Ti, so I need to build EXLA from source with a few environment variables to get it working, but I'm running into an issue:

I'm running the livebook:nightly-cuda12 container, with this Mix.install call:

```elixir
Mix.install([
  {:exla, "~> 0.9.2"},
  {:axon, "~> 0.7.0"},
  {:explorer, "~> 0.10.1"},
  {:kino, "~> 0.16.0"},
  {:kino_vega_lite, "~> 0.1.13"},
  {:nx, "~> 0.9.2"},
  {:polaris, "~> 0.1.0"},
  {:scholar, "~> 0.4.0"},
  {:req, "~> 0.5.10"}
],
system_env: [
  {"XLA_TARGET", "cuda12"},
  {"XLA_BUILD", "true"},
  {"TF_CUDA_COMPUTE_CAPABILITIES", "sm_50,sm_60,sm_70,sm_80,compute_90,sm_120"}
],
config: [
  nx: [
    default_backend: {EXLA.Backend, client: :cuda},
  ],
  exla: [
    clients: [
      cuda: [
        memory_fraction: 0.6,
        platform: :cuda
      ]
    ]
  ]
])
```

I get this error during the build:

```
++ pwd
+ dir=/home/livebook/.cache/xla_extension/xla-fd58925adee147d38c25a085354e15427a12d00a/patches
++ uname -m
+ arch=x86_64
+ [[ x86_64 == \a\a\r\c\h\6\4 ]]
rm -f /home/livebook/.cache/xla_extension/xla-fd58925adee147d38c25a085354e15427a12d00a/xla/extension && \
	ln -s "/home/livebook/.cache/mix/installs/elixir-1.18.3-erts-15.2.6/81eb9e35dd07f97207e14a0548a47e2a/deps/xla/extension" /home/livebook/.cache/xla_extension/xla-fd58925adee147d38c25a085354e15427a12d00a/xla/extension && \
	cd /home/livebook/.cache/xla_extension/xla-fd58925adee147d38c25a085354e15427a12d00a && \
	bazel build --define "framework_shared_object=false" -c opt   --config=cuda --action_env=TF_CUDA_COMPUTE_CAPABILITIES="sm_50,sm_60,sm_70,sm_80,compute_90" //xla/extension:xla_extension && \
	mkdir -p /home/livebook/.cache/xla/0.8.0/build/ && \
	cp -f /home/livebook/.cache/xla_extension/xla-fd58925adee147d38c25a085354e15427a12d00a/bazel-bin/xla/extension/xla_extension.tar.gz /home/livebook/.cache/xla/0.8.0/build/xla_extension-0.8.0-x86_64-linux-gnu-cuda12.tar.gz
/bin/sh: 4: bazel: not found
make: *** [Makefile:26: /home/livebook/.cache/xla/0.8.0/build/xla_extension-0.8.0-x86_64-linux-gnu-cuda12.tar.gz] Error 127
could not compile dependency :xla, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile xla --force", update it with "mix deps.update xla" or clean it with "mix deps.clean xla"

** (Mix.Error) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".

    (mix 1.18.3) lib/mix.ex:618: Mix.raise/2
    (elixir_make 0.9.0) lib/elixir_make/compiler.ex:53: ElixirMake.Compiler.compile/1
    (mix 1.18.3) lib/mix/task.ex:495: anonymous fn/3 in Mix.Task.run_task/5
    (mix 1.18.3) lib/mix/tasks/compile.all.ex:117: Mix.Tasks.Compile.All.run_compiler/2
    (mix 1.18.3) lib/mix/tasks/compile.all.ex:97: Mix.Tasks.Compile.All.compile/4
    (mix 1.18.3) lib/mix/tasks/compile.all.ex:71: Mix.Tasks.Compile.All.do_run/2
    (mix 1.18.3) lib/mix/task.ex:495: anonymous fn/3 in Mix.Task.run_task/5
```

I'm not sure where to go from here. I've checked in the container terminal that gcc and make are available, but they don't seem to be available inside the Livebook instance.

I've also tried different containers, and my host machine, and got the same error.

Any help would be epic! :)

Thanks a bunch

liamkillingback, May 16 '25 11:05

You are missing Bazel. The log shows "bazel: not found", which is why make exits with Error 127.

polvalente, May 16 '25 12:05

I suggest going to the elixir-nx/xla repository and checking which dependencies it installs.
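One way to see which build tools are actually visible to the runtime (which, as noted in the issue, may differ from what the container terminal sees) is a quick check with System.find_executable/1. This is a minimal sketch: the module name BuildToolCheck is hypothetical, and the tool list (bazel, gcc, make) is taken from this thread, not from the full dependency list in elixir-nx/xla.

```elixir
defmodule BuildToolCheck do
  # Hypothetical helper: returns the tools that System.find_executable/1
  # cannot locate on the PATH of the current BEAM process.
  @spec missing([String.t()]) :: [String.t()]
  def missing(tools) when is_list(tools) do
    Enum.reject(tools, &System.find_executable/1)
  end
end

# Assumption: this tool list comes from this thread and is not exhaustive.
case BuildToolCheck.missing(["bazel", "gcc", "make"]) do
  [] -> IO.puts("bazel, gcc and make are all on PATH")
  missing -> IO.puts("missing from PATH: " <> Enum.join(missing, ", "))
end
```

Running the same snippet in a Livebook cell and in iex inside the container terminal shows whether the two environments disagree.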

polvalente, May 16 '25 12:05

I've installed Bazel, but the build still fails in the Makefile. I'll have a look :)

liamkillingback, May 17 '25 00:05

If you're running in Livebook, make sure PATH is set correctly. Livebook Desktop runs with a different PATH configuration.
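If Bazel is installed but not found from Livebook, one option is to extend PATH for the runtime before Mix.install triggers the build (for example, at the top of the setup cell). OS processes started afterwards, such as make and the bazel it invokes, inherit the updated environment. A minimal sketch, assuming a Unix-style, colon-separated PATH and assuming bazel lives in /usr/local/bin; the PathHelper module is hypothetical.

```elixir
defmodule PathHelper do
  # Hypothetical helper: prepends `dir` to PATH for this BEAM process unless
  # it is already present. OS processes spawned later (e.g. via System.cmd)
  # inherit the updated value.
  @spec ensure_on_path(String.t()) :: :ok
  def ensure_on_path(dir) do
    # Assumes a Unix-style, colon-separated PATH.
    dirs = "PATH" |> System.get_env("") |> String.split(":", trim: true)

    if dir not in dirs do
      System.put_env("PATH", Enum.join([dir | dirs], ":"))
    end

    :ok
  end
end

IO.puts("PATH seen by this runtime: " <> System.get_env("PATH", ""))

# Assumption: bazel was installed into /usr/local/bin; adjust to your install.
PathHelper.ensure_on_path("/usr/local/bin")
IO.inspect(System.find_executable("bazel"), label: "bazel")
```

If Mix.install has already run in the current runtime, a restart is likely needed before a new build attempt picks up the change.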

polvalente, May 17 '25 02:05

I've got further: with Bazel 6.5.0, gcc-9 and apt install nvidia-cudnn (cudnn.h wasn't being found, and a fresh install fixed that), it compiles for a while but still crashes later with:

```
./xla/service/gpu/gpu_prim.h(63): error: no instance of overloaded function "cub::ThreadLoadVolatilePointer" matches the specified type
  ThreadLoadVolatilePointer<tsl::bfloat16>(tsl::bfloat16 *ptr,
  ^

4 errors detected in the compilation of "xla/service/gpu/cub_sort_kernel.cu.cc".
Target //xla/extension:xla_extension failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 2827.507s, Critical Path: 525.19s
INFO: 9970 processes: 4443 internal, 5527 local.
FAILED: Build did NOT complete successfully
```

Getting closer! haha
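Since a missing cudnn.h was one of the stumbling blocks, a quick way to see whether cuDNN headers exist in any obvious location is Path.wildcard/1. The candidate directories below are assumptions (typical Debian/Ubuntu and CUDA toolkit locations), not paths the XLA build is known to search, and CudnnHeaderCheck is a hypothetical name.

```elixir
defmodule CudnnHeaderCheck do
  # Assumption: illustrative install locations for cuDNN headers;
  # this list is not exhaustive and may not match your system.
  @patterns [
    "/usr/include/cudnn*.h",
    "/usr/include/x86_64-linux-gnu/cudnn*.h",
    "/usr/local/cuda/include/cudnn*.h"
  ]

  # Returns every file matching any of the glob patterns (possibly []).
  @spec find_headers([String.t()]) :: [String.t()]
  def find_headers(patterns \\ @patterns) do
    Enum.flat_map(patterns, &Path.wildcard/1)
  end
end

case CudnnHeaderCheck.find_headers() do
  [] -> IO.puts("no cuDNN headers found in the checked locations")
  paths -> Enum.each(paths, &IO.puts/1)
end
```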

liamkillingback, May 17 '25 04:05

@liamkillingback any updates here?

polvalente, Oct 01 '25 17:10

I no longer need to build from source, as I believe my card is now properly supported :) Happy to close this issue.

liamkillingback, Oct 01 '25 21:10