xla_extension fails to build when using exla in a Docker container
The xla_extension build fails when I compile exla while building a Docker container. Here are some snippets from my Dockerfile:
ARG BUILDER_IMAGE="hexpm/elixir:1.14.0-erlang-24.0.1-debian-bullseye-20210902-slim"
ARG RUNNER_IMAGE="debian:bullseye-20210902-slim"
FROM ${BUILDER_IMAGE}
...
# install build dependencies
# https://github.com/elixir-nx/xla?tab=readme-ov-file#building-from-source
RUN apt-get update -y && apt-get install -y build-essential git apt-transport-https curl gnupg python3-pip gcc-9 g++-9 \
&& apt-get clean && rm -f /var/lib/apt/lists/*_*
RUN export CC=/usr/bin/gcc-9
# https://bazel.build/install/ubuntu#install-on-ubuntu
RUN curl -fsSL https://bazel.build/bazel-release.pub.gpg | gpg --dearmor >bazel-archive-keyring.gpg
RUN mv bazel-archive-keyring.gpg /usr/share/keyrings
RUN echo "deb [arch=amd64 signed-by=/usr/share/keyrings/bazel-archive-keyring.gpg] https://storage.googleapis.com/bazel-apt stable jdk1.8" | tee /etc/apt/sources.list.d/bazel.list
RUN apt-get update -y && apt-get install -y bazel-6.5.0
RUN ln -s /usr/bin/bazel-6.5.0 /usr/bin/bazel
RUN pip install numpy
...
I get this error when I build from the Dockerfile:
[4,467 / 5,843] Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp; 134s local ... (16 actions, 15 running)
[4,468 / 5,843] Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp; 136s local ... (16 actions running)
[4,469 / 5,843] Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp; 137s local ... (16 actions running)
[4,470 / 5,843] Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp; 139s local ... (16 actions running)
[4,470 / 5,843] Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp; 210s local ... (16 actions running)
ERROR: /home/user/.cache/bazel/_bazel_user/ee4c0f1833dfaa435cb867c88f5a190e/external/llvm-project/mlir/BUILD.bazel:4925:11: Compiling mlir/lib/Dialect/LLVMIR/IR/LLVMDialect.cpp failed: (Exit 1): gcc failed: error executing command (from target @llvm-project//mlir:LLVMDialect) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 85 arguments skipped)
gcc: fatal error: Killed signal terminated program cc1plus
compilation terminated.
Target //xla/extension:xla_extension failed to build
Use --verbose_failures to see the command lines of failed build steps.
[4,487 / 5,843] checking cached actions
INFO: Elapsed time: 1131.980s, Critical Path: 278.37s
INFO: 4487 processes: 343 internal, 4144 local.
FAILED: Build did NOT complete successfully
make: *** [Makefile:26: /home/user/.cache/xla/0.6.0/cache/build/xla_extension-x86_64-linux-gnu-cpu.tar.gz] Error 1
could not compile dependency :xla, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile xla", update it with "mix deps.update xla" or clean it with "mix deps.clean xla"
==> lai
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".
I only encounter this issue when building a Docker container. I do not encounter any issues when I run mix phx.server. Is there an official sample Dockerfile for cases where a Docker container setup is required?
Is there a reason you are trying to build XLA from source, rather than use the precompiled binaries?
We use these dockerfiles for precompilation, so those instructions should work.
Ideally, we would prefer not to build the extension from source. I noticed that xla gets built from source when we add exla to our dependencies. Here are the dependencies we've added along with exla:
{:bumblebee, "~> 0.5.3"},
{:nx, "~> 0.7.3"},
{:exla, "~> 0.7.3"},
{:explorer, "~> 0.9.0"}
We did not add xla to our list of dependencies, but somehow it gets added (maybe because it's part of Nx). Do you have a sample Dockerfile that we could use as a basis when using Bumblebee, Nx, and Exla without triggering a build of XLA from source? Our main goal for now is to be able to run Nx and Exla in a Docker container. 👍
By default it will download a precompiled version. Does it print anything saying it can't use a precompiled archive and therefore must compile from source?
I think it did. Here are some screenshots from today after removing the precompile steps in my Dockerfile
Do you have XLA_BUILD set by any chance?
I did not set it anywhere (.bash_profile, Dockerfile, etc.). According to the README.md, it defaults to false.
The build should trigger only when XLA_BUILD is set, otherwise it either downloads a precompiled binary or, if not available, raises an error.
One way to check would be to add RUN [ -z "$XLA_BUILD" ] || exit 1 before the compilation step and see if it goes on.
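As a rough illustration of where such a check could sit, here is a minimal Dockerfile sketch. The base image is the one from the question; the /app working directory, the presence of a mix.lock file, and the exact order of steps are assumptions made for the example, not the asker's actual file. The check is written as an if statement so that the value is printed before the build aborts.

```dockerfile
# Base image taken from the question above.
FROM hexpm/elixir:1.14.0-erlang-24.0.1-debian-bullseye-20210902-slim

RUN apt-get update -y && apt-get install -y build-essential git curl \
    && apt-get clean && rm -f /var/lib/apt/lists/*_*

# /app is a hypothetical working directory used only for this sketch.
WORKDIR /app
RUN mix local.hex --force && mix local.rebar --force

# Assumes mix.exs and mix.lock exist at the root of the build context.
COPY mix.exs mix.lock ./
RUN mix deps.get

# Abort the image build if XLA_BUILD is set to a non-empty value,
# printing the value so it appears in the build log.
RUN if [ -n "$XLA_BUILD" ]; then echo "XLA_BUILD is set to: $XLA_BUILD" >&2; exit 1; fi

# Compilation step; with XLA_BUILD unset this should not invoke bazel.
RUN mix deps.compile
```

A RUN check like this only sees variables that exist while the image is being built (for example from ENV or ARG instructions). Variables injected when the container starts, such as those from a compose file's environment section, are not visible to it, so if compilation happens at container start the same test would have to run there as well.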
I did notice the image uses a rather outdated combination of Elixir and OTP, as well as an older Debian. If possible, I'd update them to rule out the compilation being triggered because no precompiled archive was found for that version/platform.
It still went through 🥲
[+] Building 2.7s (14/14) FINISHED docker:default
=> [api internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 4.75kB 0.0s
=> [api internal] load metadata for docker.io/hexpm/elixir:1.14.0-erlang-24.0.1-debian-bullseye-20210902-slim 2.0s
=> [api auth] hexpm/elixir:pull token for registry-1.docker.io 0.0s
=> [api internal] load .dockerignore 0.0s
=> => transferring context: 1.31kB 0.0s
=> [api 1/8] FROM docker.io/hexpm/elixir:1.14.0-erlang-24.0.1-debian-bullseye-20210902-slim@sha256:02ed2d3f2e0360821017751464a6 0.0s
=> CACHED [api 2/8] RUN addgroup --gid 1000 user && adduser --disabled-password --ingroup user --uid 1000 user 0.0s
=> CACHED [api 3/8] RUN apt-get update -y && apt-get install -y build-essential git curl && apt-get clean && rm -f /var/lib 0.0s
=> CACHED [api 4/8] RUN mkdir -p /home/user/app && sh -c "git config --global url."https://${GITHUB_API_TOKEN}@github.com/" 0.0s
=> CACHED [api 5/8] WORKDIR /home/user/app 0.0s
=> CACHED [api 6/8] RUN mix local.hex --force && mix local.rebar --force 0.0s
=> CACHED [api 7/8] RUN mix do local.hex --force, local.rebar --force 0.0s
=> [api 8/8] RUN [ -z "$XLA_BUILD" ] || exit 1 0.4s
=> [api] exporting to image 0.1s
=> => exporting layers 0.1s
=> => writing image sha256:4af8189528e21cd493cfe8a2b41e0303905e614e6fe1526f3ceab03627094dab 0.0s
=> => naming to docker.io/library/lai-service-api 0.0s
=> [api] resolving provenance for metadata file 0.0s
WARN[0000] /home/jde/code/la/lai-service/docker-compose.yaml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion
[+] Creating 2/2
✔ Network lai-service_default Created 0.1s
✔ Container lai-service-db-1 Created 0.2s
[+] Running 1/1
✔ Container lai-service-db-1 Started 0.5s
Resolving Hex dependencies...
Resolution completed in 0.753s
Unchanged:
aws_rds_castore 1.2.0
aws_signature 0.3.2
axon 0.6.1
bumblebee 0.5.3
.....
===> Analyzing applications...
===> Compiling telemetry
===> Analyzing applications...
===> Compiling telemetry_poller
===> Analyzing applications...
===> Compiling certifi
===> Analyzing applications...
===> Compiling hackney
==> xla
Compiling 2 files (.ex)
Generated xla app
mkdir -p /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb && \
cd /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb && \
git init && \
git remote add origin https://github.com/openxla/xla.git && \
git fetch --depth 1 origin 771e38178340cbaaef8ff20f44da5407c15092cb && \
git checkout FETCH_HEAD && \
rm /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.bazelversion
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint:
hint: git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint: git branch -m <name>
Initialized empty Git repository in /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/.git/
warning: redirecting to https://github.com/openxla/xla.git/
From https://github.com/openxla/xla
* branch 771e38178340cbaaef8ff20f44da5407c15092cb -> FETCH_HEAD
Note: switching to 'FETCH_HEAD'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:
git switch -c <new-branch-name>
Or undo this operation with:
git switch -
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at 771e381 [XLA:GPU] Check tensor_float_32_execution_enabled() in Triton codegen too
rm -f /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/xla/extension && \
ln -s "/home/user/app/deps/xla/extension" /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/xla/extension && \
cd /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb && \
bazel build --define "framework_shared_object=false" -c opt //xla/extension:xla_extension && \
mkdir -p /home/user/.cache/xla/0.6.0/cache/build/ && \
cp -f /home/user/.cache/xla_extension/xla-771e38178340cbaaef8ff20f44da5407c15092cb/bazel-bin/xla/extension/xla_extension.tar.gz /home/user/.cache/xla/0.6.0/cache/build/xla_extension-x86_64-linux-gnu-cpu.tar.gz
/bin/sh: 4: bazel: not found
make: *** [Makefile:26: /home/user/.cache/xla/0.6.0/cache/build/xla_extension-x86_64-linux-gnu-cpu.tar.gz] Error 127
could not compile dependency :xla, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile xla", update it with "mix deps.update xla" or clean it with "mix deps.clean xla"
==> lai
** (Mix) Could not compile with "make" (exit status: 2).
You need to have gcc and make installed. If you are using
Ubuntu or any other Debian-based system, install the packages
"build-essential". Also install "erlang-dev" package if not
included in your Erlang/OTP version. If you're on Fedora, run
"dnf group install 'Development Tools'".
Interesting, I don't have any idea at the moment. It would be helpful if you could minimize it into a reproducible repo, like an empty mix project with the deps and the Dockerfile :)
Got something similar:
5.162 ==> xla
5.162 Compiling 5 files (.ex)
5.267 Generated xla app
5.315
5.315 17:30:36.318 [info] Downloading a precompiled XLA archive for target aarch64-linux-gnu-cpu
9.752
9.752 17:30:40.757 [info] Successfully downloaded the XLA archive
10.47 ==> exla
10.47 Unpacking /root/.cache/xla/0.8.0/download/xla_extension-0.8.0-aarch64-linux-gnu-cpu.tar.gz into /app/deps/exla/cache
15.12 g++ cache/0.9.2/objs/exla.o cache/0.9.2/objs/exla_client.o cache/0.9.2/objs/exla_mlir.o cache/0.9.2/objs/custom_calls.o cache/0.9.2/objs/exla_nif_util.o cache/0.9.2/objs/ipc.o cache/0.9.2/objs/custom_calls/eigh_f32.o cache/0.9.2/objs/custom_calls/eigh_f64.o cache/0.9.2/objs/custom_calls/lu_bf16.o cache/0.9.2/objs/custom_calls/lu_f16.o cache/0.9.2/objs/custom_calls/lu_f32.o cache/0.9.2/objs/custom_calls/lu_f64.o cache/0.9.2/objs/custom_calls/qr_bf16.o cache/0.9.2/objs/custom_calls/qr_f16.o cache/0.9.2/objs/custom_calls/qr_f32.o cache/0.9.2/objs/custom_calls/qr_f64.o cache/0.9.2/objs/exla_cuda.o -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -shared -Wl,-rpath,'$ORIGIN/xla_extension/lib'
15.14 cache/0.9.2/objs/exla.o: file not recognized: file format not recognized
15.14 collect2: error: ld returned 1 exit status
15.14 make: *** [Makefile:101: cache/libexla.so] Error 1
15.14 could not compile dependency :exla, "mix compile" failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile exla --force", update it with "mix deps.update exla" or clean it with "mix deps.clean exla"
15.14 ==> relax
15.14 ** (Mix) Could not compile with "make" (exit status: 2).
15.14 You need to have gcc and make installed. If you are using
15.14 Ubuntu or any other Debian-based system, install the packages
15.14 "build-essential". Also install "erlang-dev" package if not
15.14 included in your Erlang/OTP version. If you're on Fedora, run
15.14 "dnf group install 'Development Tools'".
[+] Running 0/1
⠹ Service api Building 94.2s
failed to solve: process "/bin/sh -c mix compile" did not complete successfully: exit code: 1
I'm running this on an M1 MacBook Pro. This is a bare-minimum Elixir repo available at https://github.com/georgeguimaraes/relax (using {:exla, "~> 0.9.2"})
All I'm running to trigger this is docker compose up --build in the repo.
Changing the dependency to {:exla, "~> 0.8.0"} makes it work:
api-1 | ==> exla
api-1 | Using libexla.so from /root/.cache/xla/exla/elixir-1.17.3-erts-15.2-xla-0.8.0-exla-0.8.0-ioo6ddg2zbm7ovoei2oc4ucrjy/libexla.so
api-1 | Compiling 23 files (.ex)
api-1 | Generated exla app
api-1 | ==> relax
api-1 | Compiling 1 file (.ex)
api-1 | Generated relax app
api-1 | Running ExUnit with seed: 697364, max_cases: 8
api-1 |
api-1 | ..
api-1 | Finished in 0.01 seconds (0.00s async, 0.01s sync)
Using {:exla, "0.9.1"} makes the Docker image recompile xla, but the build finishes and the tests run:
❯ docker compose up --build
[+] Running 0/0
[+] Running 0/1 Building 0.1s
[+] Building 57.3s (13/13) FINISHED docker:default
=> [api internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 577B 0.0s
=> [api internal] load metadata for mirror.gcr.io/hexpm/elixir:1.17.3-erlang-27.2-ubuntu-noble-20241015 1.3s
=> [api internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [api 1/7] FROM mirror.gcr.io/hexpm/elixir:1.17.3-erlang-27.2-ubuntu-noble-20241015@sha256:f3a173c0d868e720c77a63c83de10c4b169f939 0.0s
=> [api internal] load build context 0.5s
=> => transferring context: 3.14MB 0.5s
=> CACHED [api 2/7] RUN apt-get update -y && apt-get install -y inotify-tools build-essential erlang-dev git curl && apt-get clean 0.0s
=> CACHED [api 3/7] WORKDIR /app 0.0s
=> CACHED [api 4/7] RUN mix local.hex --force && mix local.rebar --force 0.0s
=> [api 5/7] COPY . . 0.9s
=> [api 6/7] RUN mix deps.get 2.8s
=> [api 7/7] RUN mix compile 49.7s
=> [api] exporting to image 2.1s
=> => exporting layers 2.1s
=> => writing image sha256:629ef48806cb54cd54e5c420d3761de5693c4b24cc56e60c70dada4c38250f04 0.0s
=> => naming to docker.io/library/relax-api 0.0s
[+] Running 2/1
✔ Service api Built 57.4s
✔ Container relax-api-1 Recreated 0.1s
Attaching to api-1
api-1 | ==> complex
api-1 | Compiling 2 files (.ex)
api-1 | Generated complex app
api-1 | ==> nx
api-1 | Compiling 36 files (.ex)
api-1 | Generated nx app
api-1 | ==> nimble_pool
api-1 | Compiling 2 files (.ex)
api-1 | Generated nimble_pool app
api-1 | ==> elixir_make
api-1 | Compiling 8 files (.ex)
api-1 | Generated elixir_make app
api-1 | ==> xla
api-1 | Compiling 5 files (.ex)
api-1 | Generated xla app
api-1 | ==> exla
api-1 | Using libexla.so from /root/.cache/xla/exla/elixir-1.17.3-erts-15.2-xla-0.8.0-exla-0.9.1-t34ppw6zq2bvv4txq247gllfci/libexla.so
api-1 | Compiling 24 files (.ex)
api-1 | warning: Nx.Defn.stream/3 is deprecated. Move the streaming loop to Elixir instead
api-1 | │
api-1 | 356 │ Nx.Defn.stream(function, args, Keyword.put(options, :compiler, EXLA))
api-1 | │ ~
api-1 | │
api-1 | └─ lib/exla.ex:356:13: EXLA.stream/3
api-1 |
api-1 | Generated exla app
api-1 | ==> relax
api-1 | Compiling 1 file (.ex)
api-1 | Generated relax app
api-1 | Running ExUnit with seed: 309870, max_cases: 8
api-1 |
api-1 | ..
api-1 | Finished in 0.01 seconds (0.00s async, 0.01s sync)
api-1 | 1 doctest, 1 test, 0 failures
api-1 exited with code 0
By the way, you'll see in my repo that I'm using the latest Elixir, OTP, and Ubuntu available.
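@georgeguimaraes in your case, the issue is that you do COPY . . in the Dockerfile, which also copies the deps/ and _build/ directories from your host (which I expect you have) into the Docker build. deps/ contains EXLA's platform-specific .o compilation artifacts, and reusing them inside the image fails: they were compiled on the host (macOS in your case), so the Linux linker in the container does not recognize them, which matches the "file format not recognized" error.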
I was able to reproduce the error by running mix deps.get and mix compile in the repo, and then running docker build . afterwards. Removing deps/ and _build/ makes the Docker build succeed :) The actual solution is to make COPY more specific so that it does not include these directories.
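One way the more specific COPY could look is sketched below. The base image comes from the build log above; the package list is a trimmed version of the one in that log, and the /app layout with lib/ and test/ directories and a mix.lock file is an assumption about a typical minimal Mix project, not a copy of the actual repo's Dockerfile.

```dockerfile
# Base image taken from the build log above.
FROM hexpm/elixir:1.17.3-erlang-27.2-ubuntu-noble-20241015

RUN apt-get update -y && apt-get install -y build-essential erlang-dev git curl \
    && apt-get clean

WORKDIR /app
RUN mix local.hex --force && mix local.rebar --force

# Copy only the dependency manifests, then fetch deps inside the image,
# so the host's deps/ and _build/ never enter the build.
COPY mix.exs mix.lock ./
RUN mix deps.get

# Copy source directories explicitly (assumed project layout).
COPY lib lib
COPY test test

# Native artifacts are now compiled for the container's platform.
RUN mix compile
```

An alternative with the same effect is to keep COPY . . and list deps/ and _build/ in a .dockerignore file next to the Dockerfile, which excludes them from the build context altogether.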
Tks @jonatanklosko! TIL :)