[Tracking] @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so failed to build after updating XLA pin
🐛 Bug
After updating the XLA pin from 32ebd694c4d0442e241d76324ff1a721831366b4 to 590cd6fcd1ed24ab9cf494789a0fc524b94a4a6a in PR https://github.com/pytorch/xla/pull/8079/files, our CI fails: https://github.com/pytorch/xla/actions/runs/11060810258/job/30732124138?pr=8079
The step that fails is bazel build @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so, which is not one of our own targets.
The exact error is:
ERROR: /github/home/.cache/bazel/_bazel_root/197a057057a49e5811107144e2d78508/external/xla/xla/stream_executor/cuda/BUILD:450:19: no such target '@local_config_cuda//cuda:implicit_cuda_headers_dependency': target 'implicit_cuda_headers_dependency' not declared in package 'cuda' defined by /github/home/.cache/bazel/_bazel_root/197a057057a49e5811107144e2d78508/external/local_config_cuda/cuda/BUILD (Tip: use query "@local_config_cuda//cuda:*" to see all the targets in that package) and referenced by '@xla//xla/stream_executor/cuda:delay_kernel_cuda_cuda'
This @local_config_cuda repository is defined using upstream's (https://github.com/google/tsl) cuda_configure Starlark function, like this:
load(
"@tsl//third_party/gpus/cuda/hermetic:cuda_configure.bzl",
"cuda_configure",
)
cuda_configure(name = "local_config_cuda")
This code was copied from the deprecated section of this doc: https://github.com/openxla/xla/blob/main/docs/hermetic_cuda.md#deprecated-non-hermetic-cudacudnn-usage
Current theory:
The cuda_configure function is supposed to set up local_config_cuda so that it has the build targets that tsl needs, but this deprecated non-hermetic version does not do that.
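One way to check this theory is to look directly at the generated BUILD file named in the error. The following is a minimal sketch, not a definitive check: the helper name declares_target is hypothetical, and it only does a plain text match on `name = "..."`, so it would miss targets created by macros. The `bazel query "@local_config_cuda//cuda:*"` command suggested in the error message itself remains the authoritative way to list the targets.

```shell
# Hypothetical helper: succeeds if the BUILD file "$1" contains a rule
# whose name attribute is exactly "$2". This is a text match only, so
# targets generated by macros will not be found.
declares_target() {
  grep -Eq "name[[:space:]]*=[[:space:]]*\"$2\"" "$1"
}

# Example usage (commented out because it needs a Bazel workspace):
# build_file="$(bazel info output_base)/external/local_config_cuda/cuda/BUILD"
# if declares_target "$build_file" implicit_cuda_headers_dependency; then
#   echo "declared"
# else
#   echo "missing"
# fi
```

If the helper reports the target as missing while other CUDA targets are present, that would be consistent with the theory that the non-hermetic cuda_configure generates an older package layout.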
Actions tried so far:
We tried to follow the hermetic CUDA setup described in this doc: https://github.com/openxla/xla/blob/main/docs/hermetic_cuda.md#deprecated-non-hermetic-cudacudnn-usage
However, it requires the clang compiler instead of gcc.
I am attempting to use clang, but the line in our .bazelrc that forces gcc claims that clang has issues: https://github.com/pytorch/xla/blob/940bee453fb27a023b360469487af2a8831966d6/.bazelrc#L27
With clang, the build produces this error:
ERROR: /github/home/.cache/bazel/_bazel_root/197a057057a49e5811107144e2d78508/external/llvm-project/llvm/BUILD.bazel:251:11: Compiling llvm/lib/Support/Valgrind.cpp [for tool] failed: undeclared inclusion(s) in rule '@llvm-project//llvm:Support':
this rule is missing dependency declarations for the following files included by 'llvm/lib/Support/Valgrind.cpp':
'/usr/lib/clang/11.0.1/include/stddef.h'
'/usr/lib/clang/11.0.1/include/__stddef_max_align_t.h'
This is odd, because stddef.h is a system header and Bazel should not require an extra BUILD dependency declaration for it.
A Stack Overflow post says that we should clean the Bazel cache. We did that by adding bazel clean --expunge right before the build, and it still doesn't work.
The latest CI run with the above change is: https://github.com/pytorch/xla/actions/runs/11115985671/job/30885415097?pr=8079