
[Tracking] @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so failed to build after updating XLA pin

Open qihqi opened this issue 5 months ago • 3 comments

🐛 Bug

After updating the XLA pin from 32ebd694c4d0442e241d76324ff1a721831366b4 to 590cd6fcd1ed24ab9cf494789a0fc524b94a4a6a in PR https://github.com/pytorch/xla/pull/8079/files, our CI fails.

The failing run is: https://github.com/pytorch/xla/actions/runs/11060810258/job/30732124138?pr=8079. The step that fails is bazel build @xla//xla/pjrt/c:pjrt_c_api_gpu_plugin.so, which is not one of our own targets.

The exact error is

ERROR: /github/home/.cache/bazel/_bazel_root/197a057057a49e5811107144e2d78508/external/xla/xla/stream_executor/cuda/BUILD:450:19: no such target '@local_config_cuda//cuda:implicit_cuda_headers_dependency': target 'implicit_cuda_headers_dependency' not declared in package 'cuda' defined by /github/home/.cache/bazel/_bazel_root/197a057057a49e5811107144e2d78508/external/local_config_cuda/cuda/BUILD (Tip: use query "@local_config_cuda//cuda:*" to see all the targets in that package) and referenced by '@xla//xla/stream_executor/cuda:delay_kernel_cuda_cuda'
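To confirm that the target is really absent from the generated repository, one option is to inspect the generated BUILD file directly. The sketch below is a rough grep-based check, not a replacement for the bazel query that the error message suggests; the helper name has_target is hypothetical, and the pattern assumes the target is declared with a plain name = "..." attribute.

```shell
#!/usr/bin/env bash
# Sketch: check whether a BUILD file declares a target with a given name.
# has_target is a hypothetical helper introduced for illustration only.
# Assumption: the target is declared as  name = "<target>"  in the file.

has_target() {
  local build_file="$1" target="$2"
  # Allow optional whitespace around '='; the target name is used as-is in
  # the regex, so names containing regex metacharacters would need escaping.
  grep -Eq "name[[:space:]]*=[[:space:]]*\"${target}\"" "$build_file"
}
```

For example, `has_target "$(bazel info output_base)/external/local_config_cuda/cuda/BUILD" implicit_cuda_headers_dependency` returns a non-zero status if the name does not appear. Running `bazel query "@local_config_cuda//cuda:*"`, as the error tip says, lists the declared targets authoritatively.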

@local_config_cuda is defined using the cuda_configure Starlark function from upstream (https://github.com/google/tsl), like this:

load(
   "@tsl//third_party/gpus/cuda/hermetic:cuda_configure.bzl",
   "cuda_configure",
)

cuda_configure(name = "local_config_cuda")

This snippet was copied from the deprecated section of this doc: https://github.com/openxla/xla/blob/main/docs/hermetic_cuda.md#deprecated-non-hermetic-cudacudnn-usage

Current theory:

The cuda_configure function is supposed to set up local_config_cuda so that it contains the build targets that tsl needs, but this deprecated non-hermetic version does not do that.

Actions tried so far:

We tried to follow the hermetic cuda setup described in this doc: https://github.com/openxla/xla/blob/main/docs/hermetic_cuda.md#deprecated-non-hermetic-cudacudnn-usage

However, that setup requires the clang compiler instead of gcc.

I am attempting to use clang, but the line in our .bazelrc that forces gcc claims that clang has issues: https://github.com/pytorch/xla/blob/940bee453fb27a023b360469487af2a8831966d6/.bazelrc#L27

With clang it produces this error:

      ERROR: /github/home/.cache/bazel/_bazel_root/197a057057a49e5811107144e2d78508/external/llvm-project/llvm/BUILD.bazel:251:11: Compiling llvm/lib/Support/Valgrind.cpp [for tool] failed: undeclared inclusion(s) in rule '@llvm-project//llvm:Support':
      this rule is missing dependency declarations for the following files included by 'llvm/lib/Support/Valgrind.cpp':
        '/usr/lib/clang/11.0.1/include/stddef.h'
        '/usr/lib/clang/11.0.1/include/__stddef_max_align_t.h'

This is odd, because stddef.h is a system header and Bazel should not require an extra BUILD dependency declaration for it.
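One way to check whether the headers named in the error are compiler built-ins is to compare their paths against clang's resource directory. The sketch below only does the path comparison; is_under_dir is a hypothetical helper, not part of Bazel or clang, and it compares path strings literally without resolving symlinks.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a header path lies under a given directory,
# for example the compiler's resource directory.
# is_under_dir is a hypothetical helper introduced for illustration only.
# Assumption: a literal prefix comparison is enough (symlinks are not resolved).

is_under_dir() {
  local path="$1" dir="${2%/}"   # drop one trailing slash from the directory
  case "$path" in
    "$dir"/*) return 0 ;;        # path starts with "<dir>/"
    *)        return 1 ;;
  esac
}
```

For example, `is_under_dir /usr/lib/clang/11.0.1/include/stddef.h "$(clang -print-resource-dir)"` succeeds only if clang reports that same prefix. If clang reports a different path for the same directory, that difference may be worth looking into, though this is a guess rather than a confirmed cause.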

A Stack Overflow post suggests cleaning the Bazel cache. We did that by adding bazel clean --expunge right before the build, and it still doesn't work.

The latest CI with the above change is: https://github.com/pytorch/xla/actions/runs/11115985671/job/30885415097?pr=8079

qihqi, Oct 01 '24 00:10