GCC ICE: Segmentation Fault when building PyTorch/XLA
🐛 Bug
GCC fails to compile PyTorch/XLA at commit 163193ebe9715d353c10b9fb6cb629ec88e2520e (master branch). The build ends with an internal compiler error (ICE), apparently caused by a segmentation fault.
ERROR: external/xla/xla/service/spmd/shardy/stablehlo_round_trip/BUILD:44:11: Compiling xla/service/spmd/shardy/stablehlo_round_trip/export_ops.cc failed: (Exit 1): gcc failed: error executing CppCompile command (from target @@xla//xla/service/spmd/shardy/stablehlo_round_trip:export_ops) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 128 arguments skipped)
In file included from <command-line>:
/usr/include/stdc-predef.h: In substitution of 'template<class _Functor, class, class> std::function<std::unique_ptr<mlir::Pass>()>::function(_Functor) [with _Functor = <missing>; <template-parameter-1-2> = <missing>; <template-parameter-1-3> = <missing>]':
external/xla/xla/service/spmd/shardy/stablehlo_round_trip/export_ops.cc:249:53: required from here
/usr/include/stdc-predef.h:32:70: internal compiler error: Segmentation fault
32 | whether the overall intent is to support these features; otherwise,
| ^
0x7e2852d42ddf ???
./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x7e2852d2dd79 __libc_start_main
../csu/libc-start.c:308
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <file:///usr/share/doc/gcc-10/README.Bugs> for instructions.
[17,006 / 19,133] Compiling xla/mlir_hlo/mhlo/IR/hlo_ops.cc; 99s local ... (24 actions, 23 running)
Target //:_XLAC.so failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 697.676s, Critical Path: 175.00s
INFO: 17030 processes: 10181 internal, 6849 local.
ERROR: Build did NOT complete successfully
error: command '/usr/local/bin/bazel' failed with exit code 1
error: subprocess-exited-with-error
I'm not sure exactly what causes this, but it could be related to this gcc-10 bug.
Setup
Same image used in this CI run: https://github.com/pytorch/xla/actions/runs/17246611174/job/48937880278?pr=9588
Docker image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.12_tpuvm
Docker image sha256: c194788bb5ea6d76806371a80c59c08f33c5ed0186a88e4e65a54245cf0a9014
Additional Context
- Apparently, @qihqi also hit this error, which is why he updated gcc-10 to gcc-11 in #9565 (not on CI).
- It's odd that CI doesn't end with this same error, even though the Docker image used (and thus the compiler) is the same.
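Since the discussion centers on the compiler version (gcc-10 vs. gcc-11), it may help to confirm which major version a given environment actually resolves when comparing a local container against CI. Below is a minimal sketch, assuming the compiler lives at /usr/bin/gcc as in the failing command above. The helper names `parse_gcc_major` and `gcc_major` are hypothetical and not part of the PyTorch/XLA build.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_gcc_major(version_text: str) -> Optional[int]:
    """Extract the major version from text such as '10' or '10.2.1'."""
    match = re.match(r"\s*(\d+)", version_text)
    return int(match.group(1)) if match else None


def gcc_major(compiler: str = "/usr/bin/gcc") -> Optional[int]:
    """Ask the compiler for its version; return None if it is not installed.

    The default path is an assumption taken from the failing build command.
    """
    if shutil.which(compiler) is None:
        return None
    result = subprocess.run(
        [compiler, "-dumpversion"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gcc_major(result.stdout)


if __name__ == "__main__":
    major = gcc_major()
    if major is None:
        print("gcc not found, or its version could not be parsed")
    elif major < 11:
        print(f"gcc {major}: older than gcc-11, the version suggested in this thread")
    else:
        print(f"gcc {major}")
```

Running this both locally and inside the CI container would show whether the two environments really pick up the same compiler. It does not explain the ICE itself.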
I tried again, this time compiling PyTorch without CUDA support. It still didn't work.
@ysiraichi is this still a problem? Is this why we haven't had nightly torch-xla builds since 8/28?
I don't think that's the reason, since CI is still running fine. Maybe @bhavya01 has some information on that.
I am trying to debug with https://github.com/pytorch/xla/pull/9650
i.e., let's all use gcc-11