GCC ICE: Segmentation Fault when building PyTorch/XLA

Open ysiraichi opened this issue 4 months ago • 4 comments

🐛 Bug

GCC fails to compile PyTorch/XLA (163193ebe9715d353c10b9fb6cb629ec88e2520e -- master branch), ending with an internal compiler error (ICE), apparently caused by a segmentation fault.

ERROR: external/xla/xla/service/spmd/shardy/stablehlo_round_trip/BUILD:44:11: Compiling xla/service/spmd/shardy/stablehlo_round_trip/export_ops.cc failed: (Exit 1): gcc failed: error executing CppCompile command (from target @@xla//xla/service/spmd/shardy/stablehlo_round_trip:export_ops) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 128 arguments skipped)
  In file included from <command-line>:
  /usr/include/stdc-predef.h: In substitution of 'template<class _Functor, class, class> std::function<std::unique_ptr<mlir::Pass>()>::function(_Functor) [with _Functor = <missing>; <template-parameter-1-2> = <missing>; <template-parameter-1-3> = <missing>]':
  external/xla/xla/service/spmd/shardy/stablehlo_round_trip/export_ops.cc:249:53:   required from here
  /usr/include/stdc-predef.h:32:70: internal compiler error: Segmentation fault
     32 |    whether the overall intent is to support these features; otherwise,
        |                                                                      ^
  0x7e2852d42ddf ???
        ./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
  0x7e2852d2dd79 __libc_start_main
        ../csu/libc-start.c:308
  Please submit a full bug report,
  with preprocessed source if appropriate.
  Please include the complete backtrace with any bug report.
  See <file:///usr/share/doc/gcc-10/README.Bugs> for instructions.
  [17,006 / 19,133] Compiling xla/mlir_hlo/mhlo/IR/hlo_ops.cc; 99s local ... (24 actions, 23 running)
  Target //:_XLAC.so failed to build
  Use --verbose_failures to see the command lines of failed build steps.
  INFO: Elapsed time: 697.676s, Critical Path: 175.00s
  INFO: 17030 processes: 10181 internal, 6849 local.
  ERROR: Build did NOT complete successfully
  error: command '/usr/local/bin/bazel' failed with exit code 1
  error: subprocess-exited-with-error

While I'm not sure exactly what this is, it could be related to this gcc-10 bug.

Setup

  • Same image used in this CI run: https://github.com/pytorch/xla/actions/runs/17246611174/job/48937880278?pr=9588
  • Docker image: us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/development:3.12_tpuvm
  • Docker image sha256: c194788bb5ea6d76806371a80c59c08f33c5ed0186a88e4e65a54245cf0a9014

Additional Context

  • Apparently, @qihqi also hit this error, which is why he updated gcc-10 to gcc-11 in #9565 (not on CI).
  • It's odd that CI doesn't end with this same error, even though the docker image used (thus, the compiler) is the same.

ysiraichi avatar Aug 27 '25 14:08 ysiraichi

I tried again, compiling PyTorch without CUDA support. Still didn't work.

ysiraichi avatar Aug 27 '25 15:08 ysiraichi

@ysiraichi is this still a problem? Is this why we don't have nightly torch-xla builds since 8/28?

jeffhataws avatar Sep 16 '25 17:09 jeffhataws

I don't think that's the reason, since CI is still running fine. Maybe @bhavya01 has some information on that.

ysiraichi avatar Sep 18 '25 13:09 ysiraichi

I am trying to debug with https://github.com/pytorch/xla/pull/9650

i.e., let's all use gcc-11
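
For anyone checking whether their environment still resolves `gcc` to gcc-10, here is a minimal sketch of a pre-build check. It is not part of the PyTorch/XLA build. The helper names (`parse_gcc_major`, `gcc_major`) and the `MIN_MAJOR = 11` threshold are assumptions made for illustration, based only on the move to gcc-11 discussed in this thread. It relies on `gcc -dumpversion`, which prints either the major version or the full version depending on how GCC was configured.

```python
import re
import shutil
import subprocess
from typing import Optional

# Assumption for illustration: treat gcc-11 as the minimum, per this thread.
MIN_MAJOR = 11


def parse_gcc_major(dumpversion_output: str) -> Optional[int]:
    """Return the major version from `gcc -dumpversion` output such as "10" or "10.2.1"."""
    match = re.match(r"\s*(\d+)", dumpversion_output)
    return int(match.group(1)) if match else None


def gcc_major(compiler: str = "gcc") -> Optional[int]:
    """Ask the compiler found on PATH for its version; None if it is missing."""
    path = shutil.which(compiler)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-dumpversion"], capture_output=True, text=True, check=False
    )
    return parse_gcc_major(result.stdout)


if __name__ == "__main__":
    major = gcc_major()
    if major is None:
        print("gcc not found on PATH, or its version could not be parsed")
    elif major < MIN_MAJOR:
        print(f"gcc major version is {major}; this thread suggests using gcc-{MIN_MAJOR}")
    else:
        print(f"gcc major version is {major}")
```

Running this inside the Docker image before starting the build shows which GCC the `gcc` on `PATH` resolves to. It does not tell you which compiler Bazel ends up invoking (the log above shows `/usr/bin/gcc`).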

qihqi avatar Sep 20 '25 00:09 qihqi