easybuild-easyconfigs

fix failing build for tensorstore 0.1.72 when using RPATH by passing `$TMPDIR` from host into Bazel sandbox

Open boegel opened this issue 10 months ago • 22 comments

(created using eb --new-pr)

fix for failing installation when using RPATH linking:

  Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
  src/main/tools/linux-sandbox-pid1.cc:548: "execvp(/tmp/eb-zilp5yc9/tmpr0fpmr63/rpath_wrappers/gcc_wrapper/gcc, 0x1d148c0)": No such file or directory
  Target //python/tensorstore:_tensorstore__shared_objects failed to build

boegel avatar Jun 18 '25 20:06 boegel

@boegelbot please test @ jsc-zen3

boegel avatar Jun 18 '25 20:06 boegel

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=23139 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_23139 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 6894

Test results coming soon (I hope)...

- notification for comment with ID 2985605127 processed

Message to humans: this is just bookkeeping information for me, it is of no use to you (unless you think I have a bug, which I don't).

boegelbot avatar Jun 18 '25 20:06 boegelbot

Test report by @boegelbot FAILED Build succeeded for 0 out of 1 (1 easyconfigs in total) jsczen3c2.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.5, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.21 See https://gist.github.com/boegelbot/e8082b7c102df5ae5e0ac360ea926a7d for a full test report.

boegelbot avatar Jun 18 '25 20:06 boegelbot

From failed test report:

  In file included from external/com_google_protobuf/src/google/protobuf/io/gzip_stream.cc:15:
  bazel-out/k8-opt-exec-ST-a828a81199fe/bin/external/com_google_protobuf/src/google/protobuf/io/_virtual_includes/gzip_stream/google/protobuf/io/gzip_stream.h:26:10: fatal error: zlib.h: No such file or directory
     26 | #include <zlib.h>
        |          ^~~~~~~~
  compilation terminated.

boegel avatar Jun 18 '25 20:06 boegel

Test report by @boegel SUCCESS Build succeeded for 1 out of 1 (1 easyconfigs in total) node3515.doduo.os - Linux RHEL 9.4, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.9.18 See https://gist.github.com/boegel/fa6124e13a5e1ab155eef698b05fec1a for a full test report.

boegel avatar Jun 18 '25 20:06 boegel

Test report by @akesandgren SUCCESS Build succeeded for 2 out of 2 (1 easyconfigs in total) b-cn1613.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, Python 3.10.12 See https://gist.github.com/akesandgren/c9bbda768d0fa550708d4d5aba0699a1 for a full test report.

akesandgren avatar Jun 19 '25 05:06 akesandgren

From failed test report:

  In file included from external/com_google_protobuf/src/google/protobuf/io/gzip_stream.cc:15:
  bazel-out/k8-opt-exec-ST-a828a81199fe/bin/external/com_google_protobuf/src/google/protobuf/io/_virtual_includes/gzip_stream/google/protobuf/io/gzip_stream.h:26:10: fatal error: zlib.h: No such file or directory
     26 | #include <zlib.h>
        |          ^~~~~~~~
  compilation terminated.

No idea what's going on here... --copt=-I$EBROOTZLIB/include is being passed via TENSORSTORE_BAZEL_BUILD_OPTIONS, but that doesn't seem to be sufficient to make it pick up the zlib.h provided by the zlib dependency?!

boegel avatar Jun 19 '25 06:06 boegel

There never was a failing test report from the bot in the original PR for this easyconfig, so the problem is not new:

  • #22476

boegel avatar Jun 19 '25 06:06 boegel

I wonder if the renaming of net_zlib to zlib (we use the former in TENSORSTORE_SYSTEM_LIBS) has something to do with this... This was only done in tensorstore v0.1.75 though (see https://github.com/google/tensorstore/commit/2a3e7864d767ba702849cc0689ff2584b5c10379), so surely it doesn't affect previous versions... Right?

edit: renaming net_zlib to zlib doesn't help, leads to:

ERROR: no such package '@@net_zlib//': java.io.IOException: Error downloading ..

boegel avatar Jun 19 '25 06:06 boegel

Test report by @jfgrimm SUCCESS Build succeeded for 1 out of 1 (1 easyconfigs in total) node106.viking2.yor.alces.network - Linux Rocky Linux 8.9, x86_64, AMD EPYC 7643 48-Core Processor, Python 3.6.8 See https://gist.github.com/jfgrimm/9b1552150acdc221bb0b6896b077040d for a full test report.

jfgrimm avatar Jun 19 '25 09:06 jfgrimm

In the TF easyblock, --subcommands --verbose_failures are passed to Bazel to show the commands being executed/failed, which might help with diagnosing the issue.

In the TF easyblock we also use --action_env=CPATH=$EBROOTFOO:$EBROOTBAR and, for Bazel >= 3.7, duplicate that into --host_action_env

Flamefire avatar Jun 20 '25 08:06 Flamefire

@boegelbot please test @ jsc-zen3

boegel avatar Nov 05 '25 14:11 boegel

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=23139 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_23139 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 8655

Test results coming soon (I hope)...

- notification for comment with ID 3491433595 processed

Message to humans: this is just bookkeeping information for me, it is of no use to you (unless you think I have a bug, which I don't).

boegelbot avatar Nov 05 '25 14:11 boegelbot

Test report by @boegelbot FAILED Build succeeded for 0 out of 1 (total: 4 mins 39 secs) (1 easyconfigs in total) jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.6, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.21 See https://gist.github.com/boegelbot/b8e9b8ad39b7ca3af087a782238bd576 for a full test report.

boegelbot avatar Nov 05 '25 14:11 boegelbot

--copt=-I$EBROOTZLIB/include is being passed via TENSORSTORE_BAZEL_BUILD_OPTIONS, but that doesn't seem to be sufficient to make it pick up the zlib.h provided by the zlib dependency?!

Depends on the environment used by Bazel

In the TF easyblock we also use --action_env=CPATH=$EBROOTFOO:$EBROOTBAR and, for Bazel >= 3.7, duplicate that into --host_action_env

As Bazel 7 is used here, it might require --host_copt if the failure occurs in the host/exec configuration.
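A minimal sketch of that idea, assuming the include directory has to reach both the target and the host/exec configuration. The helper name `bazel_include_options` is hypothetical and not part of any easyblock, and whether these options are sufficient for tensorstore is untested here:

```python
import os


def bazel_include_options(include_dirs):
    """Build Bazel options that expose include_dirs to target and host/exec actions.

    Hypothetical helper for illustration only.
    """
    opts = []
    for inc in include_dirs:
        # --copt affects the target configuration, --host_copt the host/exec one
        opts.append("--copt=-I%s" % inc)
        opts.append("--host_copt=-I%s" % inc)
    cpath = os.pathsep.join(include_dirs)
    if cpath:
        # also pass CPATH into the action environment of both configurations
        opts.append("--action_env=CPATH=%s" % cpath)
        opts.append("--host_action_env=CPATH=%s" % cpath)
    return opts


if __name__ == "__main__":
    # the fallback path is a placeholder, used only when $EBROOTZLIB is not set
    zlib_root = os.environ.get("EBROOTZLIB", "/path/to/zlib")
    print(" ".join(bazel_include_options([os.path.join(zlib_root, "include")])))
```

The resulting string could be appended to what is passed via TENSORSTORE_BAZEL_BUILD_OPTIONS. Note that this sketch puts include directories in CPATH, which is an assumption about what the compiler needs to find zlib.h.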

Flamefire avatar Nov 05 '25 14:11 Flamefire

@boegelbot please test @ jsc-zen3

boegel avatar Dec 16 '25 06:12 boegel

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=23139 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_23139 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9192

Test results coming soon (I hope)...

- notification for comment with ID 3659090380 processed

Message to humans: this is just bookkeeping information for me, it is of no use to you (unless you think I have a bug, which I don't).

boegelbot avatar Dec 16 '25 07:12 boegelbot

Test report by @boegelbot FAILED Build succeeded for 0 out of 1 (total: 3 mins 42 secs) (1 easyconfigs in total) jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.6, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.21 See https://gist.github.com/boegelbot/e1d6c1db69da71d75411d4f91e7af53d for a full test report.

boegelbot avatar Dec 16 '25 07:12 boegelbot

Test report by @Flamefire SUCCESS Build succeeded for 1 out of 1 (total: 5 mins 4 secs) (1 easyconfigs in total) c144 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21 See https://gist.github.com/Flamefire/6f77b4d24bf8501bcc26db8b14dbbe5f for a full test report.

Flamefire avatar Dec 16 '25 13:12 Flamefire

Test report by @Flamefire SUCCESS Build succeeded for 6 out of 6 (total: 8 mins 32 secs) (1 easyconfigs in total) i7014 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7702 64-Core Processor (zen2), Python 3.9.21 See https://gist.github.com/Flamefire/8f4498e3db5159c7c7618c6c3e396c8f for a full test report.

Flamefire avatar Dec 16 '25 13:12 Flamefire

It would be good to see the failing command in the error (again) as it is missing in the log. What do you think about https://github.com/easybuilders/easybuild-framework/pull/5074 ?

But comparing them with the failing GCC invocations, they are literally identical. Running it with -E instead of -c reveals that it is including /usr/include/zlib.h here.
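For reference, a rough sketch of that check: turn the compile command into a preprocess-only one, then read the GCC linemarkers (`# <line> "<file>" <flags>`) to see which zlib.h was actually picked up. Both function names are hypothetical, and the sketch assumes the output file is given as a separate `-o <file>` argument:

```python
import re

# GCC preprocessor linemarker: # <linenum> "<filename>" [flags]
LINEMARKER = re.compile(r'^# \d+ "([^"]+)"')


def to_preprocess_only(cmd):
    """Replace -c with -E and drop '-o <file>' so the output goes to stdout."""
    out = []
    skip_next = False
    for arg in cmd:
        if skip_next:
            skip_next = False
            continue
        if arg == "-o":
            skip_next = True
            continue
        out.append("-E" if arg == "-c" else arg)
    return out


def included_paths(preprocessed, header):
    """Return distinct paths whose file name equals header, in first-seen order."""
    seen = []
    for line in preprocessed.splitlines():
        match = LINEMARKER.match(line)
        if match:
            path = match.group(1)
            if path.rsplit("/", 1)[-1] == header and path not in seen:
                seen.append(path)
    return seen
```

The preprocessed text can be obtained with something like `subprocess.run(to_preprocess_only(cmd), capture_output=True, text=True).stdout`, where `cmd` is the argument list of the GCC invocation; `included_paths(text, "zlib.h")` then lists the zlib.h files that were included.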

Flamefire avatar Dec 16 '25 15:12 Flamefire

https://github.com/easybuilders/easybuild-easyconfigs/pull/24896 contains both fixes.

https://github.com/boegel/easybuild-easyconfigs/pull/100 would merge it to your branch if you want to keep it in this PR

Flamefire avatar Dec 16 '25 16:12 Flamefire