
Cannot run tests in CI

Open aaronmondal opened this issue 2 years ago • 4 comments

Attempts to run the tests in CI via remote execution currently don't work because Bazel doesn't like running in a nix-built container. `bazel build` and `bazel run` work, but `bazel test` doesn't, most likely due to https://github.com/bazelbuild/bazel/issues/12579.

Technically, passing builds alone already give decent coverage, but many issues arise from dynamic linking behavior and only surface at runtime. So at the moment we'd either have to run all examples manually without the `ll_test` wrappers, or only run a `bazel build` of the C++ targets without executing anything.

Another option would be to build a custom Bazel that we distribute as part of rules_ll. Building Bazel against an LLVM toolchain and statically linking libc++ could keep things portable between CI and regular usage, but it might cause issues for non-nix workflows.

@JannisFengler @SpamDoodler @jaroeichler What do you think? Statically linking Bazel with libc++ would add a few MB to all images, caches, the devenv etc., because we'd have duplicate libc++ functions in every subbinary, and we'd have to think about infrastructure for keeping up with the upstream Bazel sources. It would make remote execution easier to get working, though. Do we want to go down that path, or should we try to find another solution?

aaronmondal avatar Mar 24 '23 16:03 aaronmondal

Personally, I think we actually want to have a custom-built Bazel. This would give us even better control over the whole build environment. Concerning the issues you raised, I think it's OK to prioritize the nix workflow for now, since I don't see a huge drawback in using nix. The few MB for libc++ should be no problem either, since storage is quite inexpensive. I think it's in the spirit of rules_ll to provide the most advanced toolchain possible, so I'm happy to prioritize remote execution over those minor inconveniences.

SpamDoodler avatar Mar 25 '23 04:03 SpamDoodler

Statically linking libc++ is fine; those few MB are negligible in comparison to the cache and nix environment size. We should keep dynamic linking in mind if image size becomes an issue in the future.

Yes, we should aim for a custom-built Bazel; this aligns well with the rest of the rules_ll project. Could you go into more detail on how you want to handle the patching and building of Bazel?

jaroeichler avatar Mar 25 '23 11:03 jaroeichler

@jaroeichler I initially tried just patching the RPATHs with patchelf, but then Bazel refuses to start. Probably for security reasons.
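For reference, the patchelf attempt probably looked something like this (the store path is a hypothetical placeholder, not the actual one used):

```shell
# Hypothetical: rewrite the Bazel binary's RPATH to point at a
# nix-provided libc++ instead of the missing system libstdc++.
patchelf --set-rpath /nix/store/<hash>-libcxx/lib ./bazel

# Bazel then refuses to run -- plausibly because it verifies the
# integrity of its own binary/install base, and patching the ELF
# headers changes its checksum.
```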

My current plan is:

  • Write a nix package that builds Bazel via the non-upstream LLVM toolchain from nixpkgs and statically links libc++ into it.
  • Distribute that Bazel in a way that is compatible with Bazelisk's custom release mechanism outlined here.
  • Fetch the binary in our remote execution images via bazelisk.

If things work as I intend, we'd end up with remote execution images that no longer require libstdc++ or any gcc-toolchain parts. If we can reference these custom Bazel binaries in .bazelversion, this approach should also be portable to non-nix users, as long as the LLVM toolchain parts are statically linked.
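A rough sketch of how the three steps above could look end to end, assuming we publish the binary under a hypothetical GitHub fork and release tag (names are placeholders; Bazelisk resolves a `<fork>/<version>` string in `.bazelversion` against that fork's GitHub releases):

```shell
# 1. Build the custom Bazel (hypothetical flake output name).
nix build .#bazel-static

# 2. Publish it as a GitHub release asset on our fork (hypothetical tag).
gh release upload custom-6.1.0 result/bin/bazel

# 3. Point consumers at it; Bazelisk fetches from the fork's releases.
echo "eomii/custom-6.1.0" > .bazelversion
```

Non-nix users would then only need Bazelisk, which downloads the prebuilt static binary instead of relying on a system toolchain.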

As an interesting side note, we could also try statically linking musl into that release to create a fat binary that is independent of the host's glibc version. But let's leave this for later when things actually work 😅

aaronmondal avatar Mar 25 '23 14:03 aaronmondal

OK, remote execution works, so we could run tests in CI. But that might be really expensive. A single build with near-perfect cache reuse (which we basically always have) still needs ~2GB of artifacts to operate (which makes sense: building a single target requires the tools from the `ll_toolchain`, which is roughly that size). At 1 commit per day this is ~60GB per month just for the main branch. This does not include any PR testing etc., and it's a minimum value: updating LLVM alone requires a full cache rebuild, and a few revisions of that might be many times larger.
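The arithmetic above, spelled out (these are the estimates from this thread, not measured values):

```shell
# Back-of-the-envelope cache growth for the main branch alone.
gb_per_commit=2        # ~2 GB of artifacts per build, even with cache reuse
commits_per_month=30   # ~1 commit per day

echo "$((gb_per_commit * commits_per_month)) GB/month"  # → 60 GB/month
```

PR builds and cache-busting events like LLVM updates come on top of this baseline.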

We probably still need only a fraction of the resources that others would need for a similar setup, but it's still a big setup. We might be better off hosting our own remote execution cluster.

  • #91

aaronmondal avatar Apr 17 '23 01:04 aaronmondal