Better structure for our CI

Open jprochazk opened this issue 1 year ago • 2 comments

I've been very casually looking into how big projects do CI, and I think what rust-lang/rust does is really interesting.

TL;DR: We could significantly reduce the pain involved in working on our CI by adopting a similar approach.

The rust-lang/rust approach consists of:

For us, this could remove some major pain points of working with CI, without straying too far from GHA or requiring that we all learn Bazel instead. :sweat_smile:

  • Jobs could be far more structured, consistent, and easy to define than they currently are, using a little-known YAML feature called merge keys (https://yaml.org/type/merge.html) to deduplicate job definitions
  • For consistency, each job definition only includes a docker image, a script to run, and some environment variables (which by default are completely empty!!)
  • Job runtime dependencies are implicit in the docker image they use, and they are automatically cached inside the docker image. We could still separately use sccache and/or manually write to GH cache/artifact store to store built artifacts
  • Every job is now always runnable locally
  • Every job is completely isolated from whatever the runner happens to have installed on it, meaning we could swap to any service that provides faster/cheaper GHA runners with zero worry that something could break
  • We could share code between jobs by putting that shared code in a script, and between job definitions by merging in the same key. No more reusable_thing.yml!
  • We'd still get to take advantage of GHA's runner parallelism with a dynamic job matrix
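To make the shape of this concrete, here is a rough Python sketch of the idea, not what rust-lang/rust actually ships. Each job is just an image, a script, and an env map; a shared base is merged in shallowly (the same way a YAML merge key behaves, where explicitly set keys win); the job list is serialized into the `{"include": [...]}` JSON shape commonly fed to a GHA `strategy.matrix` via `fromJSON`; and the same definition yields a `docker run` command for local use. All names here (`BASE_JOB`, `define_job`, `gha_matrix`, `docker_command`, the image names, and the script paths) are made up for illustration.

```python
import json
import os
import shlex

# Shared defaults. This plays the role of a YAML anchor that each job
# merges in with `<<:`. Image name and script path are hypothetical.
BASE_JOB = {
    "image": "ci-base:latest",
    "script": "scripts/ci/check.sh",
    "env": {},  # environment is empty unless a job opts in
}


def define_job(name, base=BASE_JOB, **overrides):
    """Shallow-merge overrides onto base, like a YAML merge key.

    Keys set explicitly replace the base value wholesale (no deep merge).
    """
    job = {**base, **overrides, "name": name}
    job["env"] = dict(job["env"])  # copy so jobs never share one dict
    return job


JOBS = [
    define_job("lint"),
    define_job(
        "test-linux",
        script="scripts/ci/test.sh",
        env={"RUST_BACKTRACE": "1"},
    ),
    define_job("docs", image="ci-docs:latest", script="scripts/ci/docs.sh"),
]


def gha_matrix(jobs):
    """Serialize jobs as JSON for a dynamic job matrix."""
    return json.dumps({"include": jobs}, sort_keys=True)


def docker_command(job, repo_dir=None, workdir="/work"):
    """Build the `docker run` command that runs one job locally."""
    repo_dir = repo_dir or os.getcwd()
    cmd = ["docker", "run", "--rm"]
    for key, value in sorted(job["env"].items()):
        cmd += ["-e", f"{key}={value}"]
    cmd += ["-v", f"{repo_dir}:{workdir}", "-w", workdir]
    cmd += [job["image"], job["script"]]
    return shlex.join(cmd)


if __name__ == "__main__":
    print(gha_matrix(JOBS))
    for job in JOBS:
        print(docker_command(job))
```

A setup job could print the matrix JSON as a step output for the downstream jobs to consume, while a developer runs the printed `docker run` line directly. In this sketch both paths read the same job definitions, which is what would keep local and CI runs from drifting apart.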

Best of all, a transition to this approach could be very incremental, and we could reuse a lot of our existing code. All of the bits of bash/python/whatever else we have littered around the workflow files, we could consolidate into actual scripts.

jprochazk avatar Aug 14 '24 08:08 jprochazk

Hi @jprochazk, stopping by here at Nikolaus's suggestion: I agree that what you describe will produce an incrementally better CI workflow. I've lived with 30-50 developers at 3 previous companies that had setups like the one you describe, leveraging the Docker ecosystem and cloud image build systems in various ways (Drone, Kaniko, etc.), all deployed via Kubernetes into all of the cloud providers. It works well enough, but I found it to be a significant maintenance burden, eventually requiring dedicated developers and even teams to keep it alive and keep build times fast enough for developer happiness. This was especially true for Rust builds, where the Docker layer cache really fails to deliver.

At Elodin, some of the developers from the previous team joined me with those lessons learned, and we've now settled on a setup using Buildkite, which is like Drone but "just works", and going all-in on NixOS. We use Nix to build our images and mount them into containers for use in our Kubernetes deploys. While learning Nix is a significant decision, the result is undeniable. We are building 10+ Rust binaries on every commit and have fully deployed dev environments in under 3-4 minutes. It's easily the best setup I've used in 15 years of suffering with various CI setups, and the important part is that the maintenance burden is significantly lower. Working with Nix is a pain shared across the team, but the result is that CI is truly tested by the individual developer, so you don't need an upstream team to keep the pipelines moving.

Enough soapbox from me! Just happy to be helpful if at all useful. Happy to give you access to our repo for a time to take a look at how it all works in our case; otherwise I wish you luck at achieving developer happiness!

x46085 avatar Sep 26 '24 15:09 x46085

Hi @x46085, I'd love to snoop around your setup and see how you use Nix!

I definitely want to switch us over to Nix if/when possible. I have played around with it before, and I agree that despite the learning curve, it's ultimately a huge time saver, and just so nice to work with.

I've brought up the idea of switching to Nix to the team in the past, and the main issue we ran into was the lack of Windows support. We intentionally have people daily driving (at least part-time) all of Linux, Mac, and Windows, to ensure that what we're building actually works on those platforms. There's an ongoing effort to support Nix on Windows via MinGW, and while I'm not too familiar with the status of that right now, it doesn't look like it's quite ready yet.

jprochazk avatar Sep 26 '24 16:09 jprochazk