[BUG] Can't use Ubuntu Buildkite images, seems like only Alpine is supported
Describe the bug
The container we need to run our CI tests in must be a Debian- or Ubuntu-based image, because we have binary requirements for which Alpine is not well suited due to its use of musl. Thus, we build a base CI Docker image containing our requirements. In trying to understand the documentation/architecture, it appears that the copy-agent init container must be Alpine (i.e. it must come from https://github.com/buildkite/agent/tree/main/packaging/docker/alpine, and https://github.com/buildkite/agent/tree/main/packaging/docker/ubuntu-24.04 isn't supported). In the pod spec I can see that copy-agent looks something like the following:
```
Init Containers:
  copy-agent:
    Image:      ghcr.io/buildkite/agent@sha256:7e813c353bd315af56165c84837d221704ede175c9e0f715260df33ceb040231
    Port:       <none>
    Host Port:  <none>
    Command:
      ash
    Args:
      -cefx
      cp /usr/local/bin/buildkite-agent /sbin/tini-static /workspace
    Environment:  <none>
    Mounts:
      /workspace from workspace (rw)
```
Note the use of ash, and that it copies the tini-static and buildkite-agent binaries into the shared volume. This all appears to imply that the only acceptable Buildkite Docker image is an Alpine one.
As an example, if I make the default Buildkite image an Ubuntu one with something like the following, I get an error:
```yaml
# values.yml
agentToken: <SNIP>
graphqlToken: <SNIP>
config:
  image: ghcr.io/buildkite/agent:3.97.0-ubuntu-24.04
  org: <SNIP>
  cluster-uuid: <SNIP>
  tags:
    - queue=kubernetes
  pod-spec-patch:
    containers:
      - name: checkout
        envFrom:
          - secretRef:
              name: github-ssh-authentication-key
```
The error will be:
The following init containers failed:

```
CONTAINER   EXIT CODE  SIGNAL  REASON      MESSAGE
copy-agent  128        0       StartError  failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "ash": executable file not found in $PATH: unknown
```
I'm wondering if I have a fundamental misunderstanding of how this system works and how I can bring my own Ubuntu-based base CI container. As I understand it, the base CI container shouldn't need the agent inside, since it is copied in along with tini, so the base CI image just needs the binaries and packages required for the command (or plugins) to run (e.g. the AWS CLI).
To Reproduce
Steps to reproduce the behavior:
- Deploy with the values.yml configuration defined above:

```
helm upgrade --install agent-stack-k8s oci://ghcr.io/buildkite/helm/agent-stack-k8s \
  --namespace buildkite \
  --create-namespace \
  --values values.yml
```

- Target the queue with a pipeline containing a simple command step (e.g. curl).
- You will get an error like:

```
failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "ash": executable file not found in $PATH: unknown
```
Expected behavior
I'd expect that any base Docker image offered by Buildkite (https://github.com/buildkite/agent/tree/main/packaging/docker) would be usable, and that the container actually running the job could be an Ubuntu- or Debian-based container.
Environment
- agent-stack-k8s version: 0.27.0
- Kubernetes version: 1.32
- Deployment method: Helm chart (see above)
Logs
The following init containers failed:

```
CONTAINER   EXIT CODE  SIGNAL  REASON      MESSAGE
copy-agent  128        0       StartError  failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "ash": executable file not found in $PATH: unknown
```
Additional context
I think I figured out how it works. There are three containers and one init container:

- `container-0` - It looks like this can be overridden with any container that has a POSIX shell. For example, I tested the Debian Slim Python container below.
- `agent` - This appears to be the actual agent container that acquires the job.
- `checkout` - This appears to be the container that does the code checkout into the volume.
- `copy-agent` (init container) - This appears to be the container from which `buildkite-agent` and `tini-static` are copied. They both appear to be relatively portable and don't actually require Alpine.

Thus, armed with this knowledge, I think this is the correct way to bring your own CI container:
```yaml
steps:
  - label: "Check Python"
    key: "check-python"
    priority: 0
    agents:
      queue: "kubernetes"
    plugins:
      - kubernetes:
          podSpecPatch:
            containers:
              - name: container-0
                image: python:3.12.7-slim
    commands:
      - "python --version"
      - "pip --version"
```
I am pretty sure that https://github.com/buildkite/agent-stack-k8s/issues/583#issuecomment-2851652548 is how it is intended to work. We have decided to go with the podSpecPatch approach for container-0 for our custom CI images. This wasn't super clear from the documentation, but it does work as we would like. We have CI images for Python, Node, Go, etc., and this allows us to run the correct CI image, with its dependencies, on a per-step basis.
+1
> I am pretty sure that https://github.com/buildkite/agent-stack-k8s/issues/583#issuecomment-2851652548 is how it is intended to work. We have decided to go with the podSpecPatch approach for container-0 for our custom CI images. This wasn't super clear from the documentation, but it does work as we would like.
@marc-barry I happened to go through this same issue/process recently, and after raising it with Buildkite support they suggested this exact approach: use podSpecPatch to keep their "main" Alpine image for the "reserved" containers (they mentioned checkout, agent, and copy-agent) and use our custom images otherwise.
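For illustration, I believe the values.yml shape they described looks something like this (an untested sketch; the custom image name is a placeholder, and it assumes the pod-spec-patch strategic merge matches containers and init containers by name):

```yaml
# values.yml (sketch)
config:
  image: my-registry/ubuntu-ci:latest         # hypothetical custom default image
  pod-spec-patch:
    containers:
      - name: agent
        image: ghcr.io/buildkite/agent:3.97.0 # Alpine-based image that provides ash
      - name: checkout
        image: ghcr.io/buildkite/agent:3.97.0
    initContainers:
      - name: copy-agent
        image: ghcr.io/buildkite/agent:3.97.0
```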
Though I feel this issue is still worth consideration: either all images should work, or we shouldn't be able to (easily/implicitly) set the images for these reserved containers, or at least there should be some explicit documentation around this.
@tcsullens thanks for validating my findings as well. I completely went in the wrong direction at first and created a bit of a mess. I like the concept of the "reserved" containers, which must use a specific image due to fundamental requirements that these reserved containers (or the commands run within them) have.
Thus, as it stands, I "think" the reserved containers are:

- `checkout`
- `agent`
- `copy-agent`
And the only customizable one is `container-0` (at this time). I do think that https://github.com/buildkite/agent-stack-k8s/blob/main/docs/architecture.md might be trying to convey this. Although they reference `container-N` user-specified containers, I only have experience with `container-0`. I'm not really sure what other containers with an integral N > 0 would be used for or how to use them.
👋 Hello @marc-barry @tcsullens
As you have discovered, changing config.image in the controller's configuration will change the image used by all containers, including the "reserved" containers used to orchestrate Buildkite jobs into Kubernetes Jobs:
- `copy-agent`
- `imagecheck-*`
- `agent`
- `checkout`
The `copy-agent` and `checkout` containers have their default command and args set to `ash -cefx` and `ash -c`, respectively. Additionally, the user-defined command containers have their `BUILDKITE_SHELL` env var set to `/bin/sh -ec`. This is the reasoning behind the requirement for a POSIX shell at `/bin/sh` in any custom images.

Providing an env var override for `BUILDKITE_SHELL` is how one might customize this for any user-defined command images that lack a shell at that path. Scoping changes to the `container-0` container's image via `pod-spec-patch` or `podSpecPatch` is the recommended approach to run a different container image for the Buildkite job commands.
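For example, a minimal sketch of that `BUILDKITE_SHELL` override (the image name and shell path here are hypothetical placeholders, assuming an image that ships bash but not `/bin/sh`):

```yaml
steps:
  - label: "Custom image without /bin/sh"
    agents:
      queue: "kubernetes"
    plugins:
      - kubernetes:
          podSpecPatch:
            containers:
              - name: container-0
                image: my-registry/custom-ci:latest  # hypothetical image lacking /bin/sh
                env:
                  - name: BUILDKITE_SHELL
                    value: "/bin/bash -ec"           # point the agent at a shell the image does have
    commands:
      - "echo hello"
```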
Referring to the comment/question...
> I'm not really sure what other containers with an integral N > 0 would be used for or how to use them.
If a full PodSpec is defined with multiple unnamed containers, each of these containers will be numbered as container-0, container-1, etc. when the controller processes the podSpec defined in the kubernetes plugin.
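For example, a sketch with two unnamed containers (the image tags are just illustrative):

```yaml
steps:
  - label: "Two command containers"
    agents:
      queue: "kubernetes"
    plugins:
      - kubernetes:
          podSpec:
            containers:
              - image: python:3.12.7-slim   # processed as container-0
                command: ["python --version"]
              - image: node:22-slim         # processed as container-1
                command: ["node --version"]
```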
@petetomasik thanks for the clarification. I think we can go ahead and close this issue. It would be helpful to add this to the documentation at some point.
@marc-barry FYI, with the release of the v0.30 series, the best way to use a custom image is:
```yaml
steps:
  - label: "Check Python"
    key: "check-python"
    priority: 0
    agents:
      queue: "kubernetes"
    image: python:3.12.7-slim
    commands:
      - "python --version"
      - "pip --version"
```
Doc. I hope it helps 😄.