[BUG] Can't use Ubuntu Buildkite images, seems like only Alpine is supported
Describe the bug
The container we need to run our CI tests in must be a Debian- or Ubuntu-based image, because we have binary requirements for which Alpine is not well suited due to its use of musl. Thus, we build a base CI Docker image containing our requirements. In trying to understand the documentation/architecture, it appears that the copy-agent init container must be Alpine (i.e. it must come from https://github.com/buildkite/agent/tree/main/packaging/docker/alpine, and https://github.com/buildkite/agent/tree/main/packaging/docker/ubuntu-24.04 isn't supported). In the pod spec I can see that copy-agent looks something like the following:
```
Init Containers:
  copy-agent:
    Image:      ghcr.io/buildkite/agent@sha256:7e813c353bd315af56165c84837d221704ede175c9e0f715260df33ceb040231
    Port:       <none>
    Host Port:  <none>
    Command:
      ash
    Args:
      -cefx
      cp /usr/local/bin/buildkite-agent /sbin/tini-static /workspace
    Environment:  <none>
    Mounts:
      /workspace from workspace (rw)
```
Note the use of ash, and that it copies the tini-static and buildkite-agent binaries into the shared volume. This all appears to imply that the only acceptable Buildkite Docker image is an Alpine one.
As an example, if I make the default Buildkite image an Ubuntu one with something like the following, I get an error:
```yaml
# values.yml
agentToken: <SNIP>
graphqlToken: <SNIP>
config:
  image: ghcr.io/buildkite/agent:3.97.0-ubuntu-24.04
  org: <SNIP>
  cluster-uuid: <SNIP>
  tags:
    - queue=kubernetes
  pod-spec-patch:
    containers:
      - name: checkout
        envFrom:
          - secretRef:
              name: github-ssh-authentication-key
```
The error will be:
The following init containers failed:

```
CONTAINER   EXIT CODE  SIGNAL  REASON      MESSAGE
copy-agent  128        0       StartError  failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "ash": executable file not found in $PATH: unknown
```
I'm wondering if I have a fundamental misunderstanding of how this system works and how I can bring my own Ubuntu-based base CI container. As I understand it, the base CI container shouldn't need the agent inside, since it is copied in along with tini, so the base CI image just needs the binaries and packages required for the command (or plugins) to run (e.g. the AWS CLI).
To Reproduce
Steps to reproduce the behavior:
- Deploy with the values.yml configuration defined above:

```
helm upgrade --install agent-stack-k8s oci://ghcr.io/buildkite/helm/agent-stack-k8s \
  --namespace buildkite \
  --create-namespace \
  --values values.yml
```

- Target the queue with a pipeline containing a simple command step (e.g. curl).
- You will get an error like:

```
failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "ash": executable file not found in $PATH: unknown
```
Expected behavior
I'd expect that any base Docker image offered by Buildkite (https://github.com/buildkite/agent/tree/main/packaging/docker) would be usable, and that the container actually running the job could be an Ubuntu- or Debian-based container.
Environment
- agent-stack-k8s version: 0.27.0
- Kubernetes version: 1.32
- Deployment method: Helm chart (see above)
Logs
The following init containers failed:

```
CONTAINER   EXIT CODE  SIGNAL  REASON      MESSAGE
copy-agent  128        0       StartError  failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "ash": executable file not found in $PATH: unknown
```
Additional context
I think I figured out how it works. There are three containers and one init container:

- `container-0` - It looks like this can be overridden with any container that has a POSIX shell. For example, I tested the Debian Slim Python container below.
- `agent` - This appears to be the actual agent container that acquires the job.
- `checkout` - This appears to be the container that does the code checkout into the volume.
- `copy-agent` (init container) - This appears to be the container from which `buildkite-agent` and `tini-static` are copied. They both appear to be relatively portable and don't actually require Alpine.

Thus, armed with this knowledge, I think this is the correct way to bring your own CI container:
```yaml
steps:
  - label: "Check Python"
    key: "check-python"
    priority: 0
    agents:
      queue: "kubernetes"
    plugins:
      - kubernetes:
          podSpecPatch:
            containers:
              - name: container-0
                image: python:3.12.7-slim
    commands:
      - "python --version"
      - "pip --version"
```
I am pretty sure that https://github.com/buildkite/agent-stack-k8s/issues/583#issuecomment-2851652548 is how it is intended to work. We have decided to go with the podSpecPatch approach for container-0 for our custom CI images. This wasn't super clear from the documentation, but it does work as we would like. We have CI images for Python, Node, Go, etc., and this allows us to run the correct CI image, with its dependencies, on a per-step basis.
+1
> I am pretty sure that https://github.com/buildkite/agent-stack-k8s/issues/583#issuecomment-2851652548 is how it is intended to work. We have decided to go with the podSpecPatch approach for container-0 for our custom CI images. This wasn't super clear from the documentation, but it does work as we would like.
@marc-barry I happened to go through this same issue/process recently, and after raising it with Buildkite support they suggested this exact approach: use podSpecPatch to keep their "main" Alpine image for the "reserved" containers (they mentioned checkout, agent, and copy-agent) and use our custom images otherwise.
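For illustration, I believe the values.yml shape they described looks something like this (an untested sketch; the custom image name is a placeholder, and it assumes the pod-spec-patch strategic merge matches containers and init containers by name):

```yaml
# values.yml (sketch)
config:
  image: my-registry/ubuntu-ci:latest         # hypothetical custom default image
  pod-spec-patch:
    containers:
      - name: agent
        image: ghcr.io/buildkite/agent:3.97.0 # Alpine-based image that provides ash
      - name: checkout
        image: ghcr.io/buildkite/agent:3.97.0
    initContainers:
      - name: copy-agent
        image: ghcr.io/buildkite/agent:3.97.0
```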
Though I feel this issue is still worth consideration: either all images should work, or we shouldn't be able to (easily/implicitly) set the images for these reserved containers, or at least there should be some explicit documentation around this.
@tcsullens thanks for validating my findings as well. I completely went in the wrong direction at first and created a bit of a mess. I like the concept of the "reserved" containers, which must use a specific image due to fundamental requirements that these reserved containers (or the commands run within them) have.
Thus, as it stands, I "think" the reserved containers are:

- `checkout`
- `agent`
- `copy-agent`
And the only customizable one is `container-0` (at this time). I do think that https://github.com/buildkite/agent-stack-k8s/blob/main/docs/architecture.md might be trying to convey this. Although they reference `container-N` user-specified containers, I only have experience with `container-0`. I'm not really sure what other containers with an integral N > 0 would be used for or how to use them.
👋 Hello @marc-barry @tcsullens
As you have discovered, changing config.image in the controller's configuration will change the image used by all containers, including the "reserved" containers used to orchestrate Buildkite jobs into Kubernetes Jobs:
- `copy-agent`
- `imagecheck-*`
- `agent`
- `checkout`
The `copy-agent` and `checkout` containers have their default command and args set to `ash -cefx` and `ash -c`, respectively. Additionally, the user-defined command containers have their `BUILDKITE_SHELL` env var set to `/bin/sh -ec`. This is the reasoning behind the requirement for a POSIX shell at `/bin/sh` in any custom images.

Providing an env var override for `BUILDKITE_SHELL` is how one might customize this for any user-defined command images that lack a shell at that path. Scoping changes to the `container-0` container's image via `pod-spec-patch` or `podSpecPatch` is the recommended approach to run a different container image for the Buildkite job commands.
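For example, a minimal sketch of that `BUILDKITE_SHELL` override (the image name and shell path here are hypothetical placeholders, assuming an image that ships bash but not `/bin/sh`):

```yaml
steps:
  - label: "Custom image without /bin/sh"
    agents:
      queue: "kubernetes"
    plugins:
      - kubernetes:
          podSpecPatch:
            containers:
              - name: container-0
                image: my-registry/custom-ci:latest  # hypothetical image lacking /bin/sh
                env:
                  - name: BUILDKITE_SHELL
                    value: "/bin/bash -ec"           # point the agent at a shell the image does have
    commands:
      - "echo hello"
```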
Referring to the comment/question...
> I'm not really sure what other containers with an integral N > 0 would be used for or how to use them.
If a full PodSpec is defined with multiple unnamed containers, each of these containers will be numbered as container-0, container-1, etc. when the controller processes the podSpec defined in the kubernetes plugin.
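For example, a sketch with two unnamed containers (the image tags are just illustrative):

```yaml
steps:
  - label: "Two command containers"
    agents:
      queue: "kubernetes"
    plugins:
      - kubernetes:
          podSpec:
            containers:
              - image: python:3.12.7-slim   # processed as container-0
                command: ["python --version"]
              - image: node:22-slim         # processed as container-1
                command: ["node --version"]
```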
@petetomasik thanks for the clarification. I think we can go ahead and close this issue. It would be helpful to add this to the documentation at some point.
@marc-barry FYI, with the release of the v0.30 series, the best way to use a custom image is:
```yaml
steps:
  - label: "Check Python"
    key: "check-python"
    priority: 0
    agents:
      queue: "kubernetes"
    image: python:3.12.7-slim
    commands:
      - "python --version"
      - "pip --version"
```
Doc. I hope it helps 😄.