actions-runner-controller
Include status message in error when a workflow pod fails to start
What would you like added?
When a containerized job pod using containerMode: kubernetes fails to start, all we get is an error like this:
Error: Pod failed to come online with error: Error: Pod <runner pod>-workflow is unhealthy with phase status Failed
It would be extremely helpful if this also reported the message field of PodStatus, which often explains why the pod failed to start, e.g.:
status:
  message: 'Pod was rejected: Predicate NodePorts failed'
  phase: Failed
  reason: NodePorts
  startTime: "2023-10-16T23:59:11Z"
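As a sketch of what the richer error could look like, the snippet below formats an error that appends `status.message` and `status.reason` when they are set. The struct and function names here are hypothetical stand-ins (in the real hook these fields would come from `k8s.io/api/core/v1.PodStatus`); this is only an illustration of the requested output, not the project's actual code.

```go
package main

import "fmt"

// PodStatus is a minimal stand-in for the corev1.PodStatus fields of
// interest; the real values would come from k8s.io/api/core/v1.
type PodStatus struct {
	Phase   string
	Message string
	Reason  string
}

// formatPodError builds the unhealthy-pod error, appending status.reason
// and status.message when they are non-empty.
func formatPodError(podName string, status PodStatus) string {
	msg := fmt.Sprintf("Pod %s is unhealthy with phase status %s", podName, status.Phase)
	if status.Reason != "" {
		msg += fmt.Sprintf(" (reason: %s)", status.Reason)
	}
	if status.Message != "" {
		msg += fmt.Sprintf(": %s", status.Message)
	}
	return msg
}

func main() {
	status := PodStatus{
		Phase:   "Failed",
		Message: "Pod was rejected: Predicate NodePorts failed",
		Reason:  "NodePorts",
	}
	fmt.Println(formatPodError("my-runner-workflow", status))
}
```

With the example status above, this prints the phase, reason, and message together, which would have made the root cause obvious from the runner log alone.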
Why is this needed?
Without this, debugging a workflow pod that fails to start is very painful, because the pod is automatically removed, making manual inspection practically impossible.
I had to run kubectl -n <runner namespace> get pods -o yaml -w and scan the output for the failing pods to find the actual reason they were failing.
kubectl -n <runner namespace> get pods -w would theoretically have been sufficient, but it only showed the pods transitioning through the statuses ContainerCreating -> NodePorts -> Terminating, and Google was very unhelpful when I tried to figure out what the NodePorts status meant.
Either way, that step would not be necessary if the runner just logged the message.
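For anyone hitting the same problem before this is implemented, one workaround sketch is to ask kubectl to surface those fields directly via custom columns (the namespace placeholder below must be filled in; this needs a live cluster to run):

```shell
# Watch pods and show status.reason / status.message alongside the phase,
# so the rejection reason is visible even for short-lived workflow pods.
kubectl -n <runner namespace> get pods -w \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,REASON:.status.reason,MESSAGE:.status.message
```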
Additional context
In this case, it happened because the job in question was originally written for the GitHub-hosted runners, using both a matrix and a service container with an exposed port (which, I didn't realize, is unnecessary when switching to a containerized job), e.g.:
test-with-redis:
  runs-on: arc-scaleset-name
  timeout-minutes: 10
  strategy:
    matrix:
      foo: [3, 6, 10]
  # Our self-hosted runners use Kubernetes mode, which requires that the job use a container if it needs services.
  container: ubuntu:latest
  services:
    redis:
      image: redis
      # This is not necessary with a containerized job but I didn't realize that.
      ports:
        - 6379:6379
  steps:
    - # ...
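For reference, the fix in this particular case was simply to drop the ports mapping, since a containerized job does not need a host port exposed to reach its services (and an exposed host port is exactly what collides across concurrent matrix jobs on one node). A sketch of the corrected services block for this example job:

```yaml
services:
  redis:
    image: redis
    # No ports mapping: the containerized job reaches the service without an
    # exposed host port, so nothing can conflict between matrix jobs.
```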
Thus, when multiple jobs from the matrix ran concurrently on the same node, they conflicted over the exposed host port, which surfaces as a Predicate NodePorts failed error (which immediately gave a useful result when Googled).
This is clearly very easy to run into when converting a job from a GitHub-hosted runner, which does not need to be containerized to use services, to a self-hosted runner using containerMode: kubernetes.