
Include status message in error when a workflow pod fails to start

Open · abonander opened this issue 2 years ago · 4 comments

What would you like added?

When the workflow pod for a containerized job (containerMode: kubernetes) fails to start, all we get is an error like this:

 Error: Pod failed to come online with error: Error: Pod <runner pod>-workflow is unhealthy with phase status Failed

It would be extremely helpful if this also reported the message field of PodStatus, which often explains why the pod failed to start, e.g.:

status:
  message: 'Pod was rejected: Predicate NodePorts failed'
  phase: Failed
  reason: NodePorts
  startTime: "2023-10-16T23:59:11Z"

Why is this needed?

Without this, debugging a workflow pod that fails to start is very painful: the pod is removed automatically, so manual inspection is practically impossible.

I had to run kubectl -n <runner namespace> get pods -o yaml -w and scan the output for the failing pods to find the actual reason they were failing.

kubectl -n <runner namespace> get pods -w was theoretically sufficient, but it only showed the pods transitioning through the statuses ContainerCreating -> NodePorts -> Terminating, and Googling was very unhelpful for figuring out what the NodePorts status meant.
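For anyone else hitting this, a jsonpath query (run repeatedly while the job is starting, since the pods are cleaned up quickly) pulls out just the name, reason, and message:

  kubectl -n <runner namespace> get pods \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.reason}{"\t"}{.status.message}{"\n"}{end}'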

Either way, that step would not be necessary if the runner just logged the message.

Additional context

In this case, the failure occurred because the job in question was originally written for GitHub-hosted runners, using both a matrix and a service container with an exposed port (which, as I didn't realize at the time, is not necessary for a containerized job), e.g.:

  test-with-redis:
    runs-on: arc-scaleset-name
    timeout-minutes: 10
    strategy:
      matrix:
        foo: [3, 6, 10]
    # Our self-hosted runners use Kubernetes mode, which requires that the job use a container if it needs services.
    container: ubuntu:latest
    services:
      redis:
        image: redis
        # This is not necessary with a containerized job but I didn't realize that.
        ports:
          - 6379:6379
    steps:
        - # ...  

Thus, when multiple matrix jobs ran concurrently on the same node, their pods requested the same host port and conflicted, which surfaces as a Predicate NodePorts failed error (a message that immediately gave a useful result when Googled).
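For reference, the fix is simply to drop the ports mapping; in a containerized job the service container is reachable by its hostname on the container port, so the services block above becomes:

    services:
      redis:
        image: redis
        # No ports mapping needed: from the job container, Redis is
        # reachable at redis:6379 over the shared network.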

This is clearly very easy to run into when converting a job from a GitHub-hosted runner, which does not need to be containerized to use services, to a self-hosted runner using containerMode: kubernetes.

abonander · Oct 17 '23 00:10