
Misconfigured TaskRun pods cause PipelineRuns to silently time out

Open aThorp96 opened this issue 1 month ago • 4 comments

Expected Behavior

A TaskRun whose Pod cannot start should either fail early or, when it times out, surface a helpful message in the TaskRun's failed "Succeeded" condition indicating that the Pod could not start. Something like: `TaskRun "configmap-4qrgb" failed to finish within "1m0s": pod status "PodReadyToStartContainers": "False"`

Actual Behavior

The failed "Succeeded" condition does not indicate that the Pod could not start; it only says: `TaskRun "my-taskrun" failed to finish within "$timeout"`.

Steps to Reproduce the Problem

  1. Create the following TaskRun:
---
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  generateName: missing-configmap-
spec:
  timeout: 1m
  taskSpec:
    steps:
    - name: secret
      image: mirror.gcr.io/ubuntu
      script: |
        #!/usr/bin/env bash
        [[ "$(cat /config/test.data)" == "$TEST_DATA" ]]
      env:
      - name: TEST_DATA
        valueFrom:
          configMapKeyRef:
            name: config-for-testing-configmaps
            key: test.data
      volumeMounts:
      - name: config-volume
        mountPath: /config
    volumes:
    - name: config-volume
      configMap:
        name: config-for-testing-configmaps
  2. Observe that the TaskRun Pod never starts, that the TaskRun remains "Pending" instead of "Running", and that the TaskRun's timeout message doesn't indicate there was any issue starting the Pod. The underlying reason only surfaces in the Pod's events, as shown below.
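
For reference, the actual reason does surface in the Pod's events. A quick way to see it (the pod name below is illustrative, since generateName adds a random suffix; output abridged):

$ kubectl get events --field-selector involvedObject.name=missing-configmap-abcde-pod
LAST SEEN   TYPE      REASON        OBJECT                            MESSAGE
5s          Warning   FailedMount   pod/missing-configmap-abcde-pod   MountVolume.SetUp failed for volume "config-volume" : configmap "config-for-testing-configmaps" not found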

Additional Info

  • Kubernetes version:

    Output of kubectl version:

$ kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.32.0
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

$ tkn version
Client version: 0.41.0
Pipeline version: v1.4.0
Dashboard version: v0.61.0

aThorp96 avatar Nov 17 '25 19:11 aThorp96

`TaskRun "configmap-4qrgb" failed to finish within "1m0s": pod status "PodReadyToStartContainers": "False"`

If the message can provide the actual reason for the failure (in your example, the fact that a ConfigMap is missing), that would be better than just saying the pod isn't ready to start its containers.

The message should contain information that the user can use to fix the issue, whether in the infrastructure or in their own configuration.

gbenhaim avatar Dec 01 '25 18:12 gbenhaim

So there are two parts to this:

  • Usage of a ConfigMap in env variables, using valueFrom. These references can be detected up front, so we could fail the TaskRun early.
  • Usage of a ConfigMap in volumes. These cannot be failed early because they are recoverable: if the ConfigMap appears, the Pod will run.

I am working on a fix for the former.
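
Roughly, the early detection could look like this sketch (hypothetical code, not the actual fix; the function name and wiring are made up, but the client-go calls are standard). It resolves every configMapKeyRef used by a step's env vars before the Pod is created:

package validate

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// validateEnvConfigMaps returns an error when a step references a ConfigMap
// that does not exist, so the TaskRun can be failed early with a clear
// message instead of timing out in Pending. (Sketch only.)
func validateEnvConfigMaps(ctx context.Context, kc kubernetes.Interface, ns string, steps []corev1.Container) error {
    for _, step := range steps {
        for _, env := range step.Env {
            if env.ValueFrom == nil || env.ValueFrom.ConfigMapKeyRef == nil {
                continue
            }
            ref := env.ValueFrom.ConfigMapKeyRef
            if ref.Optional != nil && *ref.Optional {
                continue // optional references are allowed to be missing
            }
            _, err := kc.CoreV1().ConfigMaps(ns).Get(ctx, ref.Name, metav1.GetOptions{})
            if apierrors.IsNotFound(err) {
                return fmt.Errorf("step %q references missing ConfigMap %q (key %q)", step.Name, ref.Name, ref.Key)
            }
            if err != nil {
                return err
            }
        }
    }
    return nil
}

The same walk could presumably cover envFrom and secretKeyRef as well, with the returned error becoming the TaskRun's failure message.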

vdemeester avatar Dec 02 '25 09:12 vdemeester

Thanks @vdemeester

Usage of a ConfigMap in volumes. These cannot be failed early because they are recoverable: if the ConfigMap appears, the Pod will run.

In this case, I think it's understandable that we can't simply fail the TaskRun, since the Pod would recover once the ConfigMap is created. However, in that case, would it be possible to somehow propagate the Pod's status into the TaskRun's status, so a user can see that the Pod failed to start?

aThorp96 avatar Dec 02 '25 13:12 aThorp96

In this case, I think it's understandable that we can't simply fail the TaskRun, since the Pod would recover once the ConfigMap is created. However, in that case, would it be possible to somehow propagate the Pod's status into the TaskRun's status, so a user can see that the Pod failed to start?

Well, that's the thing: nothing in the Pod's status indicates that it's failing because the ConfigMap doesn't exist (see the abridged status after the list below). The reason does appear in the events, but we are not watching those today, and it feels a bit overkill to start watching them just for this. It would also be architecturally questionable because:

  • Volume mount failures are intentionally recoverable by Kubernetes design
  • Adding early failure would break the recovery mechanism we just demonstrated
  • It would require additional API calls on every reconciliation
  • It goes against the separation of concerns (events are for humans/logging, status is for controllers)
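
To illustrate: this is roughly what the Pod's status reports while it is stuck (abridged and approximate, from the reproduction above; the pod name is illustrative). Nothing in it names the missing ConfigMap:

$ kubectl get pod missing-configmap-abcde-pod -o yaml
...
status:
  phase: Pending
  conditions:
  - type: PodReadyToStartContainers
    status: "False"
  - type: Initialized
    status: "False"
    reason: ContainersNotInitialized
  - type: Ready
    status: "False"
    reason: ContainersNotReady
  - type: PodScheduled
    status: "True"

The "configmap not found" detail only appears in the FailedMount event, which is exactly the information we are not watching.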

The env variable case will be handled, though.

vdemeester avatar Dec 02 '25 13:12 vdemeester