Misconfigured TaskRun pods cause PipelineRuns to silently timeout
Expected Behavior
A TaskRun whose Pod cannot start should either fail early, or, when it times out, the TaskRun's failed "Succeeded" condition should give some helpful message indicating that the Pod could not start. Something like: `TaskRun "configmap-4qrgb" failed to finish within "1m0s": pod status "PodReadyToStartContainers":"False"`
Actual Behavior
The failed "Succeeded" condition does not indicate that the Pod could not start; it only says: `TaskRun "my-taskrun" failed to finish within "$timeout"`.
Steps to Reproduce the Problem
- Create the following TaskRun:
```yaml
---
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  generateName: missing-configmap-
spec:
  timeout: 1m
  taskSpec:
    steps:
      - name: secret
        image: mirror.gcr.io/ubuntu
        script: |
          #!/usr/bin/env bash
          [[ $(cat /config/test.data) == $TEST_DATA ]]
        env:
          - name: TEST_DATA
            valueFrom:
              configMapKeyRef:
                name: config-for-testing-configmaps
                key: test.data
        volumeMounts:
          - name: config-volume
            mountPath: /config
    volumes:
      - name: config-volume
        configMap:
          name: config-for-testing-configmaps
```
- Observe that the TaskRun Pod never starts, that the TaskRun stays in "Pending" instead of "Running", and that the timeout message on the TaskRun doesn't indicate there was any issue starting the Pod.
Additional Info
- Kubernetes version:

  Output of `kubectl version`:

  ```
  $ kubectl version
  Client Version: v1.32.0
  Kustomize Version: v5.5.0
  Server Version: v1.32.0
  ```

- Tekton Pipeline version:

  Output of `tkn version` or `kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'`:

  ```
  $ tkn version
  Client version: 0.41.0
  Pipeline version: v1.4.0
  Dashboard version: v0.61.0
  ```
`TaskRun "configmap-4qrgb" failed to finish within "1m0s: pod status "PodReadyToStartContainers":"False";"
If it can provide the actual reason for the failure (in your example, the fact that a ConfigMap is missing), that would be better than just saying the pod isn't ready to start its containers.
The message should contain information the user can use to fix the issue, either in the infrastructure or in their configuration.
So there are two things to it:
- Usage of a ConfigMap in env variables, via `valueFrom`. These references can be detected, and we could fail the TaskRun early.
- Usage of a ConfigMap in volumes. These cannot be treated as failures because they are recoverable: if the ConfigMap appears, the Pod will run.
I am working on a fix for the former.
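For the env-variable case, the idea is that the reconciler already knows which ConfigMaps a step references via `valueFrom.configMapKeyRef`, so it can look them up and fail the TaskRun instead of waiting for the timeout. Below is a minimal, hypothetical sketch of such a check using plain client-go; this is not the actual Tekton reconciler code, and the helper name `missingEnvConfigMaps` is made up for illustration:

```go
package taskrunvalidate

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// missingEnvConfigMaps returns the names of ConfigMaps that a step's env
// references via valueFrom.configMapKeyRef but that do not exist in the
// namespace. A reconciler could fail the TaskRun early when this is non-empty.
func missingEnvConfigMaps(ctx context.Context, kc kubernetes.Interface, namespace string, env []corev1.EnvVar) ([]string, error) {
	var missing []string
	for _, e := range env {
		if e.ValueFrom == nil || e.ValueFrom.ConfigMapKeyRef == nil {
			continue
		}
		ref := e.ValueFrom.ConfigMapKeyRef
		// Optional references are allowed to be absent; skip them.
		if ref.Optional != nil && *ref.Optional {
			continue
		}
		_, err := kc.CoreV1().ConfigMaps(namespace).Get(ctx, ref.Name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			missing = append(missing, ref.Name)
			continue
		}
		if err != nil {
			return nil, fmt.Errorf("checking ConfigMap %q: %w", ref.Name, err)
		}
	}
	return missing, nil
}
```

A reconciler could run this when building the Pod and mark the TaskRun failed with a reason that names the missing ConfigMaps.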
Thanks @vdemeester
> Usage of a ConfigMap in volumes. These cannot be treated as failures because they are recoverable: if the ConfigMap appears, the Pod will run.
In this case, I think it's understandable that we can't simply fail the TaskRun, since the Pod would recover once the ConfigMap is created. However, in that case, would it be possible to somehow propagate the Pod's status into the TaskRun's status, so a user can see that the Pod failed to start?
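For illustration, a minimal sketch of what such propagation could look like if the controller simply appended the Pod's non-`True` conditions to the timeout message. The function and the exact message format are assumptions, loosely following the message suggested in the expected behavior above:

```go
package taskrunstatus

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// timeoutMessage builds a hypothetical timeout message that surfaces the Pod
// conditions which are not "True", so a user can see that the Pod never
// became ready to start its containers.
func timeoutMessage(taskRunName, timeout string, pod *corev1.Pod) string {
	msg := fmt.Sprintf("TaskRun %q failed to finish within %q", taskRunName, timeout)
	if pod == nil {
		return msg
	}
	var notTrue []string
	for _, c := range pod.Status.Conditions {
		if c.Status != corev1.ConditionTrue {
			notTrue = append(notTrue, fmt.Sprintf("%q:%q", string(c.Type), string(c.Status)))
		}
	}
	if len(notTrue) > 0 {
		msg += ": pod status " + strings.Join(notTrue, "; ")
	}
	return msg
}
```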
> In this case, I think it's understandable that we can't simply fail the TaskRun, since the Pod would recover once the ConfigMap is created. However, in that case, would it be possible to somehow propagate the Pod's status into the TaskRun's status, so a user can see that the Pod failed to start?
Well, that's the thing: nothing in the Pod status indicates that it's failing because the ConfigMap doesn't exist. It does appear in the events, but we are not watching those today (and it feels a little bit overkill to start watching them just for this). This would be architecturally questionable because:
- Volume mount failures are intentionally recoverable by Kubernetes design
- Adding early failure would break the recovery mechanism we just demonstrated
- It would require additional API calls on every reconciliation
- It goes against the separation of concerns (events are for humans/logging, status is for controllers)
The env variable case will be handled, though.