Results, TerminationMessage and Containers
While looking into #4529, I stumbled on a relatively big shortcoming of our current implementation of Results.
The initial problem was highlighted by one of our customers' Tasks that writes results. From one version to the next, the same exact task would fail to write the result. Digging a bit into the issue, the culprit commit was e6399ce1d, which at first glance made no sense at all. The only hint there is that it adds a new initContainer.
In some documentation around the TerminationMessage behaviour, we can read the following:
You can customize the terminationMessagePath field of a container for Kubernetes to use the content of the specified custom file to fulfill the termination message of the container when the container running process completes or fails. The maximum size of a termination message is 4 KB.
This is obviously what we document in Emitting Results. But there is more.
The total size of termination messages of all containers in a pod cannot exceed 12 KB. If the total size exceeds 12 KB, the state manager of Kubernetes sets a limit on the termination message sizes. For example, if a pod contains four InitContainers and eight application containers, the state manager limits the termination message of each container to 1 KB. This indicates that only the first 1 KB of the termination message of each container is intercepted.
The part about the state manager splitting the limit across containers is the reason for the failure. In a gist, the more containers we have in our pod, the smaller each container's result can be. This is an issue because it means the maximum size of results depends on the number of containers, which means it also depends on the number of internal containers (place-tools, …).
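For reference, this matches how the kubelet normalizes container statuses: a pod-wide 12 KiB budget (what the kubelet sources call `MaxPodTerminationMessageLogLength`, IIRC) is divided evenly across all containers, init containers included, before each termination message is truncated. A minimal sketch of that arithmetic, written for illustration rather than copied from the kubelet:

```go
package main

import "fmt"

// Pod-wide budget the kubelet allows for termination messages (12 KiB).
const maxPodTerminationMessageLogLength = 12 * 1024

// perContainerBudget sketches how the per-container termination message
// limit shrinks as containers (including init containers) are added.
// Illustration only, not the actual kubelet implementation.
func perContainerBudget(initContainers, containers int) int {
	total := initContainers + containers
	if total == 0 {
		return maxPodTerminationMessageLogLength
	}
	return maxPodTerminationMessageLogLength / total
}

func main() {
	// One init container + one step container: 6 KiB each.
	fmt.Println(perContainerBudget(1, 1)) // 6144
	// Add a single extra init container and the budget drops to 4 KiB.
	fmt.Println(perContainerBudget(2, 1)) // 4096
	// Four init containers + eight app containers: 1 KiB each,
	// the example quoted in the documentation above.
	fmt.Println(perContainerBudget(4, 8)) // 1024
}
```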
This might have an impact on some TEPs:
- https://github.com/tektoncd/community/blob/main/teps/0075-object-param-and-result-types.md
- https://github.com/tektoncd/community/blob/main/teps/0076-array-result-types.md
This also highlights why https://github.com/tektoncd/community/pull/521, or at least some thinking along those lines, is needed.
Next steps for this particular issue could be:
- [x] Enhance the documentation around the limitation of Results (depending on containers, …)
- [ ] Reduce the number of init container to the strict minimum
- [ ] Error out more properly when the termination message is truncated. Today we error out saying we didn't find the result, whereas we could probably detect that the JSON message is invalid (because it was truncated); see the sketch after this list
- [ ] Split results over multiple steps in case of multiple results, maybe by having results per step and submitting all TaskResults?
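On the truncation-detection item above: steps write their results as JSON into the termination message, so a truncated message typically no longer parses as JSON. A minimal sketch, assuming a hypothetical helper (the function name, error texts, and result-entry type below are illustrative, not the actual Tekton code):

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// runResult mirrors the shape steps write into the termination message:
// a JSON array of key/value entries. Field names here are illustrative.
type runResult struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

var (
	errResultNotFound   = errors.New("result not found in termination message")
	errMessageTruncated = errors.New("termination message is not valid JSON; it was likely truncated by the kubelet")
)

// findResult is a hypothetical helper: instead of reporting every failure
// as "result not found", it first checks whether the message even parses.
func findResult(terminationMessage, name string) (string, error) {
	var results []runResult
	if err := json.Unmarshal([]byte(terminationMessage), &results); err != nil {
		return "", fmt.Errorf("%w: %v", errMessageTruncated, err)
	}
	for _, r := range results {
		if r.Key == name {
			return r.Value, nil
		}
	}
	return "", errResultNotFound
}

func main() {
	// A message cut off mid-array, as the kubelet would leave it.
	truncated := `[{"key":"RESULT_STRING","value":"....`
	if _, err := findResult(truncated, "RESULT_STRING"); err != nil {
		fmt.Println(err)
	}
}
```

Distinguishing "invalid JSON" from "result missing" would let us surface a clearer error like "termination message truncated" instead of "missing result".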
/cc @tektoncd/core-maintainers @abayer @imjasonh
That phrasing of the limit is surprising to me. The vanilla k8s docs say:
The termination message is intended to be brief final status, such as an assertion failure message. The kubelet truncates messages that are longer than 4096 bytes. The total message length across all containers will be limited to 12KiB. The default termination message path is /dev/termination-log. You cannot set the termination message path after a Pod is launched
...with no mention of how each container's message is limited.
My read of the k8s docs is that there's a 12KB total limit, but not that it's divided among containers in any way. It might be limited that way, but that's not documented.
That's not to say this isn't still a big issue that we should tackle soon (and plan to, AFAIK), just that maybe AlibabaCloud's docs are overly/incorrectly prescriptive about how that limitation is applied. Or maybe their platform does apply it that way, but vanilla k8s doesn't.
If someone tests this and we can determine that each container's message is limited to 12KB/${numContainers} on all k8s platforms, that certainly bumps the priority for fixing this.
This is what is happening today, and it is why I opened the bug: adding an extra initContainer (even one without any termination message setup, …) did reduce the size of the message available to each container (and this is what happened in 0.32.0).
As a reproducer, see the following resources:
```yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: generate-result
spec:
  params:
    - name: STRING_LENGTH
      description: Length of the string to create
    - name: STRING_CHAR
      description: Char to use when creating string
      type: string
      default: '.'
  results:
    - name: RESULT_STRING
      description: A result string
  steps:
    - name: gen-result
      image: bash:latest
      env:
        - name: PARAM_STRING_LENGTH
          value: $(params.STRING_LENGTH)
        - name: PARAM_STRING_CHAR
          value: $(params.STRING_CHAR)
      script: |
        #! /usr/bin/bash
        set -e
        len=$PARAM_STRING_LENGTH
        ch=$PARAM_STRING_CHAR
        printf '%*s' "$len" | tr ' ' "$ch" >> $(results.RESULT_STRING.path)
---
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: print-result
spec:
  params:
    - name: TO_PRINT
      type: string
  steps:
    - name: print-result
      image: bash:latest
      env:
        - name: PARAM_TO_PRINT
          value: $(params.TO_PRINT)
      script: |
        #! /usr/bin/bash
        set -e
        echo $PARAM_TO_PRINT
---
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  labels:
    app: sample
  name: result-test
spec:
  params:
    - name: RESULT_STRING_LENGTH
      description: Length of string to generate for generate-result task
    - name: RESULT_STRING_CHAR
      description: Char to repeat in result string
      default: '.'
  tasks:
    - name: generate-result
      params:
        - name: STRING_LENGTH
          value: $(params.RESULT_STRING_LENGTH)
        - name: STRING_CHAR
          value: $(params.RESULT_STRING_CHAR)
      taskRef:
        kind: Task
        name: generate-result
    - name: print-result
      params:
        - name: TO_PRINT
          value: $(tasks.generate-result.results.RESULT_STRING)
      taskRef:
        kind: Task
        name: print-result
```
Running this pipeline with a RESULT_STRING_LENGTH of 3000 (~3K) fails on >= 0.32 but succeeds on earlier releases:
tkn pipeline start result-test -p RESULT_STRING_LENGTH=3000 --use-param-defaults
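For a rough sense of the arithmetic (the exact container count in the TaskRun pod depends on the Tekton release and the number of steps, so these numbers are illustrative): with the 12288-byte pod-wide budget split evenly, four containers leave 12288 / 4 = 3072 bytes each, roughly enough for a ~3000-byte result plus its JSON envelope, while five containers leave only 12288 / 5 ≈ 2457 bytes, which is not.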
Oh I see, thanks for clarifying. In that case, it sounds like we should update the k8s docs so folks don't trip over this later. I'm sure I'll forget and need a reminder 😅
This also definitely bumps up the priority for larger results and getting off terminationMessages.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle stale
Send feedback to tektoncd/plumbing.
/lifecycle frozen