Results, TerminationMessage and Containers
While looking into #4529, I stumbled on a relatively big shortcoming of our current implementation of Results.
The initial problem was highlighted by one of our customers' Tasks that writes results. From one version to the next, the same exact task would fail to write the result. Digging a bit into the issue, the culprit commit was e6399ce1d, which at first glance made no sense at all. The only hint there is that it adds a new initContainer.
In some documentation around the TerminationMessage behaviour, we can read the following:
You can customize the terminationMessagePath field of a container for Kubernetes to use the content of the specified custom file to fulfill the termination message of the container when the container running process completes or fails. The maximum size of a termination message is 4 KB.
This is obviously what we document in Emitting Results. But there is more.
The total size of termination messages of all containers in a pod cannot exceed 12 KB. If the total size exceeds 12 KB, the state manager of Kubernetes sets a limit on the termination message sizes. For example, if a pod contains four InitContainers and eight application containers, the state manager limits the termination message of each container to 1 KB. This indicates that only the first 1 KB of the termination message of each container is intercepted.
The part about the state manager splitting the limit across containers is the reason for the failure. In a gist, the more containers we have in our pod, the smaller each container's result can be. This is an issue because it means the maximum size of results depends on the number of containers, which means it also depends on the number of internal containers (place-tools, …).
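For reference, this matches how the kubelet normalizes container statuses: a pod-wide 12 KiB budget (what the kubelet sources call `MaxPodTerminationMessageLogLength`, IIRC) is divided evenly across all containers, init containers included, before each termination message is truncated. A minimal sketch of that arithmetic, written for illustration rather than copied from the kubelet:

```go
package main

import "fmt"

// Pod-wide budget the kubelet allows for termination messages (12 KiB).
const maxPodTerminationMessageLogLength = 12 * 1024

// perContainerBudget sketches how the per-container termination message
// limit shrinks as containers (including init containers) are added.
// Illustration only, not the actual kubelet implementation.
func perContainerBudget(initContainers, containers int) int {
	total := initContainers + containers
	if total == 0 {
		return maxPodTerminationMessageLogLength
	}
	return maxPodTerminationMessageLogLength / total
}

func main() {
	// One init container + one step container: 6 KiB each.
	fmt.Println(perContainerBudget(1, 1)) // 6144
	// Add a single extra init container and the budget drops to 4 KiB.
	fmt.Println(perContainerBudget(2, 1)) // 4096
	// Four init containers + eight app containers: 1 KiB each,
	// the example quoted in the documentation above.
	fmt.Println(perContainerBudget(4, 8)) // 1024
}
```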
This might have an impact on some TEPs:
- https://github.com/tektoncd/community/blob/main/teps/0075-object-param-and-result-types.md
- https://github.com/tektoncd/community/blob/main/teps/0076-array-result-types.md
This also highlights why https://github.com/tektoncd/community/pull/521, or at least some thinking along those lines, is needed.
Next steps for this particular issue could be:
- [x] Enhance the documentation around the limitation of Results (depending on containers, …)
- [ ] Reduce the number of init container to the strict minimum
- [ ] Error out more properly when the termination message is truncated. Today we error out saying we didn't find the result, whereas we could probably detect that the JSON message is invalid (because it was truncated); see the sketch after this list
- [ ] Split results over multiple steps in case of multiple results, maybe by having results per step and submitting all TaskResults?
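On the truncation-detection item above: steps write their results as JSON into the termination message, so a truncated message typically no longer parses as JSON. A minimal sketch, assuming a hypothetical helper (the function name, error texts, and result-entry type below are illustrative, not the actual Tekton code):

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// runResult mirrors the shape steps write into the termination message:
// a JSON array of key/value entries. Field names here are illustrative.
type runResult struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

var (
	errResultNotFound   = errors.New("result not found in termination message")
	errMessageTruncated = errors.New("termination message is not valid JSON; it was likely truncated by the kubelet")
)

// findResult is a hypothetical helper: instead of reporting every failure
// as "result not found", it first checks whether the message even parses.
func findResult(terminationMessage, name string) (string, error) {
	var results []runResult
	if err := json.Unmarshal([]byte(terminationMessage), &results); err != nil {
		return "", fmt.Errorf("%w: %v", errMessageTruncated, err)
	}
	for _, r := range results {
		if r.Key == name {
			return r.Value, nil
		}
	}
	return "", errResultNotFound
}

func main() {
	// A message cut off mid-array, as the kubelet would leave it.
	truncated := `[{"key":"RESULT_STRING","value":"....`
	if _, err := findResult(truncated, "RESULT_STRING"); err != nil {
		fmt.Println(err)
	}
}
```

Distinguishing "invalid JSON" from "result missing" would let us surface a clearer error like "termination message truncated" instead of "missing result".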
/cc @tektoncd/core-maintainers @abayer @imjasonh
That phrasing of the limit is surprising to me. The vanilla k8s docs say:
The termination message is intended to be brief final status, such as an assertion failure message. The kubelet truncates messages that are longer than 4096 bytes. The total message length across all containers will be limited to 12KiB. The default termination message path is /dev/termination-log. You cannot set the termination message path after a Pod is launched
...with no mention of how each container's message is limited.
My read of the k8s docs is that there's a 12KB total limit, but not that it's divided among containers in any way. It might be limited that way, but that's not documented.
That's not to say this isn't still a big issue that we should tackle soon (and plan to, AFAIK), just that maybe AlibabaCloud's docs are overly/incorrectly prescriptive about how that limitation is applied. Or maybe their platform does apply it that way, but vanilla k8s doesn't.
If someone tests this and we can determine that each container's message is limited to 12KB/${numContainers} on all k8s platforms, that certainly bumps the priority for fixing this.
This is what is happening today, and it is why I opened the bug: adding an extra initContainer (even one without any termination message setup, …) did reduce the size of the message available to each container (and this is what happened in 0.32.0).
As a reproducer, see the following resources:
```yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: generate-result
spec:
  params:
    - name: STRING_LENGTH
      description: Length of the string to create
    - name: STRING_CHAR
      description: Char to use when creating string
      type: string
      default: '.'
  results:
    - name: RESULT_STRING
      description: A result string
  steps:
    - name: gen-result
      image: bash:latest
      env:
        - name: PARAM_STRING_LENGTH
          value: $(params.STRING_LENGTH)
        - name: PARAM_STRING_CHAR
          value: $(params.STRING_CHAR)
      script: |
        #! /usr/bin/bash
        set -e
        len=$PARAM_STRING_LENGTH
        ch=$PARAM_STRING_CHAR
        printf '%*s' "$len" | tr ' ' "$ch" >> $(results.RESULT_STRING.path)
---
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: print-result
spec:
  params:
    - name: TO_PRINT
      type: string
  steps:
    - name: print-result
      image: bash:latest
      env:
        - name: PARAM_TO_PRINT
          value: $(params.TO_PRINT)
      script: |
        #! /usr/bin/bash
        set -e
        echo $PARAM_TO_PRINT
---
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  labels:
    app: sample
  name: result-test
spec:
  params:
    - name: RESULT_STRING_LENGTH
      description: Length of string to generate for generate-result task
    - name: RESULT_STRING_CHAR
      description: Char to repeat in result string
      default: '.'
  tasks:
    - name: generate-result
      params:
        - name: STRING_LENGTH
          value: $(params.RESULT_STRING_LENGTH)
        - name: STRING_CHAR
          value: $(params.RESULT_STRING_CHAR)
      taskRef:
        kind: Task
        name: generate-result
    - name: print-result
      params:
        - name: TO_PRINT
          value: $(tasks.generate-result.results.RESULT_STRING)
      taskRef:
        kind: Task
        name: print-result
```
Running this pipeline with a RESULT_STRING_LENGTH of 3000 (~3K) fails on >= 0.32 but succeeds on earlier releases:
tkn pipeline start result-test -p RESULT_STRING_LENGTH=3000 --use-param-defaults
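For a rough sense of the arithmetic (the exact container count in the TaskRun pod depends on the Tekton release and the number of steps, so these numbers are illustrative): with the 12288-byte pod-wide budget split evenly, four containers leave 12288 / 4 = 3072 bytes each, roughly enough for a ~3000-byte result plus its JSON envelope, while five containers leave only 12288 / 5 ≈ 2457 bytes, which is not.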
Oh I see, thanks for clarifying. In that case, it sounds like we should update the k8s docs so folks don't trip over this later. I'm sure I'll forget and need a reminder 😅
This also definitely bumps up the priority for larger results and getting off terminationMessages.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle stale
Send feedback to tektoncd/plumbing.
/lifecycle frozen