pipeline icon indicating copy to clipboard operation
pipeline copied to clipboard

Enhance user-facing Condition/ContainerTerminated Reasons with backwards compatibility

Open JeromeJu opened this issue 2 years ago • 13 comments

We have had use cases as specified in #7223 and #7396 that downstream users and Tekton providers would want to have more granular error reasons. As discussed at the API WG and #7390, the existing CRDs {Step, TaskRun, PipelineRun} rely on knative.dev/pkg/apis#Condition for conveying user-facing reasons.

However these Condition reasons are being relied on for our Pipeline For example, dashboard will rely on the error reason to determine the step status.

Blocked issues: Surfacing of actual Termination Reason in Step Status Surface TaskRun failure reason when Step or Pod fails Make user-facing condition messages more granular as canonical error codes

Current workaround: Option 1: Add a new provisional field in the CRDs for {Step, TaskRun, PipelineRun} Condition/ContainerTerminatedState status Option 2: Add it to the condition/containerTerminated message

Future work plan: With a major version bump, i.e. in v2alpha1, we would be able to change the Condition/ContainerTerminated Reason

co-authored with: @renzodavid9

JeromeJu avatar Dec 28 '23 20:12 JeromeJu

/assign @renzodavid9

JeromeJu avatar Dec 28 '23 20:12 JeromeJu

@JeromeJu: GitHub didn't allow me to assign the following users: renzodavid9.

Note that only tektoncd members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/assign @renzodavid9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot avatar Dec 28 '23 20:12 tekton-robot

/assign me

renzodavid9 avatar Dec 28 '23 20:12 renzodavid9

@renzodavid9: GitHub didn't allow me to assign the following users: me.

Note that only tektoncd members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/assign me

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot avatar Dec 28 '23 20:12 tekton-robot

/assign

renzodavid9 avatar Dec 28 '23 21:12 renzodavid9

Through offline discussion with @renzodavid9 , we've examined that it might not be a feasible way to create a new field in the pod container status to update the container termination reason since these are all knative rather than Tekton conditions.

Pros Cons Example
Add a new field More straightforward when for the transition to replace the condition Reasons during the next major version bump #7545
Add to the condition status message Easier to maintain Users might not be aware of the change

JeromeJu avatar Jan 04 '24 20:01 JeromeJu

One other question is that are making changes to PipelineRunStatus Condition Reason and PipelineRunStatus Condition Reason also considered as breaking changes?

It looks like that Dashboard relies on the ConditionStatus on this to determine the completion of a PipelineRun (similarly for TaskRuns). Does this imply that the changes made to PipelineRunStatus/TaskRunStatus Reasons will not be breaking changes for Dashboard and others that consumes such status?

cc @AlanGreene @vdemeester @tektoncd/core-collaborators 🙏

JeromeJu avatar Jan 08 '24 15:01 JeromeJu

The Dashboard also uses the reason values in a number of places, so changes to these would be considered a breaking change. See response to similar question on another issue proposing changes to the reason values for steps: https://github.com/tektoncd/pipeline/pull/7390#issuecomment-1850954066

Changes for PipelineRun and TaskRun would have similar issues.

Here are just a few examples of the Dashboard's current use of the reason field for PipelineRun and TaskRun:

  • https://github.com/tektoncd/dashboard/blob/bd39f464faa0a44a0379afbfdf1a54371d132782/src/containers/TaskRun/TaskRun.jsx#L317
  • https://github.com/tektoncd/dashboard/blob/bd39f464faa0a44a0379afbfdf1a54371d132782/packages/components/src/components/RunHeader/RunHeader.scss#L92
  • https://github.com/tektoncd/dashboard/blob/bd39f464faa0a44a0379afbfdf1a54371d132782/packages/components/src/components/TaskTree/TaskTree.jsx#L66
  • https://github.com/tektoncd/dashboard/blob/bd39f464faa0a44a0379afbfdf1a54371d132782/packages/components/src/components/StatusIcon/StatusIcon.jsx#L65

AlanGreene avatar Jan 08 '24 15:01 AlanGreene

The Dashboard also uses the reason values in a number of places, so changes to these would be considered a breaking change. Changes for PipelineRun and TaskRun would have similar issues.

Thanks @AlanGreene for confirming the understanding on this 🙏 Would there also be a preference on the 2 options, adding a new field in PipelineRun/TaskRun status for enhancing via a new Reason field or adding the granular reasons into the Condition Status message?

JeromeJu avatar Jan 08 '24 15:01 JeromeJu

It depends on whether this information is only intended for user consumption or if it's also intended to be used by clients / automation.

If it's intended for clients I don't think the message is a good option. Today I believe most clients treat it as an opaque string and avoid parsing it. If we change this we would need to define a clear API contract for its structure, content, etc. and guard against unexpected changes.

AlanGreene avatar Jan 08 '24 15:01 AlanGreene

Following on https://github.com/tektoncd/pipeline/issues/7539#issuecomment-1877702017, and according with offline conversation with @JeromeJu , to avoid extending the ContainerStateTerminated from the k8s lib, we can instead modify the StepStatus from Tekton to add the new field. It will look something like this:

apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: task1
...
status:
  steps:
  - container: step-first
    imageID: docker-pullable://busybox@sha256:ba76950ac9eaa407512c9d859cea48114eeff8a6f12ebaa5d32ce79d4a017dd8
    name: first
    terminationReason: AnyReason # <- This will be the new field
    terminated:
      containerID: docker://aaa111222
      exitCode: 1
      finishedAt: "2024-01-08T16:27:38Z"
      reason: Error
      startedAt: "2024-01-08T16:27:38Z"

renzodavid9 avatar Jan 08 '24 21:01 renzodavid9

Following on #7539 (comment), and according with offline conversation with @JeromeJu , to avoid extending the ContainerStateTerminated from the k8s lib, we can instead modify the StepStatus from Tekton to add the new field. It will look something like this:

apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: task1
...
status:
  steps:
  - container: step-first
    imageID: docker-pullable://busybox@sha256:ba76950ac9eaa407512c9d859cea48114eeff8a6f12ebaa5d32ce79d4a017dd8
    name: first
    terminationReason: AnyReason # <- This will be the new field
    terminated:
      containerID: docker://aaa111222
      exitCode: 1
      finishedAt: "2024-01-08T16:27:38Z"
      reason: Error
      startedAt: "2024-01-08T16:27:38Z"

Thansk @renzodavid9 for the discussion and this! This looks like the way to go as consensus reached during the API WG 🙏

The next AI for us might be to create a PR RFC(requesting for feedback). i.e.

  • https://github.com/tektoncd/pipeline/pull/7545

And we could together have some documentations in the deprecations md and elsewhere to claim our migration plan when moving to v2alpha1.

JeromeJu avatar Jan 10 '24 00:01 JeromeJu

Hello there!

The open PR for this currently has ~19 tests failing (unit + integration). This is mainly due to the new field, status. steps[].terminationReason, is expected to have an specific value for some tests where we check a TaskRun status against a desired one. In offline discussion with @JeromeJu, it sounds the proper way is to update the tests to add the new field with the expected value + any extra work needed, taking into account that we might push more changes in the same PR.

Other alternative discussed:

  • For the current tests failing, during comparison, ignore the status. steps[].terminationReason field, but this might not be ideal, looks better to include the new field in the current tests

Making the observation in case we have comments from members in the community. Thanks a lot! 😄

renzodavid9 avatar Jan 18 '24 21:01 renzodavid9