pipeline icon indicating copy to clipboard operation
pipeline copied to clipboard

Old resource deletion hangs when `alpha` flag is off

Open chuangw6 opened this issue 3 years ago • 18 comments

Expected Behavior

Users should be able to delete any taskrun object using kubectl delete tr --all no matter if alpha feature flag is enabled.

Actual Behavior

After a new pipeline controller is deployed with alpha feature flag off, kubectl delete tr --all command hangs on the taskruns that worked fine under previous controller with alpha feature flag enabled.

Steps to Reproduce the Problem

  1. Clone the pipeline repo. Change stable feature flag to alpha in this file.
  2. Run ko apply -R -f config/
  3. After the new controller is running, run the example yaml. You can see the pipelinerun will be successfully executed.
  4. Back to step 1, change the flag back to stable
  5. Rerun step 2
  6. Run kubectl delete tr --all to delete all taskruns that were previously created. You will see the deletion is hanging.

Additional Info

  • Kubernetes version:

    Output of kubectl version:

    Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:41:01Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.12-gke.1500", GitCommit:"6c11aec6ce32cf0d66a2631eed2eb49dd65c89f8", GitTreeState:"clean", BuildDate:"2022-05-11T09:25:37Z", GoVersion:"go1.16.15b7", Compiler:"gc", Platform:"linux/amd64"}
    WARNING: version difference between client (1.23) and server (1.21) exceeds the supported minor version skew of +/-1
    
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

    latest main
    

chuangw6 avatar Jul 22 '22 19:07 chuangw6

@chuangw6 any additionnal elements you can see ? Can you share the logs of the controller and webhook ?

vdemeester avatar Jul 25 '22 14:07 vdemeester

I can't reproduce this - deletion went through right away for me...

abayer avatar Jul 26 '22 14:07 abayer

Hi @vdemeester ,

If you search the keyworkd alpha in the controller log, you can see the complaints about it. This alpha flag was added here.

chuangw6 avatar Jul 26 '22 18:07 chuangw6

I can't reproduce this - deletion went through right away for me...

Hi @abayer , Instead of following step 4 & 5 to turn off alpha flag, can you try to just apply the latest release to see if you can reproduce? Thanks

kubectl apply -f https://storage.googleapis.com/tekton-releases/pipeline/previous/v0.38.0/release.yaml

chuangw6 avatar Jul 26 '22 18:07 chuangw6

@chuangw6 Aaaaah - that's fixed in #5215.

abayer avatar Jul 26 '22 19:07 abayer

@chuangw6 Aaaaah - that's fixed in #5215.

Thanks @abayer for the pointer. I pulled the latest main and followed the above steps. Now the controller doesn't have the complaints, but deleting the old taskruns still hangs.

Any idea?

chuangw6 avatar Jul 26 '22 19:07 chuangw6

I was looking at the webhook logs, which is where I saw the conversion errors. Could you post your new webhook log with latest main?

abayer avatar Jul 26 '22 19:07 abayer

Aaaah, I see https://gist.github.com/chuangw6/c1907a1d1f10a2962e9718b0c21ca4fd#file-pipeline-webhook-log-L342 - that's a problem for sure, but it's fixed in #5210.

abayer avatar Jul 26 '22 19:07 abayer

I was looking at the webhook logs, which is where I saw the conversion errors. Could you post your new webhook log with latest main?

sure!

chuangw6 avatar Jul 26 '22 20:07 chuangw6

Interesting - https://gist.github.com/chuangw6/7c68774399baaf689a7bc0aa189c5ded#file-new-webhook-log-L276 is there, with references to Chains, though I think that's just service account stuff?

abayer avatar Jul 26 '22 20:07 abayer

Interesting - https://gist.github.com/chuangw6/7c68774399baaf689a7bc0aa189c5ded#file-new-webhook-log-L276 is there,

Interesting. Why can pipeline webhook pick up chains-related stuff? lol

This is chains Controller log. It complaints about alpha flag too.

with references to Chains, though I think that's just service account stuff?

But the line[system:serviceaccounts system:serviceaccounts:tekton-chains system:authenticated] says it's authenticated.

chuangw6 avatar Jul 26 '22 20:07 chuangw6

It's on an UPDATE event, so it does validation and fails there. Chains tries to update the object, but cannot because some fields are behind alpha and stable is set. I am not sure why it would block deletion, except maybe if it fails to sign, put the object into the queue to retry, making it somehow un-deletable ?

@chuangw6 can you try to disable chains and see if that happens ?

vdemeester avatar Jul 27 '22 08:07 vdemeester

It's on an UPDATE event, so it does validation and fails there. Chains tries to update the object, but cannot because some fields are behind alpha and stable is set. I am not sure why it would block deletion, except maybe if it fails to sign, put the object into the queue to retry, making it somehow un-deletable ?

@chuangw6 can you try to disable chains and see if that happens ?

Hi @vdemeester, I tried two things:

  1. Had a new cluster with only pipeline installed. No resource deletion hanging problem at all under either stable or alpha flag.

  2. Tried to disable chains in my old cluster where both chains and pipelines are installed. I tried to disable chains by two ways

    • scale the controller 0.
    • delete all the resources under chains namespace by running ko delete -R -f config/.

    Neither of them works. And things even get tricker: after chains is disabled, I cannot delete the taskruns after alpha is turn on again.

chuangw6 avatar Jul 28 '22 15:07 chuangw6

It's on an UPDATE event, so it does validation and fails there. Chains tries to update the object, but cannot because some fields are behind alpha and stable is set. I am not sure why it would block deletion, except maybe if it fails to sign, put the object into the queue to retry, making it somehow un-deletable ?

^ @vdemeester this is correct. Chains uses a finalizer to make sure that Run objects don't disappear before they can be attested. Normally, the finalizer is removed once chains has processed the object, but because the alpha flag was disabled, chains can't complete its reconcile (probably when trying to add the chains.tekton.dev/signed annotation) so the finalizer is never removed.

I don't think chains is unique here - any client that tries to update a beta object with alpha fields would fail in a similar way. This kind of unexpected behavior was what I was worried about in https://github.com/tektoncd/pipeline/pull/5005#issuecomment-1170484289 (cc @lbernick @afrittoli)

wlynch avatar Jul 28 '22 19:07 wlynch

^ @vdemeester this is correct. Chains uses a finalizer to make sure that Run objects don't disappear before they can be attested. Normally, the finalizer is removed once chains has processed the object, but because the alpha flag was disabled, chains can't complete its reconcile (probably when trying to add the chains.tekton.dev/signed annotation) so the finalizer is never removed.

Thanks @wlynch . That explains why it's still not deletable after chains is disabled - because the taskruns have been added a finalizer.

Quick question: When a client like chains is trying to add a finalizer to the object, it's an UPDATE operation. And whenever update operation happens, pipeline's webhook will be triggered in which alpha flag is checked. Is that correct understanding?

chuangw6 avatar Jul 29 '22 01:07 chuangw6

FYI: For those who want to delete the taskruns that hang because of the finalizer, you can use the patch command to remove finalizer, which can help remove resources immediately.

kubectl patch tr/<NAME-OF-THE-TASKRUN-OBJECT> --type json --patch='[ { "op": "remove", "path": "/metadata/finalizers" } ]'

chuangw6 avatar Jul 29 '22 01:07 chuangw6

@wlynch this is solely on the v1beta1 object (with or without the api fields to alpha or not). Quick question, is chains using PATCH ? I think we "could" detect it's a PATCH on metadata and not run the validation. In any case, this is one "potential" problem we have with enable-api-fields, if users enable alpha API to try something and switch back to stable with an object with alpha api fields, that are, thus, not valid anymore. Maybe the webhook could "annotate" objects depending on whether it adds enable-api-fields to alpha (or to any value), so that on later validation, we could act based on that ? 🤔

vdemeester avatar Jul 29 '22 07:07 vdemeester

Quick question: When a client like chains is trying to add a finalizer to the object, it's an UPDATE operation. And whenever update operation happens, pipeline's webhook will be triggered in which alpha flag is checked. Is that correct understanding?

I think it really depends on when you switch back the enable-api-fields to stable. If it's during the execution of a TaskRun, then chains will pick it up once done and.. updating it will fail the validation. I think it also might depend when chains put the finalizer in.

vdemeester avatar Jul 29 '22 07:07 vdemeester

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar Oct 27 '22 08:10 tekton-robot

I think this should be fixed by https://github.com/tektoncd/pipeline/pull/5610; @chuangw6 are you able to check whether this is still happening?

lbernick avatar Oct 27 '22 14:10 lbernick

Hey @lbernick ,

I just tried on the latest main. It seems the issue still persists because of the reason above mentioned by @wlynch.

^ @vdemeester this is correct. Chains uses a finalizer to make sure that Run objects don't disappear before they can be attested. Normally, the finalizer is removed once chains has processed the object, but because the alpha flag was disabled, chains can't complete its reconcile (probably when trying to add the chains.tekton.dev/signed annotation) so the finalizer is never removed. I don't think chains is unique here - any client that tries to update a beta object with alpha fields would fail in a similar way. This kind of unexpected behavior was what I was worried about in https://github.com/tektoncd/pipeline/pull/5005#issuecomment-1170484289 (cc @lbernick @afrittoli)

chuangw6 avatar Nov 09 '22 15:11 chuangw6

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten with a justification. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

tekton-robot avatar Dec 09 '22 16:12 tekton-robot

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen with a justification. Mark the issue as fresh with /remove-lifecycle rotten with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

tekton-robot avatar Jan 08 '23 16:01 tekton-robot

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen with a justification. Mark the issue as fresh with /remove-lifecycle rotten with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot avatar Jan 08 '23 16:01 tekton-robot