tofu-controller

State locking bug fixes

Open lasomethingsomething opened this issue 2 years ago • 1 comments

This is a temporary placeholder issue.

  • [ ] https://github.com/weaveworks/weave-gitops-interlock/issues/599
  • [ ] https://github.com/weaveworks/weave-gitops-interlock/issues/596
  • [ ] https://github.com/weaveworks/weave-gitops-interlock/issues/580
  • [ ] https://github.com/weaveworks/tf-controller/issues/481

lasomethingsomething, Nov 15 '23 16:11

I'm currently running a special release that Kevin created. These images are:

docker images
REPOSITORY                                                   TAG               IMAGE ID       CREATED       SIZE
617912315635.dkr.ecr.us-west-2.amazonaws.com/tf-runner       v0.15.1-er        6e0bd092d57c   12 days ago   218MB
bigkevmcd/tf-runner                                          latest            6e0bd092d57c   12 days ago   218MB
617912315635.dkr.ecr.us-west-2.amazonaws.com/tf-controller   v0.15.1-er        7879b9e5d5d0   12 days ago   106MB
bigkevmcd/tf-controller                                      latest            7879b9e5d5d0   12 days ago   106MB
617912315635.dkr.ecr.us-west-2.amazonaws.com/tf-runner       v0.15.1-base-er   74ca5524f232   12 days ago   100MB
bigkevmcd/tf-runner                                          latest-base       74ca5524f232   12 days ago   100MB
bigkevmcd/tf-controller                                      <none>            95309b02b3d5   12 days ago   111MB

There are still locking issues, however. This is some of what I am seeing:

k get tf -n flux-system | grep rpc  
posdev-bradley-core      False     error running Plan: rpc error: code = Internal desc = error acquiring the state lock: Lock Info:...   41d
posdev-ci-01-core        False     error running Plan: rpc error: code = Internal desc = error acquiring the state lock: Lock Info:...   21d
posdev-clifford-core     False     error running Plan: rpc error: code = Internal desc = error acquiring the state lock: Lock Info:...   166m
posdev-config            False     error running Plan: rpc error: code = Internal desc = error acquiring the state lock: Lock Info:...   45d

At this point my first stop is DynamoDB, to check whether there is a stale lock for, say, posdev-clifford-core. There is not. So next I run “tfctl force-unlock”:

tfctl force-unlock posdev-clifford-core
Setting LockIdentifier to '' on resource flux-system/posdev-clifford-core
flux-system/posdev-clifford-core Patched and Reconcile requested

Within a minute I see my tf-runner job start again:

k get tf -n flux-system posdev-clifford-core
NAME                   READY     STATUS                       AGE
posdev-clifford-core   Unknown   Reconciliation in progress   167m

And after a few minutes (tf-runner is really slow now), I see that it is locked again:

k get tf -n flux-system  | grep rpc
posdev-brady-core        False     error running Plan: rpc error: code = Internal desc = error acquiring the state lock: Lock Info:...   19d
posdev-clifford-core     False     error running Plan: rpc error: code = Internal desc = error acquiring the state lock: Lock Info:...   176m

The thinking is that there are two issues:

1: The eviction case where we never receive SIGTERM

and

2: We receive the SIGTERM but it is not being delegated to the sub-process group

cliffordt3, Dec 06 '23 15:12