State locking bug fixes
This is a temporary placeholder issue.
- [ ] https://github.com/weaveworks/weave-gitops-interlock/issues/599
- [ ] https://github.com/weaveworks/weave-gitops-interlock/issues/596
- [ ] https://github.com/weaveworks/weave-gitops-interlock/issues/580
- [ ] https://github.com/weaveworks/tf-controller/issues/481
I'm currently running a special release that Kevin created, using these images:
docker images
REPOSITORY                                                  TAG              IMAGE ID      CREATED      SIZE
617912315635.dkr.ecr.us-west-2.amazonaws.com/tf-runner      v0.15.1-er       6e0bd092d57c  12 days ago  218MB
bigkevmcd/tf-runner                                         latest           6e0bd092d57c  12 days ago  218MB
617912315635.dkr.ecr.us-west-2.amazonaws.com/tf-controller  v0.15.1-er       7879b9e5d5d0  12 days ago  106MB
bigkevmcd/tf-controller                                     latest           7879b9e5d5d0  12 days ago  106MB
617912315635.dkr.ecr.us-west-2.amazonaws.com/tf-runner      v0.15.1-base-er  74ca5524f232  12 days ago  100MB
bigkevmcd/tf-runner                                         latest-base      74ca5524f232  12 days ago  100MB
bigkevmcd/tf-controller                                     <none>           95309b02b3d5  12 days ago  111MB
However, there are still locking issues. This is some of what I am seeing:
k get tf -n flux-system | grep rpc
posdev-bradley-core False error running Plan: rpc error: code = Internal desc = error acquiring the state lock: Lock Info:... 41d
posdev-ci-01-core False error running Plan: rpc error: code = Internal desc = error acquiring the state lock: Lock Info:... 21d
posdev-clifford-core False error running Plan: rpc error: code = Internal desc = error acquiring the state lock: Lock Info:... 166m
posdev-config False error running Plan: rpc error: code = Internal desc = error acquiring the state lock: Lock Info:... 45d
At this point, my first step is to check DynamoDB for a stale lock on, say, posdev-clifford-core, and I find that there is none. So next I run "tfctl force-unlock":
tfctl force-unlock posdev-clifford-core
Setting LockIdentifier to '' on resource flux-system/posdev-clifford-core
flux-system/posdev-clifford-core Patched and Reconcile requested
Within a minute, I see my tf-runner job start again:
k get tf -n flux-system posdev-clifford-core
NAME READY STATUS AGE
posdev-clifford-core Unknown Reconciliation in progress 167m
And after a few minutes (tf-runner is really slow now), I see that it's locked again:
k get tf -n flux-system | grep rpc
posdev-brady-core False error running Plan: rpc error: code = Internal desc = error acquiring the state lock: Lock Info:... 19d
posdev-clifford-core False error running Plan: rpc error: code = Internal desc = error acquiring the state lock: Lock Info:... 176m
The thinking is that there are two issues:
1: The eviction case where we never receive SIGTERM
and
2: We receive the SIGTERM but it's not being delegated to the sub-process group