pulumi-kubernetes
Pulumi State Drifts from Kubernetes Cluster State
What happened?
A container image tag was changed and Pulumi was in the process of applying that change to the Kubernetes cluster, but the process was interrupted. Pulumi never committed the change to its own state, even though the change had already been pushed to the cluster. As a result, the Pulumi state drifts away from the actual in-cluster state, and future runs of Pulumi won't pick up changes as expected.
Example
- We update the image tag in the Pulumi stack config manually (the image building and pushing process is managed outside Pulumi); a sketch of this setup follows the list
- We run `pulumi up`
- Pulumi compares the desired target image against its own state and detects that it differs from the current image
- Pulumi will appropriately start a rollout on the Deployment
- Pulumi updates the Deployment config in the k8s cluster to point to the new target image. THIS IS WHERE IT SHOULD ALSO UPDATE THE PULUMI STATE BUT DOES NOT.
- At this point changes have been committed to the k8s cluster but not to the Pulumi state
- Pulumi deployment gets interrupted for any reason -> k8s state has been modified but Pulumi state has not.
- On subsequent Pulumi runs, Pulumi is not aware of the change that was applied to the k8s cluster, as it never committed the change to its own state due to the process being interrupted.
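For reference, a minimal sketch of the kind of program involved, assuming TypeScript; the config key, resource names, and registry are illustrative rather than taken from the actual stack:

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as k8s from "@pulumi/kubernetes";

const config = new pulumi.Config();
// The tag is updated manually in the stack config; image build/push happens outside Pulumi.
const imageTag = config.require("imageTag");

const appLabels = { app: "my-app" };
const deployment = new k8s.apps.v1.Deployment("my-app", {
    spec: {
        selector: { matchLabels: appLabels },
        replicas: 2,
        template: {
            metadata: { labels: appLabels },
            spec: {
                containers: [{
                    name: "my-app",
                    image: `registry.example.com/my-app:${imageTag}`,
                }],
            },
        },
    },
});
```

By default the provider waits for the Deployment rollout before returning, and state is only written once that call completes, which matches the window described in the steps above.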
Output of `pulumi about`
Unknown (filing on behalf of a customer)
Additional context
No response
Contributing
Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).
@scottslowe This sounds expected. If pulumi is interrupted, it won't have a chance to write state and wouldn't even know the outcome of the operation. What do you expect to behave differently?
@mikhailshilkov It seems to me there's an atomicity issue here. Pulumi is updating Kubernetes (which triggers a rollout on the Deployment), but doesn't appear to be updating its own state until some point afterward (perhaps waiting for the rollout to complete?). In that time period---between when Kubernetes has the desired state and Pulumi does not---there's room for the configuration to drift (i.e., Pulumi gets interrupted, Kubernetes has been updated but Pulumi has not). From my (perhaps naive) point of view, we should be updating Pulumi's state at the same time (or as close as possible) as Kubernetes' state is being updated. Is that not the case currently?
Looks like this concerns https://github.com/pulumi (not pulumi-kubernetes). Note that pulumi writes checkpoints and that the managed backend is more robust than the self-hosted one. Also note that after a SIGINT, pulumi will finish writing everything down (a second SIGINT or a SIGTERM will immediately close the program).
Hey, just dropping in to mention I'm the person who opened a discussion about this with @scottslowe on Slack and he opened this issue on my behalf.
I've mostly been using the managed backend in the projects I've worked on with Pulumi, but the state drift can occur on both managed and self-hosted backends from my observations. The gist of the issue seems to have been described correctly (Pulumi can and will write changes to the k8s state before committing any trace of those changes to its own state, which in some scenarios leads to state drift).
To me it seems like the issue could be resolved by maintaining some kind of a write-ahead log in the pulumi state which can be used to pick up interrupted deployments (or at the very least properly clean them up). Perhaps this is even done, but it gets incorrectly rolled back on a failed deployment (even though state has already been committed and not cleaned up)?
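To make the suggestion concrete, here is a purely hypothetical toy sketch of that idea (it does not reflect how the Pulumi engine actually persists state, and every name in it is made up):

```typescript
import * as fs from "fs";

// Toy write-ahead journal: record the intent before mutating the cluster and
// mark it complete afterwards, so an interrupted run leaves a trace that a
// later run (or a cleanup step) can reconcile against the cluster.
interface JournalEntry {
    urn: string;       // resource being changed
    newImage: string;  // value about to be applied
    phase: "pending" | "done";
}

function recordPending(journalPath: string, urn: string, newImage: string): void {
    const entry: JournalEntry = { urn, newImage, phase: "pending" };
    fs.appendFileSync(journalPath, JSON.stringify(entry) + "\n");
}

function recordDone(journalPath: string, urn: string, newImage: string): void {
    const entry: JournalEntry = { urn, newImage, phase: "done" };
    fs.appendFileSync(journalPath, JSON.stringify(entry) + "\n");
}

// Intended usage: recordPending(...) -> apply the change to the cluster ->
// recordDone(...). A "pending" entry with no matching "done" entry signals a
// possibly half-applied change that the next run should verify.
```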
Another potentially useful observation is that this has never happened when Pulumi isn't configured to wait for the resource to become live. I wouldn't go so far as to say the waiting is what causes the bug, however; it could simply be a factor that widens the timing window in which the interruption can occur, and deployments where Pulumi isn't configured to wait may still have the same bug, just with a much shorter window in which it can actually occur.
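For context, the waiting behavior is controllable per resource. A sketch of turning it off with the provider's `pulumi.com/skipAwait` annotation (resource names and image are illustrative); this only narrows the window rather than fixing the underlying race:

```typescript
import * as k8s from "@pulumi/kubernetes";

const labels = { app: "my-app" };

// With skipAwait, Pulumi records the update as soon as the API server accepts
// it instead of waiting for the rollout to become live.
const deployment = new k8s.apps.v1.Deployment("my-app", {
    metadata: {
        annotations: { "pulumi.com/skipAwait": "true" },
    },
    spec: {
        selector: { matchLabels: labels },
        template: {
            metadata: { labels },
            spec: {
                containers: [{ name: "my-app", image: "registry.example.com/my-app:v2" }],
            },
        },
    },
});
```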
I'm not sure which component this issue really belongs to under pulumi, but it has been an issue for long enough that, for example, I recently couldn't recommend that a customer use the Pulumi Kubernetes Operator to automate their stack deployments, because the only way to fix this state drift is somewhat manual. The best automated alternative I'm aware of would be to run `pulumi refresh` before/after each deployment, but I suspect there's a reason Pulumi doesn't already do that by default: it leads to detecting changes that don't tangibly matter in most cases, and you'd have to maintain a set of ignoreChanges rules on every resource. To begin with, I'm not sure refresh would even fix it, because refresh only refreshes the state of tracked resources, and sometimes the problem is that Pulumi is not tracking them.
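For what it's worth, the workarounds mentioned above look roughly like this (a sketch only; the property paths and resource name are examples, not a recommendation):

```typescript
import * as k8s from "@pulumi/kubernetes";

// Running `pulumi refresh` (or `pulumi up --refresh`) first reconciles the
// state file with the cluster; ignoreChanges then suppresses the diffs that
// don't tangibly matter for the resource.
const deployment = new k8s.apps.v1.Deployment("my-app", {
    // ...same arguments as in the earlier sketch...
}, {
    ignoreChanges: ["metadata.annotations", "spec.replicas"],
});
```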
I can't promise I'll have too much time to put into this before next year, but I'd appreciate if someone can digest this issue into something potentially actionable I can contribute towards when I next run into it (giving me a reason to do so).
@scottslowe @awoimbee @JoaRiski the Pulumi engine is currently only able to update state once it gets a response from the provider's RPC handler. With this unary model there will always be a race condition where hard interruptions (`kill -9`) can leave state out of sync. https://github.com/pulumi/pulumi/issues/15958 and https://github.com/pulumi/pulumi/issues/5210 track work in this area.
Under normal circumstances, when the RPC completed but the resource didn't become ready, we return an `ErrorResourceInitFailed` -- see here. That should take care of updating Pulumi's state to match the cluster.
Soft interruptions (`ctrl-C`) invoke the provider's `Cancel` handler to give it an opportunity to return early from its work. ~~I do see a potential bug where our cancellation logic (and timeout handling) doesn't seem to return ErrorResourceInitFailed as you would expect.~~ Edit: after looking more and attempting to repro this, we do appear to be handling cancellation correctly.
A couple things that would be helpful to know:
- Are there any resource types in particular where you happen to see this more frequently?
- Are the interruptions typically due to `ctrl-C`, timeouts, or something else?
The conversation seems to have died out here after @blampe's explanation. If anyone watching this issue is still wrestling with premature cancellation resulting in drift and can elaborate on specific scenarios (i.e., answers to @blampe's questions about which resources are affected and the source of the interruptions), please open a follow-up ticket.