
feat: autorecover from stuck situations

Open gerrnot opened this issue 1 year ago • 13 comments

Since there was no veto on the Helm mailing list and there was one piece of positive feedback from a Rancher representative (see references), alongside the many likes in various related bug threads, I am submitting this new HIP...

gerrnot avatar Jul 12 '24 08:07 gerrnot

Hey, I just want to say that you are awesome for driving this HIP! 💪 I think there is a dire need for this. I've seen many people lose trust in Helm because of things getting stuck irrecoverably; it would be so nice to redeem Helm. Also, this will be of great value for all the tools that depend on Helm (like Rancher).

I'll read your document and put down some comments. On a first read, I'm guessing it's just formatting things or typos. I might contribute something better once I've digested your suggestion.

Tell me if I can help!

lindhe avatar Jul 24 '24 06:07 lindhe

The --timeout parameter gets a deeper meaning. Previously, the --timeout parameter only had an effect on the helm process running on the respective client. After implementation, the --timeout parameter will be stored in the Helm release object (secret) in k8s and have an indirect impact on possible parallel processes.

I think it is really clever reusing the --timeout flag for this purpose! Nice.
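
To make the new semantics concrete, here is a minimal sketch of how a parallel client could interpret the stored value; the field names below are made up for illustration and are not taken from the HIP:

```go
// Sketch only: hypothetical field names, not the actual HIP schema.
// Illustrates how a parallel helm process could decide whether a
// pending release is still locked or may be recovered.
package main

import (
	"fmt"
	"time"
)

// ReleaseLock mirrors the idea of persisting the --timeout value
// alongside the release state in the Helm release secret.
type ReleaseLock struct {
	LastUpdated time.Time     // last write by the client holding the lock
	Timeout     time.Duration // the --timeout the locking client was started with
}

// expired reports whether a parallel process may take over the release.
func (l ReleaseLock) expired(now time.Time) bool {
	return now.After(l.LastUpdated.Add(l.Timeout))
}

func main() {
	// A release last touched 10 minutes ago by a client using --timeout 5m
	// is considered abandoned and can be auto-recovered.
	lock := ReleaseLock{
		LastUpdated: time.Now().Add(-10 * time.Minute),
		Timeout:     5 * time.Minute,
	}
	fmt.Println("stuck release recoverable:", lock.expired(time.Now()))
}
```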

lindhe avatar Jul 24 '24 06:07 lindhe

I think this HIP has an additional benefit that should probably be mentioned in the document: it should make multi-user scenarios more robust in general, not just fix the situations where Helm gets stuck. What happens today if two clients run similar commands at the same time? If there is no mechanism to handle that, I believe this HIP would be a great addition to handle those scenarios gracefully.

lindhe avatar Jul 24 '24 07:07 lindhe

I think you are on the right track with your implementation, trying to keep this lock inside the Helm release object. That way, we do not introduce new objects to the Helm ecosystem, which is a great thing.

That said, I just want to mention that there is a Lease object in Kubernetes. If we were looking for alternatives, that might be a sensible thing to use.
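
To make the alternative concrete, here is a rough sketch of taking such a per-release lock with client-go; the Lease name and holder identity are made-up conventions, not anything Helm defines:

```go
// Sketch only: a per-release lock via the native coordination.k8s.io
// Lease object instead of a field inside the Helm release secret.
package helmlock

import (
	"context"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func acquireReleaseLease(ctx context.Context, cs kubernetes.Interface, ns, release, holder string, ttlSeconds int32) error {
	now := metav1.NewMicroTime(time.Now())
	lease := &coordinationv1.Lease{
		ObjectMeta: metav1.ObjectMeta{
			// Hypothetical naming convention: one Lease per release.
			Name:      "sh.helm.release.lock." + release,
			Namespace: ns,
		},
		Spec: coordinationv1.LeaseSpec{
			HolderIdentity:       &holder,
			LeaseDurationSeconds: &ttlSeconds,
			AcquireTime:          &now,
			RenewTime:            &now,
		},
	}
	// Create fails with an AlreadyExists error while another client holds
	// the lock; a real implementation would then inspect renewTime plus
	// leaseDurationSeconds to decide whether a takeover is allowed.
	_, err := cs.CoordinationV1().Leases(ns).Create(ctx, lease, metav1.CreateOptions{})
	return err
}
```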

lindhe avatar Jul 24 '24 07:07 lindhe

I think this HIP has an additional benefit that should probably be mentioned in the document: it should make multi-user scenarios more robust in general, not just fix the situations where Helm gets stuck. What happens today if two clients run similar commands at the same time? If there is no mechanism to handle that, I believe this HIP would be a great addition to handle those scenarios gracefully.

There is already a mechanism to detect concurrency, since a certain helm3 version. It is based on the "pending" states in the release object. While implemented with good intentions, this has the consequence that one cannot recover anymore in bad cases. So I would not mention this in addition. Nevertheless, if you feel that additional benefits could be mentioned, feel free to suggest an edit again; I will happily merge it if it increases the chances of the HIP.
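
For readers unfamiliar with that mechanism, a simplified sketch of the idea (not Helm's actual code): a release left in a pending state blocks any further operation, with no built-in way out.

```go
// Simplified illustration of the existing concurrency detection:
// releases stuck in a pending state block all further operations,
// which is exactly what makes them unrecoverable today.
package main

import "fmt"

type Status string

const (
	StatusPendingInstall  Status = "pending-install"
	StatusPendingUpgrade  Status = "pending-upgrade"
	StatusPendingRollback Status = "pending-rollback"
)

func isPending(s Status) bool {
	switch s {
	case StatusPendingInstall, StatusPendingUpgrade, StatusPendingRollback:
		return true
	}
	return false
}

func main() {
	last := StatusPendingUpgrade // e.g. the client crashed mid-upgrade
	if isPending(last) {
		fmt.Println("refusing to proceed: another operation is in progress")
	}
}
```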


One more thing to discuss that I realized during the implementation: Helm always uses the client time, e.g. for the creation of the UPDATED field etc.

If I now introduced the k8s server time, it could lead to weird timestamp deviations between the other Helm timestamp fields and the ones used for locking. While I am still convinced that server time is better, it would make the most sense only if it were refactored across the entire codebase, which I guess is out of scope and would require an independent HIP. What do you think?
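
For completeness, one known workaround for obtaining a server-side timestamp, since Kubernetes exposes no dedicated server-time endpoint, is to create a throwaway object and read the creationTimestamp the API server assigns. The sketch below only illustrates the extra round trip this costs; it is not part of the HIP:

```go
// Sketch only: approximate the API server clock by creating a
// short-lived ConfigMap and reading its server-assigned timestamp.
package helmlock

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func serverNow(ctx context.Context, cs kubernetes.Interface, ns string) (metav1.Time, error) {
	cm := &corev1.ConfigMap{ObjectMeta: metav1.ObjectMeta{GenerateName: "helm-clock-"}}
	created, err := cs.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{})
	if err != nil {
		return metav1.Time{}, err
	}
	// Best-effort cleanup; the timestamp is all we wanted.
	_ = cs.CoreV1().ConfigMaps(ns).Delete(ctx, created.Name, metav1.DeleteOptions{})
	return created.CreationTimestamp, nil
}
```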

gerrnot avatar Jul 24 '24 19:07 gerrnot

Kubernetes has distributed locking primitives, I think? Couldn't one of those be used?

kfox1111 avatar Jul 24 '24 19:07 kfox1111

@kfox1111 I assume you mean the Lease object as suggested by @lindhe.

Yes, that could be used; it is just that the release state would then be distributed across:

  1. the traditional helm release secret
  2. the lock state in k8s

Since I consider the locked state an attribute of the release state, the question is whether it is worth storing it somewhere else...

Nevertheless, it could improve backwards compatibility if we treat it as something that is honored when present and ignored otherwise.

gerrnot avatar Jul 24 '24 19:07 gerrnot

One more thing to discuss that I realized during the implementation: Helm always uses the client time, e.g. for the creation of the UPDATED field etc.

If I now introduced the k8s server time, it could lead to weird timestamp deviations between the other Helm timestamp fields and the ones used for locking. While I am still convinced that server time is better, it would make the most sense only if it were refactored across the entire codebase, which I guess is out of scope and would require an independent HIP. What do you think?

Sounds like a bummer. I agree 1000% that server time is superior. I would go so far as to say that it is the only sensible thing to use. But I also agree that it probably belongs in another PR (unfortunately).

lindhe avatar Jul 30 '24 14:07 lindhe

original HIP draft - storing the lock in a k8s secret

  • A backwards-compatible preview version is published as a Docker image (platform linux/amd64) for use at one's own risk: gernotfeichter/helm-autorecover:0.0.3 (based on v3.16).
  • Clarified why client time is used for LOCKED TILL.

possible change: storing the lock as a native k8s Lease

If agreement is reached that such a solution is better overall, I can invest the effort of rewriting the HIP and the draft code.

benefits

  • helm ls could be left as is:
    • no breaking change in the output
    • discussions about which flag should show the new fields LOCKED TILL and SESSION ID become obsolete.
  • standardisation is always good (good library support for working with said object)
  • visibility/accessibility: The Lease object can be inspected by any user with sufficient permissions; not even a helm binary is required for that.

drawbacks

  • The release state is split (one part is stored in a k8s secret; another part, at least temporarily, in a Lease object).
    • Codewise: additional libraries need to be added, and additional communication with k8s needs to happen to perform CRUD operations on the Lease object.
    • Userwise: the user needs additional knowledge when debugging concurrency issues.
  • The user under which the helm operations are performed requires CRUD permissions on the Lease object; that is a potential breaking change. We could think of a fallback though, e.g. if privileges for the Lease are insufficient, log a warning and continue (see the sketch below). But ignoring the lock could reintroduce the very concurrency risks/instabilities this HIP tries to avoid; the goal was to keep concurrent-modification protection while also allowing escaping stuck situations.
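
The fallback mentioned in the last drawback could look roughly like this; acquire stands in for whatever lock-taking function would be settled on, and the degradation shown is only the behaviour discussed above, not a finished design:

```go
// Sketch only: degrade gracefully when the user lacks RBAC permissions
// for coordination.k8s.io Lease objects, instead of failing the release.
package helmlock

import (
	"context"
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

func acquireWithFallback(ctx context.Context, acquire func(context.Context) error) error {
	if err := acquire(ctx); err != nil {
		if apierrors.IsForbidden(err) {
			// Insufficient privileges for the Lease: keep the pre-HIP
			// behaviour (no lock) rather than breaking existing users, at
			// the cost of losing concurrent-modification protection.
			log.Println("warning: no permission to manage Lease objects; proceeding without lock")
			return nil
		}
		return err
	}
	return nil
}
```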

Based on this, @kfox1111, do you still think the k8s lock (Lease) should be used? I think it might be ok, because I estimate that most people run the helm client with a user that has namespace-admin permissions.

So I'm only asking for a final ok from a maintainer for the Lease solution before starting further work. I personally think the benefits of the Lease slightly outweigh those of the alternative solution.

gerrnot avatar Aug 06 '24 10:08 gerrnot

This is something we have been struggling with, and we would appreciate it being merged. Can it be merged and included in the next release?

ajaybhat avatar Dec 05 '24 11:12 ajaybhat

@ajaybhat From my side yes, but I'm dependent on approvals/decisions by the maintainers.

gerrnot avatar Dec 05 '24 12:12 gerrnot

Considering work on Helm 4 has begun, maybe that's what we should target here?

lindhe avatar Dec 05 '24 13:12 lindhe

We are also struggling with this situation in our pipelines and would appreciate this being merged. Let me know if I can help somehow.

From my point of view the current implementation is fine the way it is, but using Leases also has its charm.

To me it seems more important just to decide on one and go on.

david0 avatar Jul 08 '25 10:07 david0