
[Feature Proposal] Add Helm option for spec.strategy.type for controller and extensions-deployment

Open igooch opened this issue 6 months ago • 0 comments

Is your feature request related to a problem? Please describe.

The current Agones upgrading documentation has the caveat:

Regardless of the type of installation, there will be a brief ~20-30 second period during upgrade when the controller service switches to the new controller endpoint that service is unable to connect to the new controller. The SDK servers are still functional during this time, however the controller will not be able to receive requests such as creating new game servers. Be sure to include retry logic with back-off in your application logic to account for the error Internal error occurred: failed calling webhook "mutations.agones.dev": failed to call webhook: Post "https://agones-controller-service.agones-system.svc:443/mutate?timeout=10s": no endpoints available for service "agones-controller-service" during this time.

This is due to the Recreate strategy for the controller and extensions-deployment.

https://github.com/googleforgames/agones/blob/eb69e7f6c92ffed70e8e69b66436986949666df5/install/helm/agones/templates/controller.yaml#L38-L39

https://github.com/googleforgames/agones/blob/eb69e7f6c92ffed70e8e69b66436986949666df5/install/helm/agones/templates/extensions-deployment.yaml#L34-L35

By the Kubernetes definition, "All existing Pods are killed before new ones are created when .spec.strategy.type==Recreate." Because all Pods are killed, there is a brief ~20-30 second period with no endpoints, so the webhook (used for creating or allocating new game servers) temporarily fails.

Describe the solution you'd like

Add a flag, similar to what we have for the agones.allocator.updateStrategy

https://github.com/googleforgames/agones/blob/eb69e7f6c92ffed70e8e69b66436986949666df5/install/helm/agones/templates/service/allocation.yaml#L154-L157

so that the user can set spec.strategy.type to either Recreate or RollingUpdate. Unlike agones.allocator.updateStrategy, the strategy for the controller and extensions-deployment should default to Recreate to maintain their current behavior.

In the documentation, we should specify that RollingUpdate for the controller and extensions-deployment should not be used with the Helm-generated TLS certificates (more details below), but only with cert-manager or self-signed certificates that are not rotated during upgrade.

Describe alternatives you've considered

When switching the controller and extensions from Recreate to RollingUpdate, the "no endpoints available for service" error no longer appears. Instead, the following error occurs:

Internal error occurred: failed calling webhook "mutations.agones.dev": failed to call webhook: Post "https://agones-controller-service.agones-system.svc:443/mutate?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "admission-controller-ca")

Based on https://github.com/helm/helm/issues/10731, this is expected behavior from Helm: "until the pods refresh the certificate, the kubernetes apiserver will get a TLS error calling the admissions web hook." This applies to the Helm-generated TLS certificates. If the user uses cert-manager or their own certificates, and does not rotate them at the same time as the upgrade, this should not be an issue. This means that RollingUpdate should not be hardcoded and should instead be a configurable field.

Additional context

Link to the Agones Feature Proposal (if any)

Discussion Link (if any)

igooch, Jun 02 '25 20:06