
Implement Server-Side Apply for GameServer updates

Open TBBle opened this issue 3 years ago • 12 comments

Is your feature request related to a problem? Please describe.

We have previously observed an issue where the Agones Controller and SDKServer generate a lot of 409 Conflict load on the API Server with UPDATE requests, because the Controller would send sequential updates with the same resourceVersion.

(I thought we lodged a bug report at the time, but I can't find it now... I guess the above comment plus a Slack conversation @pgilfillan mentioned having about this is all the follow-up we did.)

At the time we were unclear on why the Controller would be generating conflicts when it was also the most-recent updater of that object, but I noticed the following in the Server-Side Apply documentation:

This is different from Client Side Apply, where outdated values which have been overwritten by other users are left in an applier's local config. These values only become accurate when the user updates that specific field, if ever, and an applier has no way of knowing whether their next apply will overwrite other users' changes.

I suspect that this is the problem, and that a faulty assumption has been made that the update sent from the Controller is reflected in the local cache, but from observation of the apiserver logs this is not guaranteed.
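To illustrate what I mean, here's a minimal sketch of the usual client-go informer/lister pattern (names are illustrative, this is not the actual Agones code): the UPDATE is built from the locally cached copy, so if that cache is behind the Controller's own previous write, the call carries a stale resourceVersion and comes back as a 409 Conflict.

```go
package main

import (
	"context"

	agonesv1 "agones.dev/agones/pkg/apis/agones/v1"
	typedv1 "agones.dev/agones/pkg/client/clientset/versioned/typed/agones/v1"
	listersv1 "agones.dev/agones/pkg/client/listers/agones/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// syncState is a hypothetical controller step: read from the informer cache, mutate a
// copy, and send the whole object back as an UPDATE.
func syncState(ctx context.Context, lister listersv1.GameServerLister, client typedv1.GameServersGetter, namespace, name string) error {
	// This read is served from the local cache, which may lag the API Server.
	gs, err := lister.GameServers(namespace).Get(name)
	if err != nil {
		return err
	}
	gsCopy := gs.DeepCopy()
	gsCopy.Status.State = agonesv1.GameServerStateScheduled

	// The cached resourceVersion is sent back with the full object; if any earlier
	// write (even our own) already bumped it, this UPDATE is rejected with 409.
	_, err = client.GameServers(namespace).Update(ctx, gsCopy, metav1.UpdateOptions{})
	return err
}
```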

I suspect this also would help with #1856.

Describe the solution you'd like

Move the Controller and SDKServer to use Server-Side Apply at least for GameServers.

This feature is beta as of 1.16, which itself already reached the end of upstream (k8s) support in September 2020. GCP and AKS do not support 1.15 or older, and EKS support for 1.15 ends in May 2021.

So this would either make Agones require k8s 1.16 or later, or would need to be optional/selected based on the k8s version running.
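To make the suggestion concrete, here's a rough sketch of what a Server-Side Apply call could look like from Go via the dynamic client; the field manager name and the exact fields applied are assumptions on my part, not existing Agones code.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

var gameServerGVR = schema.GroupVersionResource{Group: "agones.dev", Version: "v1", Resource: "gameservers"}

// applyState sends only the fields this manager owns; the API Server merges them and
// tracks per-field ownership, so no resourceVersion needs to be read or sent.
func applyState(ctx context.Context, dc dynamic.Interface, namespace, name string) error {
	patch := []byte(fmt.Sprintf(
		`{"apiVersion":"agones.dev/v1","kind":"GameServer","metadata":{"name":%q},"status":{"state":"Scheduled"}}`,
		name))
	force := true
	_, err := dc.Resource(gameServerGVR).Namespace(namespace).Patch(ctx, name, types.ApplyPatchType, patch,
		metav1.PatchOptions{FieldManager: "agones-controller", Force: &force})
	return err
}
```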

Describe alternatives you've considered

I was considering suggesting replacing the UPDATE with PATCH, but right now my understanding of the code-flow is that we rely on UPDATE's optimistic locking via resourceVersion to avoid the Controller and SDKServer conflicting over the Status field, specifically in the case where an SDKServer tries to mark itself Allocated at the same time as the Controller tries to mark that GameServer as Unhealthy (because allocation took too long to complete). One of those updates must fail; otherwise the SDKServer and associated game server code may believe itself live and proceed to notify assigned players through a matchmaker, even though the GameServer and Pod will be deleted and cleaned up by the Controller soon after.

Simply replacing UPDATE with PATCH would move from optimistic locking to last-write-wins, so that's not a usable approach. We want a conflict in this specific case, but should not be generating conflicts in the usual case.
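A minimal sketch of why that conflict is load-bearing (again, illustrative names rather than the actual Agones code paths): with UPDATE the losing writer finds out and can re-evaluate, whereas a plain merge PATCH would silently win.

```go
package main

import (
	"context"

	agonesv1 "agones.dev/agones/pkg/apis/agones/v1"
	typedv1 "agones.dev/agones/pkg/client/clientset/versioned/typed/agones/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markUnhealthy tries to move a GameServer to Unhealthy using the copy it was handed.
func markUnhealthy(ctx context.Context, client typedv1.GameServersGetter, gs *agonesv1.GameServer) error {
	gsCopy := gs.DeepCopy()
	gsCopy.Status.State = agonesv1.GameServerStateUnhealthy

	_, err := client.GameServers(gsCopy.Namespace).Update(ctx, gsCopy, metav1.UpdateOptions{})
	if errors.IsConflict(err) {
		// Someone else (e.g. the SDKServer moving to Allocated) won the race.
		// Re-read the object and decide again rather than blindly retrying this write;
		// a last-write-wins PATCH would never surface this situation at all.
		return nil
	}
	return err
}
```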

It's possible the PATCH workflow could be made to work safely, which would massively reduce the load on the API Server (the limiting factor in our scaling tests in May/June this year), but I suspect that would require some refactoring of the GameServer CRD to allow an SDKServer to report its state separately from the desired state coming from the Controller.

Additional context

The relevant part of the linked comment:

... we ran a large-scale test on EKS and saw a lot of API Server load generated by UPDATEs on the GameServer from the Controller and the SDKServer, most of which were not 429s but 409s, because the sender kept sending repeated updates for several minutes with a resourceVersion it had already successfully updated against. This was increasing the load on the API Server and causing more delays in responses (etcd was reporting 600ms or more update latency), putting the whole thing into a death spiral.

Especially concerning, I recall we observed the Controller conflicting with itself. It's been 6 months, but my recollection is that we traced a clear example through to the logic for dynamic port allocation, which sends a first UPDATE to sync port allocation state and then immediately sends another UPDATE; that second update flow is also shared with the static-port code-flow. We were seeing the second UPDATE with the same resourceVersion as the first UPDATE, so clearly the Controller had sent a new update against an already-stale state.

The rest of the flows in syncGameServer do not (from memory) trigger multiple UPDATE calls, as each of them depends on some external change, e.g., Pod creation or status update from the SDKServer, to continue to the next step.

As mentioned, this was a while ago, but I suspect we had planned, but never followed up, to reproduce the conflict/performance issue using Agones' own scaling tests, which might be why we never turned the comment into a fully-fledged issue report.

TBBle avatar Nov 21 '20 02:11 TBBle

Mind if I attempt to summarise, just so I can ensure I've understood the issue?

  • We would like more performance out of some aspects of Agones (I'm not quite sure which part wasn't performant; it would be good to clarify, with numbers especially)
  • Our hypothesis is that one aspect of the performance issue is that there is a lot of traffic to the K8s API Server.
  • We think that using server side apply over the current methodology would help reduce that traffic, and thereby increase performance

Did I understand correctly?

I can dig into a lot of the reasons why there are conflicts, how they are very normal for a k8s controller system, how the informer cache works, and why optimistic locking is extremely important to a lot of the system, but I wanted to make sure I understood the ticket first before going deeper in.

That also being said - there may be places that server side apply will work better, but we should be careful with it, and make sure we test appropriately at every stage.

markmandel avatar Nov 21 '20 05:11 markmandel

Digging in some more, server side apply is still in beta, and seems to be strongly in flux at the moment (even into 1.18.x): https://kubernetes.io/blog/2020/04/01/kubernetes-1.18-feature-server-side-apply-beta-2/ , which is concerning.

There also does not seem to be a way to do server side apply from client-go: https://github.com/kubernetes/kubernetes/issues/88091

markmandel avatar Nov 21 '20 05:11 markmandel

The core observation was that for a single GameServer resource, we would see at least two different UPDATE messages in a row from the Agones Controller with the same resourceVersion, without waiting for a response. The second message cannot be anything but a 409 Conflict, unless the first message somehow fails for reasons other than 409 Conflict, because the first message would change the `resourceVersion`, invalidating the second message before it arrived.

The performance issue was that each of these UPDATE messages put load on the etcd under the k8s API Server, and with 3k or 15k (I forget, I'd have to check our Slack logs for precise numbers) GameServers active on a cluster, the API Server was hitting some kind of limit (probably max-mutating-requests-inflight) and rejecting further messages, which would just be resent later, causing a growing backlog in the Agones Controller and putting a hard cap on the number of GameServers that could become Ready each tick.

Nothing else was operating on that GameServer object at this point in its lifecycle, and the stats reported to Prometheus showed that almost all the load on the API Server was UPDATE events on GameServers, followed by UPDATE events on the GameServers' owned Pods. We could see the API Server hitting some kind of hard limit in the same graphs.

PATCH does not generate this etcd load (as I believe it does not require a hard sync across the etcd cluster), and I'm hoping that Server-Side Apply similarly does not generate that load, but does bring the benefits we need from optimistic locking.

At later points in its lifecycle, we did see correct 409 Conflict responses when the Controller and the SDKServer both tried to update a GameServer at the same time (because the game binary reported 'Ready' to the SDKServer at the same time as the Controller saw a timeout and tried to mark the GameServer as Unhealthy). This situation is why I believe we can't just replace UPDATE with PATCH in the Controller and SDKServer logic, or I would have suggested it in June. It also should be very rare; the only reason we saw it so much was that #1630 was causing the response to 'Allocated' from the game server binary to be delayed by about 30 seconds, lining it up nicely with a timeout in the Controller.

Perhaps instead we could use PATCH at those points where we don't need optimistic locking, i.e. when the Pod doesn't exist yet, which would remove this specific source of false 409 Conflicts and cut maybe 20% (one stage of five in the GameServer lifecycle, I think) of this load from the API Server.
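Something like this hypothetical merge-patch helper is what I have in mind for those stages (field names are illustrative): only the changed fields travel over the wire and no resourceVersion is sent, so it is last-write-wins by design, which is only acceptable where nothing else can legitimately be writing.

```go
package main

import (
	"context"

	typedv1 "agones.dev/agones/pkg/client/clientset/versioned/typed/agones/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// patchState sends a JSON merge patch containing only the fields being changed.
// No resourceVersion is included, so there is nothing to conflict on.
func patchState(ctx context.Context, client typedv1.GameServersGetter, namespace, name string) error {
	patch := []byte(`{"status":{"state":"Scheduled"}}`)
	_, err := client.GameServers(namespace).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```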

So that's the first two dot-points.

The third one is indeed a guess that Server-Side Apply will help in this situation.

If we need more detailed numbers/graphs from the tests in June, I'll try and get them when I'm back in the office on Monday, but my understanding is that when a colleague of mine raised this issue on a Slack channel at the time, the request was to try and reproduce these with Agones' existing load-tests, as they weren't being observed on GCP.

TBBle avatar Nov 21 '20 06:11 TBBle

Thanks for the write up! 👍

I feel like we need to take a step back here, and be problem first, not solution first, because I'm seeing some good ideas through here, but also lots of suggestions that would cause significant breakages in Agones because of misunderstandings of how certain parts of the system work.

If possible, can we get:

  • Specific, reproducible cases of the issues you are seeing.
  • Specifics about which versions of Agones, Kubernetes, and which platform, to see if it's occurring on all systems.

From your write up above, it sounds like it might be possible there is a logic issue in syncGameServer, or other logic optimisations that could be made.

But unfortunately this is hard to determine without breaking the problem down a bit further in a way that is reproducible, or doing some debugging to specifically confirm that the issue described is occurring as described. (I'm wondering if we could confirm this with a new unit test?)

markmandel avatar Nov 21 '20 18:11 markmandel

Something I've just recalled from June... We found that the scaling problems were much better (they still happened, but with a higher ceiling) if we stopped adding data to the GameServer object through annotations; in our case that data was the userIds who had been assigned to that GameServer by the match-making system.

GameServer data size is a scaling factor because UPDATE requires sending the whole object to the API Server, even if only a few fields are changing. PATCH would not have this scaling factor because it would only send the few changed fields, and Server-Side Apply would also not have this factor because the Controller (which generates all but one or two transitions) only sends the data it owns in an Apply operation, and this would be a fixed set of fields for the Controller.

However, this highlights that actually using Server-Side Apply will require some refactoring, as only the Controller should own the Status field, and the SDKServer would be updating a different field, e.g., StatusRequest (since some or all of the SDKServer-applied Status values are XRequest anyway), for the Controller to act upon.
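Purely as an illustration of that split (StatusRequest does not exist in the CRD today; it stands in for whatever field the SDKServer would own), both components could apply only their own fields under distinct field managers, and the API Server would track ownership per field.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

var gsGVR = schema.GroupVersionResource{Group: "agones.dev", Version: "v1", Resource: "gameservers"}

// apply submits a Server-Side Apply patch containing only the fields the given manager owns.
func apply(ctx context.Context, dc dynamic.Interface, namespace, name, manager, ownedFields string) error {
	patch := []byte(fmt.Sprintf(
		`{"apiVersion":"agones.dev/v1","kind":"GameServer","metadata":{"name":%q},%s}`, name, ownedFields))
	_, err := dc.Resource(gsGVR).Namespace(namespace).Patch(ctx, name, types.ApplyPatchType, patch,
		metav1.PatchOptions{FieldManager: manager})
	return err
}

// The Controller would apply the fields it owns:
//   apply(ctx, dc, ns, name, "agones-controller", `"status":{"state":"Scheduled"}`)
// while the SDKServer would apply only its own (hypothetical) request field:
//   apply(ctx, dc, ns, name, "agones-sdkserver", `"statusRequest":{"state":"Allocated"}`)
// Because ownership is tracked per field, neither apply clobbers the other's fields, and an
// attempt to write a field owned by the other manager is reported as a conflict unless forced.
```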

TBBle avatar Nov 23 '20 09:11 TBBle

Server side apply is stable (GA) as of Kubernetes 1.22.

We are currently supporting 1.21 and I think it's time to reassess this request. We can consider moving now, or after we have updated our Kubernetes support to 1.22, so that we are only relying on stable APIs.

/cc @ilkercelikyilmaz

roberthbailey avatar Dec 28 '21 19:12 roberthbailey

It's been a year and we haven't gotten to this yet. We have, however, updated Kubernetes a few more times and are now well into the range where we could rely on server side apply being available in the clusters where Agones is running.

roberthbailey avatar Dec 21 '22 05:12 roberthbailey

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add 'awaiting-maintainer' label or add a comment. Thank you for your contributions.

github-actions[bot] avatar Sep 15 '23 10:09 github-actions[bot]
