GameServers in allocated state can respond to fleet updating
Is your feature request related to a problem? Please describe. A question about Fleet updates: it seems that the controller leaves GameServers in the Allocated state alone during a Fleet update, whether rolling or recreating. But some of our game servers have long TTLs (they are not specific to battles; players can enter and leave at any time), so I'm wondering how to tell these game servers to start the shutdown procedure (starting by telling players to leave the game, or just kicking them out).
Describe the solution you'd like We could set an annotation / label (I think an annotation would be better?) on an Allocated GameServer that is left behind after a Fleet update - then the WatchGameServer SDK could pick it up?
Describe alternatives you've considered N/A
Additional context N/A
Super interesting problem! We were discussing this a bit internally and came up with some other user journeys. Would love feedback on the below design.
Objective
In a usage of Agones in which Allocated GameServers are utilised for a long time, such as a Lobby GameServer, or potentially a session-based GameServer that is being reused multiple times in a row, it would be useful if that GameServer process could be notified in some way if and when a Fleet has been updated, so that an Allocated GameServer may want to actively perform a graceful shutdown and release its resources such that it can be replaced by a new version.
Requirements
There are two scenarios to consider in which we might want to notify a GameServer that a Fleet has been updated, each of which will need a solution:
Fleet Update (Rolling/Recreate)
This is a standard Fleet Rolling Update. When a rolling update starts, there should be an ability to notify any Allocated GameServers within that Fleet that a rolling update has begun.
Out of scope: Notifying GameServers when a rolling update has finished.
Canary Deploy Through Multiple Fleets
If the user is updating a Fleet through a “Two (or more) Fleets Strategy” to canary a new version, there is no rolling update; instead, the current Fleet is manually scaled down while the replacement Fleet is manually scaled up.
In this scenario, the only indication that a Fleet is being replaced is that the Fleet replica count could drop below the number of Allocated GameServers. The correct number of GameServers to notify would therefore equal:
Total Allocated GameServers - Fleet Replica Count
As an example - if a Fleet has 3 Allocated GameServers, and the Fleet's replica count is modified to be 1, two of the Allocated GameServers should be notified that they may wish to gracefully shut down.
It is worth noting that the GameServers notified will need to follow the scheduling setting of the Fleet to maintain the appropriate optimisation of resources.
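To make the arithmetic concrete, here is a minimal, illustrative Go sketch of that calculation (the helper name and signature are hypothetical, not Agones controller code):

package main

import "fmt"

// overflowCount returns how many Allocated GameServers exceed the Fleet's
// desired replica count, and so are candidates for the graceful-shutdown
// notification described above.
func overflowCount(allocated, desiredReplicas int32) int32 {
	if overflow := allocated - desiredReplicas; overflow > 0 {
		return overflow
	}
	return 0
}

func main() {
	// The example from the design: 3 Allocated GameServers, replicas set to 1.
	fmt.Println(overflowCount(3, 1)) // prints 2
}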
Other Notes
This solution must also be backward compatible.
Ideally, if not utilised, this solution should not put any extra load on the Kubernetes control plane.
Ideally, the solution should be extensible, if other notification strategies are required over time.
Background
Design ideas
The general idea is that on either a Fleet update that updates the GameServer template, or a Fleet update that causes the replica count to drop below the number of currently Allocated GameServers, the Agones system should apply a user-provided set of labels and/or annotations to the Allocated GameServers.
This will serve two purposes:
- The GameServer SDK WatchGameServer command could be utilised to pick up this annotation and/or label change to indicate the Fleet system change, and the game server binary could react accordingly (see the SDK sketch below the example configuration).
- This could also be used to proactively update GameServer labels, to effect a change in allocation strategy - such as removing the GameServers from being allocated while they perform their graceful shutdown.
Example yaml configuration:
apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
name: simple-game-server
spec:
replicas: 2
allocationOverflow: # applied to the GameServerSet's number Allocated GameServers that are over the desired replicas
labels:
mykey: myvalue
version: "" # empty an existing label value
annotations:
otherkey: setthisvalue
template:
spec:
ports:
- name: default
containerPort: 7654
template:
spec:
containers:
- name: simple-game-server
image: gcr.io/agones-images/simple-game-server:0.13
This will work the same across Fleet resizing and Rolling Updates, in that the implementation will respond to the underlying GameServerSet being shrunk in either scenario to a value smaller than the number of Allocated GameServers.
The benefit of the above configuration is that if it is not applied, then no extra overhead exists for a Fleet. The user has to actively opt-in, and also provide what labels and/or annotations make sense for their Agones implementation.
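On the game server side, here is a minimal sketch of how a binary might consume such a change via the Go SDK's WatchGameServer. It assumes the otherkey annotation from the example configuration above; the shutdown handling is illustrative only:

package main

import (
	"log"

	coresdk "agones.dev/agones/pkg/sdk"
	sdk "agones.dev/agones/sdks/go"
)

func main() {
	s, err := sdk.NewSDK()
	if err != nil {
		log.Fatalf("could not connect to SDK server: %v", err)
	}

	// Watch for changes to this GameServer's metadata. The "otherkey"
	// annotation is the hypothetical key from the allocationOverflow example.
	err = s.WatchGameServer(func(gs *coresdk.GameServer) {
		if _, ok := gs.ObjectMeta.Annotations["otherkey"]; ok {
			// The Fleet was updated and this Allocated GameServer is over the
			// desired replica count: stop accepting new players, drain, and
			// eventually call s.Shutdown().
			log.Println("fleet updated: starting graceful shutdown")
		}
	})
	if err != nil {
		log.Fatalf("could not watch GameServer: %v", err)
	}

	// Block forever; the real game loop would live here.
	select {}
}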
This should also be developed behind a feature flag; I would suggest: FleetAllocationOverflow.
Outstanding Questions:
- Any objections to allocationOverflow as a name? Is it relatively intuitive as to what it does?
Alternatives considered
Rather than (or in addition to) updating labels, we could instead send a SIGTERM signal, much like is done on Pod shutdown. This would require the end user to actively utilise terminationGracePeriodSeconds to specify how long to keep the Allocated GameServer running before it is shut down by Kubernetes.
This has several advantages:
- If the end user is already doing this to account for node shutdown, there is no repetition of work.
- This would automatically remove the GameServer from the allocation pool, since allocation skips any GameServer that has a DeletionTimestamp value.
But there are the following disadvantages:
- All GameServer binaries would be forced to respond to the SIGTERM signal as well as implement a terminationGracePeriodSeconds configuration, otherwise they would always be shut down. With labels and annotations, they can be safely ignored. (A sketch of what such a handler would look like follows this list.)
- With SIGTERM there is no indication of why the GameServer is being asked to gracefully shut down (it could be a node shutdown!), whereas with a label/annotation, the end user can pass along whatever information they wish.
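For context, a rough sketch of the SIGTERM handling that every game server binary would be forced to implement under that alternative (drainPlayers is a hypothetical stand-in for real drain logic):

package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Listen for SIGTERM, which Kubernetes sends when the Pod is deleted.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	// ... run the game server ...

	<-ctx.Done()
	// We only know *that* we are being terminated, not *why* (Fleet update,
	// node drain, manual delete...). Everything must finish before the Pod's
	// terminationGracePeriodSeconds expires and Kubernetes sends SIGKILL.
	log.Println("SIGTERM received, draining players")
	drainPlayers(30 * time.Minute)
}

// drainPlayers is a hypothetical placeholder for kicking or migrating players.
func drainPlayers(d time.Duration) {
	time.Sleep(d)
}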
That all being said, if we get user feedback asking for an option of deleting Allocated GameServers on Fleet update (to create a SIGTERM signal), we could add it at a later date with the above design.
I had similar concerns. https://agones.slack.com/archives/C9DGM5DS8/p1647513867843959
Currently, when the match server makes an allocation request to Agones, it puts the version in the selector to ask "please assign the latest version as a priority". Then, on the multi server, I wrote code that watches player.count and shuts the server down if there have been no users for a while (30 minutes).
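For illustration, an allocation request along those lines might look like the following GameServerAllocation, assuming a user-managed version label on the Fleet's GameServer template (the fleet name and version value here are hypothetical):

apiVersion: "allocation.agones.dev/v1"
kind: GameServerAllocation
spec:
  # try GameServers carrying the newest version label first
  preferred:
    - matchLabels:
        version: "v2"
  # fall back to any Ready GameServer in the fleet
  required:
    matchLabels:
      agones.dev/fleet: simple-game-server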
The first thing I thought about was using the SIGTERM approach suggested by @markmandel. But, as Mark said, using SIGTERM means you don't know "what" reason SIGTERM was sent for, and the game server must support it. (When SIGTERM arrives it is possible to check the reason by checking an annotation, but that feels unnatural.)
The 'onUpdate' method, on the other hand, seems to have a more natural flow.
Community feedback from my livestream when I wrote the above design:
Brittan J Thomas (4 days ago): RE: "onUpdate.replica" - does "replicaOverflow" make more sense?
Replacing the key replica, even with just overflow - I think that would work well. How do people feel?
@zmerlynn - This might interest you, since it's similar to disruption cases you've been exploring.
Updated the design a little bit.
Was looking at this again, and realised that I don't think we need two different configs for Fleet scaling and Fleet updates - they are essentially the same thing from a GameServer point of view -- we have too many Allocated GameServers.
One thing I did note though, is that when doing a rolling update, I saw the following on two GameServerSets
kubectl get gsset
NAME SCHEDULING DESIRED CURRENT ALLOCATED READY AGE
simple-game-server-hzh2h Packed 3 3 0 3 49s
simple-game-server-x4m5s Packed 1 2 2 0 2m51s
(simple-game-server-x4m5s is the older GameServerSet)
We can see that it's got a desired state of 1, and I'm wondering if it should be 0? Might be tied to the 25% default burst / rolling update values -- so I'm going to ignore it for the initial pass of this, and see if it becomes an issue.
Right as I wrote this, I scaled the Fleet up to 10 replicas, and confirmed that it is, since we end up at:
$ kubectl get gsset
NAME SCHEDULING DESIRED CURRENT ALLOCATED READY AGE
simple-game-server-hzh2h Packed 8 8 0 8 18m
simple-game-server-x4m5s Packed 0 2 2 0 20m
TL;DR - no need to edit any of the GameServerSet scaling code here.
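For reference, the 25% burst values mentioned above are the Fleet's rolling update defaults; they can also be set explicitly on the Fleet spec, roughly like this sketch (the values shown are the defaults):

apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
  name: simple-game-server
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%       # how many extra GameServers may be created during the update
      maxUnavailable: 25% # how many GameServers may be unavailable at a time during the update
  template:
    spec:
      ports:
        - name: default
          containerPort: 7654
      template:
        spec:
          containers:
            - name: simple-game-server
              image: gcr.io/agones-images/simple-game-server:0.13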
Getting close on this... not sure if it'll make RC next week, but definitely the release after I reckon.
Release is next Tuesday. I have the functionality pretty much done, but I'm still splitting PRs. This might slip into the next release... but we'll see how it goes. I give it 60% that it will land in time.
Given that 1.30.0 is tomorrow, this will slip to 1.31.0.
Just a small question: should we also apply this to the Reserved GameServers of the old GameServerSet? Similar to the Allocated GameServers, I think it might also make sense to notify the Reserved GameServers that the Fleet is being updated, so the customer can take action accordingly, e.g. release resources such that the GameServer can be replaced by a new version.
Given that 1.30.0 is tomorrow, this will slip to 1.31.0
...and it slipped again, because GDC.
Re: notifying the Reserved GameServers of the old GameServerSet - that's a good question. I'm leaning towards "no", or at least "not yet" (and wait and see what the feedback is)? Just because reservation tends to be on a per-gameserver basis, each game server doesn't know whether it should let go of its reservation when there's a new version - and if they all do, there are no gameservers for anyone to go to.
I think it's probably better to allow the Reserved -> Ready / Allocated flow to continue through, and then let the process manage things from there -- less opportunity for disruption.
But definitely would like to be open to perspectives from people actually using Reserved.
Makes sense to me. Also, thinking about the potential use cases, I feel that Reserved is a more transient state than Allocated, so allowing the Reserved -> Ready / Allocated flow to continue sounds reasonable, although by definition a GS can be reserved forever.