GameServers in allocated state can respond to fleet updating
Is your feature request related to a problem? Please describe. A question about Fleet updates: it seems that the controller leaves GameServers in the Allocated state alone during a Fleet update, whether rolling or recreating. But some of our game servers have long TTLs (they are not specific to battles; players can enter and leave at any time), so I'm wondering how to tell these game servers to start the shutdown procedure (starting by telling players to leave the game, or just kicking them out).
Describe the solution you'd like We could set an annotation / label (I think an annotation would be better?) on an Allocated GameServer that is left behind after a Fleet update - then the WatchGameServer SDK could pick it up?
Describe alternatives you've considered N/A
Additional context N/A
Super interesting problem! We were discussing this a bit internally and came up with some other user journeys. Would love feedback on the below design.
Objective
In a usage of Agones in which Allocated GameServers are utilised for a long time, such as a Lobby GameServer, or potentially a session-based GameServer that is being reused multiple times in a row, it would be useful if that GameServer process could be notified in some way if and when a Fleet has been updated, so that an Allocated GameServer may want to actively perform a graceful shutdown and release its resources such that it can be replaced by a new version.
Requirements
There are two scenarios to consider in which we might want to notify a GameServer that a Fleet has been updated, each of which will need a solution:
Fleet Update (Rolling/Recreate)
This is a standard Fleet Rolling Update. When a rolling update starts, there should be an ability to notify any Allocated GameServers within that Fleet that a rolling update has begun.
Out of scope: Notifying GameServers when a rolling update has finished.
Canary Deploy Through Multiple Fleets
If the user is updating a Fleet through a “Two (or more) Fleets Strategy” to canary a new version, there is no rolling update; instead, the current Fleet is manually scaled down while the replacement Fleet is manually scaled up.
In this scenario, the only indication that a Fleet is being replaced is that the Fleet replica count could drop below the number of Allocated GameServers. The correct number of GameServers to notify would therefore equal:
Total Allocated GameServers - Fleet Replica Count
As an example - if a Fleet has 3 Allocated GameServers, and the Fleet's replica count is modified to be 1, two of the Allocated GameServers should be notified that they may wish to gracefully shut down.
It is worth noting that the GameServers notified will need to follow the scheduling setting of the Fleet to maintain the appropriate optimisation of resources.
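To make the arithmetic concrete, here is a minimal, illustrative Go sketch of that calculation (the helper name and signature are hypothetical, not Agones controller code):

package main

import "fmt"

// overflowCount returns how many Allocated GameServers exceed the Fleet's
// desired replica count, and so are candidates for the graceful-shutdown
// notification described above.
func overflowCount(allocated, desiredReplicas int32) int32 {
	if overflow := allocated - desiredReplicas; overflow > 0 {
		return overflow
	}
	return 0
}

func main() {
	// The example from the design: 3 Allocated GameServers, replicas set to 1.
	fmt.Println(overflowCount(3, 1)) // prints 2
}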
Other Notes
This solution must also be backward compatible.
Ideally, if not utilised, this solution should not put any extra load on the Kubernetes control plane.
Ideally, the solution should be extensible, if other notification strategies are required over time.
Background
Design ideas
The general idea is that on either a Fleet update that updates the GameServer template, or a Fleet update that causes the replica count to drop below the number of currently Allocated GameServers, the Agones system should apply a user-provided set of labels and/or annotations to the Allocated GameServers.
This will serve two purposes:
- The GameServer SDK WatchGameServer command could be utilised to pick up this annotation and/or label change to indicate the Fleet system change, and the game server binary could react accordingly (see the SDK sketch below the example configuration).
- This could also be used to proactively update GameServer labels, to effect a change in allocation strategy - such as removing the GameServers from being allocated while they perform their graceful shutdown.
Example yaml configuration:
apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
name: simple-game-server
spec:
replicas: 2
allocationOverflow: # applied to the GameServerSet's number Allocated GameServers that are over the desired replicas
labels:
mykey: myvalue
version: "" # empty an existing label value
annotations:
otherkey: setthisvalue
template:
spec:
ports:
- name: default
containerPort: 7654
template:
spec:
containers:
- name: simple-game-server
image: gcr.io/agones-images/simple-game-server:0.13
This will work the same across Fleet resizing and Rolling Updates, in that the implementation will respond to the underlying GameServerSet being shrunk in either scenario to a value smaller than the number of Allocated GameServers.
The benefit of the above configuration is that if it is not applied, then no extra overhead exists for a Fleet. The user has to actively opt-in, and also provide what labels and/or annotations make sense for their Agones implementation.
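On the game server side, here is a minimal sketch of how a binary might consume such a change via the Go SDK's WatchGameServer. It assumes the otherkey annotation from the example configuration above; the shutdown handling is illustrative only:

package main

import (
	"log"

	coresdk "agones.dev/agones/pkg/sdk"
	sdk "agones.dev/agones/sdks/go"
)

func main() {
	s, err := sdk.NewSDK()
	if err != nil {
		log.Fatalf("could not connect to SDK server: %v", err)
	}

	// Watch for changes to this GameServer's metadata. The "otherkey"
	// annotation is the hypothetical key from the allocationOverflow example.
	err = s.WatchGameServer(func(gs *coresdk.GameServer) {
		if _, ok := gs.ObjectMeta.Annotations["otherkey"]; ok {
			// The Fleet was updated and this Allocated GameServer is over the
			// desired replica count: stop accepting new players, drain, and
			// eventually call s.Shutdown().
			log.Println("fleet updated: starting graceful shutdown")
		}
	})
	if err != nil {
		log.Fatalf("could not watch GameServer: %v", err)
	}

	// Block forever; the real game loop would live here.
	select {}
}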
This should also be developed behind a feature flag; I would suggest: FleetAllocationOverflow.
Outstanding Questions:
- Any objections to allocationOverflow as a name? Is it relatively intuitive as to what it does?
Alternatives considered
Rather than (or in addition to) updating labels, we could instead send a SIGTERM signal, much like is done on Pod shutdown. This would require the end user to actively utilise terminationGracePeriodSeconds to specify how long to keep the Allocated GameServer running before it is shut down by Kubernetes.
This has several advantages:
- If the end user is already doing this to account for node shutdown, there is no repetition of work.
- This would automatically remove the GameServer from the allocation pool, since allocation skips any GameServer that has a DeletionTimestamp value.
But there are the following disadvantages:
- All GameServer binaries would be forced to respond to the SIGTERM signal as well as implement a terminationGracePeriodSeconds configuration, otherwise they would always be shut down. With labels and annotations, they can be safely ignored. (A sketch of what such a handler would look like follows this list.)
- With SIGTERM there is no indication of why the GameServer is being asked to gracefully shut down (it could be a node shutdown!), whereas with a label/annotation, the end user can pass along whatever information they wish.
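For context, a rough sketch of the SIGTERM handling that every game server binary would be forced to implement under that alternative (drainPlayers is a hypothetical stand-in for real drain logic):

package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Listen for SIGTERM, which Kubernetes sends when the Pod is deleted.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	// ... run the game server ...

	<-ctx.Done()
	// We only know *that* we are being terminated, not *why* (Fleet update,
	// node drain, manual delete...). Everything must finish before the Pod's
	// terminationGracePeriodSeconds expires and Kubernetes sends SIGKILL.
	log.Println("SIGTERM received, draining players")
	drainPlayers(30 * time.Minute)
}

// drainPlayers is a hypothetical placeholder for kicking or migrating players.
func drainPlayers(d time.Duration) {
	time.Sleep(d)
}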
That all being said, if we get user feedback asking for an option of deleting Allocated GameServers on Fleet update (to create a SIGTERM signal), we could add it at a later date with the above design.
I had similar concerns. https://agones.slack.com/archives/C9DGM5DS8/p1647513867843959
Currently, when the match server makes an allocation request to Agones, it puts the version in the selector to ask "please assign the latest version as a priority". Then, on the multi server, I wrote code that watches player.count and shuts the server down if there have been no users for a while (30 minutes).
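For illustration, an allocation request along those lines might look like the following GameServerAllocation, assuming a user-managed version label on the Fleet's GameServer template (the fleet name and version value here are hypothetical):

apiVersion: "allocation.agones.dev/v1"
kind: GameServerAllocation
spec:
  # try GameServers carrying the newest version label first
  preferred:
    - matchLabels:
        version: "v2"
  # fall back to any Ready GameServer in the fleet
  required:
    matchLabels:
      agones.dev/fleet: simple-game-server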
The first thing I thought about was using the SIGTERM approach suggested by @markmandel. But, as Mark said, using SIGTERM means you don't know "what" reason SIGTERM was sent for, and the game server must support it. (When SIGTERM arrives it is possible to check the reason by checking an annotation, but that feels unnatural.)
The 'onUpdate' method, on the other hand, seems to have a more natural flow.
Community feedback from my livestream when I wrote the above design:
Brittan J Thomas (4 days ago): RE: "onUpdate.replica" - does "replicaOverflow" make more sense?
Replacing the key replica, even with just overflow - I think that would work well. How do people feel?
@zmerlynn - This might interest you, since it's similar to disruption cases you've been exploring.
Updated the design a little bit.
Was looking at this again, and realised that I don't think we need two different configs for Fleet scaling and Fleet updates - they are essentially the same thing from a GameServer point of view -- we have too many Allocated GameServers.
One thing I did note though, is that when doing a rolling update, I saw the following on two GameServerSets
kubectl get gsset
NAME SCHEDULING DESIRED CURRENT ALLOCATED READY AGE
simple-game-server-hzh2h Packed 3 3 0 3 49s
simple-game-server-x4m5s Packed 1 2 2 0 2m51s
(simple-game-server-x4m5s is the older GameServerSet)
We can see that it's got a desired state of 1, and I'm wondering if it should be 0? Might be tied to the 25% default burst / rolling update values -- so I'm going to ignore it for the initial pass of this, and see if it becomes an issue.
Right as I wrote this, I scaled the Fleet up to 10 replicas, and confirmed that it is, since we end up at:
$ kubectl get gsset
NAME SCHEDULING DESIRED CURRENT ALLOCATED READY AGE
simple-game-server-hzh2h Packed 8 8 0 8 18m
simple-game-server-x4m5s Packed 0 2 2 0 20m
TL;DR - no need to edit any of the GameServerSet scaling code here.
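For reference, the 25% burst values mentioned above are the Fleet's rolling update defaults; they can also be set explicitly on the Fleet spec, roughly like this sketch (the values shown are the defaults):

apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
  name: simple-game-server
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%       # how many extra GameServers may be created during the update
      maxUnavailable: 25% # how many GameServers may be unavailable at a time during the update
  template:
    spec:
      ports:
        - name: default
          containerPort: 7654
      template:
        spec:
          containers:
            - name: simple-game-server
              image: gcr.io/agones-images/simple-game-server:0.13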
Getting close on this... not sure if it'll make RC next week, but definitely the release after I reckon.
Release is next Tuesday. I have the functionality pretty much done, but I'm still splitting PRs. This might slip into the next release... but we'll see how it goes. I give it 60% that it will land in time.
Given that 1.30.0 is tomorrow, this will slip to 1.31.0.
Just a small question: should we also apply this to the Reserved GameServers of the old GameServerSet? Similar to the Allocated GameServers, I think it might also make sense to notify the Reserved GameServers that the Fleet is being updated, so the customer can take action accordingly, e.g. release resources such that the GameServer can be replaced by a new version.
Given that 1.30.0 is tomorrow, this will slip to 1.31.0
...and it slipped again, because GDC.
Re: notifying the Reserved GameServers of the old GameServerSet - that's a good question. I'm leaning towards "no", or at least "not yet" (and wait and see what the feedback is)? Just because reservation tends to be on a per-gameserver basis, each game server doesn't know whether it should let go of its reservation when there's a new version - and if they all do, there are no gameservers for anyone to go to.
I think it's probably better to allow the Reserved -> Ready / Allocated flow to continue through, and then let the process manage things from there -- less opportunity for disruption.
But definitely would like to be open to perspectives from people actually using Reserved.
Makes sense to me. Also, thinking about the potential use cases, I feel that Reserved is a more transient state than Allocated, so allowing the Reserved -> Ready / Allocated flow to continue sounds reasonable, although by definition a GS can be reserved forever.