Allow graceful restarts of game server containers instead of shutdown
Is your feature request related to a problem? Please describe.
Pod churn (creating and deleting pods) is a relatively expensive operation. In cases where pod churn is particularly high, churn can get in the way of a fleet's ability to scale, since both are contending for the same resources.
Describe the solution you'd like
Have some value in the game server health configuration like AllowGracefulContainerRestart. If this flag is set to true, instead of calling Shutdown() on the SDK, we'd just exit the game server process with a zero exit code. The health controller would not mark this as unhealthy (as long as we had a zero exit code). The game server would then go back to the Starting state and would run normally from there.
In the case where a game server exited with a non-zero exit code or exited before being allocated, that would indicate an unhealthy game server and the pod should be restarted.
This would remove the costly pod churn operation, while still allowing us to have a brand new game server container.
Describe alternatives you've considered
- "Just calling Ready again": this is definitely a realistic option and is described in the docs. The issue is that it means that we have to be very diligent about ensuring we're actually clearing state between game server. Accidentally carrying over state can have privacy and security concerns. That seems like more of an optimization around the time to start a new game server being relatively high, which is not quite the case that this issue is trying to work around.
- Wrapping the game server with another process that sits in the same container and acts as a bridge between Agones and our game server: also viable, but more difficult, and it burns more resources. It also leaves us in a similar situation if we muck with the container's state in any way, so it mitigates some of the issue, but not all of it.
Additional context
This would be targeting a fairly specific case where the cost to start a new game server is very low, but the impact of starting a new pod is relatively high.
This is an interesting idea 🤔 and I can see the value!
For reference, this is where we track container failure and move containers to Unhealthy: https://github.com/googleforgames/agones/blob/d70793c00abc4774265eecb4ff07bbcd3fde2250/pkg/gameservers/health.go#L135-L148
At first, I didn't think that the K8s API actually gave us access to status codes when a termination occurs, but I was wrong!
As we can see here: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.25/#containerstateterminated-v1-core there is an exitCode provided, so in theory, this does seem possible!
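As a quick illustration (this is not the health controller's actual code, just a sketch against the core/v1 API; the container name is a placeholder), reading that exit code from a Pod's status would look something like:

```go
// Sketch: pull the exit code of a container's most recent termination out of
// a Pod's status. Field names follow the core/v1 API; the container name
// "gameserver" is a placeholder, not necessarily what Agones uses.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// lastExitCode returns the exit code of the named container's most recent
// termination, and whether a termination record was found at all.
func lastExitCode(pod *corev1.Pod, containerName string) (int32, bool) {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name != containerName {
			continue
		}
		if term := cs.LastTerminationState.Terminated; term != nil {
			return term.ExitCode, true
		}
		if term := cs.State.Terminated; term != nil {
			return term.ExitCode, true
		}
	}
	return 0, false
}

func main() {
	// In a real controller the Pod would come from an informer/lister.
	pod := &corev1.Pod{}
	if code, ok := lastExitCode(pod, "gameserver"); ok {
		fmt.Printf("container exited with code %d\n", code)
	}
}
```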
Additional thought/detail/questions:
- Does an exit(0) cause a Pod to self terminate? I can never remember if it does (I think it might?), but if it does, we could specify a particular exit code value (not already utilised by K8s/containers) that allows a restart to occur. That may actually be an easier way to do the above: rather than have a special config value, have a specific exit code.
- On restart of the container, should Agones automatically move the GameServer back to Ready? I lean towards "No", just because I prefer things to be explicit.
- Or maybe on restart it should be moved back to Scheduled, to take it out of any kind of Allocation pool? This could also be managed by label selectors, but might be worth considering.
In response to the questions:
- I don't think that causes a self terminate. I was checking whether this was already possible by exiting with code 0, and Kubernetes was restarting the container gracefully, just in time to be stomped on by the health controller. Either solution works for me; the flag feels a little more intentional, but the exit code version can be introduced without changing the API.
- Agreed that it shouldn't be moved back to Ready.
- Yeah, Scheduled seems like the move.
Ideally it would be really nice to have a Restart() method on the SDK that would indicate that this is all about to happen and initiate the move to Scheduled, but that would also be a much larger change.
> Ideally it would be really nice to have a Restart() method on the SDK that would indicate that this is all about to happen and initiate the move to Scheduled, but that would also be a much larger change.
Not a huge change. Actually, that might be a better solution in many ways.
Since, before the GameServer is Ready, we already allow the container to restart and don't move it to Unhealthy at that point. There might be a race condition, though, between moving to Scheduled through the SDK (since that is all asynchronously processed) and when to actually restart the container process 🤔
One solution is that the game server binary watches for the Allocated -> Scheduled state change to occur and then restarts. Is that too much integration work?
I actually think it's the least amount of work on the Agones side, since we already test that restarts are allowed before Ready.
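As a rough sketch of what that integration could look like on the game server side, using the Go SDK's WatchGameServer callback (the state names and the exit-with-code-0-to-restart behaviour are assumptions from this discussion, not an existing pattern):

```go
// Sketch: watch for the GameServer moving from Allocated back to Scheduled
// and exit the process cleanly so the container (or a wrapper) restarts it.
// This is an illustration of the idea above, not an established pattern.
package main

import (
	"log"
	"os"

	coresdk "agones.dev/agones/pkg/sdk"
	sdk "agones.dev/agones/sdks/go"
)

func main() {
	s, err := sdk.NewSDK()
	if err != nil {
		log.Fatalf("could not connect to SDK server: %v", err)
	}

	var wasAllocated bool
	err = s.WatchGameServer(func(gs *coresdk.GameServer) {
		state := gs.GetStatus().GetState()
		switch {
		case state == "Allocated":
			wasAllocated = true
		case wasAllocated && state == "Scheduled":
			// We were allocated and have been moved back to Scheduled:
			// exit cleanly so the container/wrapper restarts us fresh.
			log.Print("moved back to Scheduled, restarting")
			os.Exit(0)
		}
	})
	if err != nil {
		log.Fatalf("could not watch GameServer: %v", err)
	}

	// ... normal game server startup, s.Ready(), and so on.
	select {}
}
```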
That's totally a manageable amount of integration! If that's an option, then that would actually be my preferred solution mostly because it makes it significantly easier to figure out why a game server pod behaved the way it did, and seems to fit in with the other ways that a game server moves between states.
That being said, I just remembered that CrashLoopBackOff is likely to throw a fairly large wrench in the works of something like this. The time to reset the backoff is 10 minutes (of continuous exit-free run time): anything less results in the backoff doubling, starting at 10 seconds and increasing up to 5 minutes. We'd definitely get some value out of the first restart (immediate) and probably some out of the second (10 seconds), but anything past that and I'm not sure the tradeoff makes sense anymore. However, I still think there is value in getting that first reset (and setting a label or annotation of "last restart time" or something so we can determine whether another restart makes sense), but this definitely isn't as straightforward a solution as I initially thought it would be.
There's an issue open in the kubernetes repo to allow tuning of the backoff, but it's several years old without any real progress: https://github.com/kubernetes/kubernetes/issues/57291
> because it makes it significantly easier to figure out why a game server pod behaved the way it did, and seems to fit in with the other ways that a game server moves between states.
That is a really good point. You will be able to see from the GameServer event stream that it is the SDK that moved the GameServer back to Scheduled, so you know exactly what is going on.
> That being said, I just remembered that CrashLoopBackOff is likely to throw a fairly large wrench in the works of something like this.
Erk! That sucks, I really liked this idea!
I do like the workaround of having a shell script, something like:
```bash
#!/bin/bash
while true; do
  ./gameserver
done
```
Although probably with better exit code handling.
Two thoughts:
- Is it still useful to be able to go back to being Scheduled in that scenario?
- Do you think we could build a generic enough wrapper for a process that we could drop it in our examples dir? (I think we can - basically the same loop as above, but exit on a non-zero exit code) - and would that be useful?
- It's definitely useful, and semantically correct. The game server is (hopefully) booting back up, so we should probably treat it as such. It isn't allocated any longer, and it definitely isn't ready.
- A provided example wrapper would definitely be useful, especially in demonstrating how the Restart() method is intended to be used (showing that you shouldn't restart the container).
I'd still like some way to restart the container when viable, just to clear out any changes we may have made to the state in the container. So ideally somewhere between the wrapper and the game server we'd be able to determine that we've crossed the restart threshold and it should be safe to restart the container without incurring any backoff penalties. I'll throw something together that does all that when I get a minute.
Got bored and wrote some bash scripts, and they aren't that bad!
loop.sh
```bash
#!/bin/bash
while "$@"; do :; done
```
run.fail.sh
```bash
#!/bin/bash
echo "FAIL!"
sleep 1
exit 2
```
run.pass.sh
```bash
#!/bin/bash
echo "Running!"
sleep 1
```
Results:
```
➜ shell ./loop.sh ./run.pass.sh
Running!
Running!
Running!
Running!
Running!
Running!
Running!
Running!
^C
➜ shell ./loop.sh ./run.fail.sh
FAIL!
➜ shell
```
Actually yeah, for an example those are way simpler than what I was talking about and much easier to understand. I can build whatever container restart timing stuff I need on top of that.
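For illustration, a sketch of the kind of restart-timing logic described above, written in Go (the 10-minute threshold mirrors the backoff-reset window mentioned earlier; the binary path and the exact policy are placeholders, not an agreed design):

```go
// Sketch of a wrapper that restarts the game server process in-place, but
// exits the container once it has been up long enough that a container
// restart should no longer accrue CrashLoopBackOff penalties.
package main

import (
	"log"
	"os"
	"os/exec"
	"time"
)

// backoffResetWindow mirrors the ~10 minutes of clean run time Kubernetes
// requires before the crash-loop backoff resets (assumption from this thread).
const backoffResetWindow = 10 * time.Minute

func main() {
	containerStart := time.Now()

	for {
		cmd := exec.Command("./gameserver") // placeholder binary path
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr

		if err := cmd.Run(); err != nil {
			// Non-zero exit: let the container die with a failure so the
			// existing unhealthy handling kicks in.
			log.Printf("game server failed: %v", err)
			os.Exit(1)
		}

		if time.Since(containerStart) > backoffResetWindow {
			// Enough clean run time has accrued that a container restart
			// shouldn't incur a backoff penalty, so take the opportunity
			// to get a completely fresh container.
			os.Exit(0)
		}

		// Otherwise restart the process in-place to avoid CrashLoopBackOff.
		log.Print("restarting game server process in-place")
	}
}
```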
The only downside I can see to exiting the process / container is that you would need to load all your data into memory again, rather than being able to reuse it if you switched back to a zero state.
But if you don't have much / any data to load (thinking relay servers here), this could be an interesting approach.
As a workaround for doing the above right now, rather than waiting for an SDK call to switch back to Scheduled, you could set a label that would move that GameServer out of any Allocation filtering that would occur.
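For illustration, a minimal sketch of that label workaround using the Go SDK's SetLabel (the label key/value and how allocations filter on it are placeholders):

```go
// Sketch: mark this GameServer as "restarting" so allocation selectors that
// filter on this label skip it while the process restarts. The key/value are
// placeholders; the SDK prefixes metadata keys with "agones.dev/sdk-".
package main

import (
	"log"

	sdk "agones.dev/agones/sdks/go"
)

func main() {
	s, err := sdk.NewSDK()
	if err != nil {
		log.Fatalf("could not connect to SDK server: %v", err)
	}

	// Becomes the label "agones.dev/sdk-restarting: true" on the GameServer,
	// which GameServerAllocation selectors can be written to exclude.
	if err := s.SetLabel("restarting", "true"); err != nil {
		log.Fatalf("could not set label: %v", err)
	}

	// ... restart the game server process, then flip the label back and
	// signal readiness again once it is safe to be allocated.
}
```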
I've been wondering if there are solutions we could do on the SDK that wouldn't even really look like an exit: if you had a "golden" point in your initialization that you wanted to preserve, we could potentially fork() at that point and keep a hot standby process. One approach that might work is to have the original process bind to the healthcheck port first, then the child of the fork would just wait on being able to bind to it. I thiiink (would have to check) that as long as there's a process receiving liveness checks, you should be able to flap between processes like this indefinitely.
> If you had a "golden" point in your initialization that you wanted to preserve, we could potentially fork() at that point and keep a hot standby process.
I assume you mean, this would be something you do inside your game server binary process itself?
Yeah. It would require the cooperation of the game server binary, so it's intrusive (unlike the reporter's solution, which would work on legacy binaries), but it has a similar advantage: you no longer have to reason about whether you reset the game state sufficiently; the child process is effectively the reset state. As long as, when the child becomes the primary server, you fork a new child before accruing state, you'll be at the "golden" state.
@zmerlynn that makes sense! Then you could reuse in-memory state at that point.
So I'm thinking of two things for this ticket:
- An SDK.Reset() function (I don't love that name though, can we come up with something better?) that returns the GameServer back to a Scheduled state.
- A section in https://agones.dev/site/docs/integration-patterns/ on how to restart the process and return to the Scheduled state (can cover both bash and forking a process).
Some other ideas for method names, none of which I'm super happy about:
- SDK.Scheduled() (Since we have SDK.Ready() and SDK.Shutdown())
- SDK.SignalScheduled()
Howzat sound?
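To make that concrete, a purely hypothetical sketch of where such a call would sit in a game server loop (neither SDK.Reset() nor SDK.Scheduled() exists today; the name and semantics are exactly what's being discussed):

```go
// Purely hypothetical: Reset()/Scheduled() does not exist in the SDK; this
// only illustrates where such a call would sit in a game server loop.
package main

import (
	"log"

	sdk "agones.dev/agones/sdks/go"
)

func main() {
	s, err := sdk.NewSDK()
	if err != nil {
		log.Fatalf("could not connect to SDK server: %v", err)
	}

	for {
		loadSession() // hypothetical: prepare per-session state

		if err := s.Ready(); err != nil { // existing SDK call
			log.Fatalf("could not move to Ready: %v", err)
		}

		runSession() // hypothetical: block until the match finishes

		// Hypothetical call under discussion: move back to Scheduled
		// (instead of Shutdown()) so the process/container can be reused.
		// if err := s.Reset(); err != nil { ... }
	}
}

func loadSession() {}
func runSession()  {}
```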
I poked at my fork suggestion briefly and... maybe it would work in a language other than Go, but at least in Go, fork support is pretty much limited to fork/exec and not "true" fork(), due to needing to copy the state of goroutines/etc.
@austin-space I loved the idea here so much, I put together a rather long proposal in #2794, which brings together a couple of different ideas (disruption controls and pod reuse). Rather than keep a separate issue, I'm going to close this issue out and redirect conversations towards #2794.
We decided the (original) proposal in #2794 was biting off too much at once and stripped it back to the core of the issue, so I'm reopening this one. I'd like to keep looking into this, though, as I think there's a real opportunity here to offer simple between-container reuse.
Hi! I am interested in using this feature, are there any updates on it?
@miai10 I think we are blocked on https://github.com/kubernetes/kubernetes/issues/57291 for the time being. We have discussed internally options to get around blocking on that - and I'll be discussing it directly with SIG-node soon.
For the time being https://agones.dev/site/docs/integration-patterns/reusing-gameservers/ is the supported method for Pod re-use.
I was at the sig node meeting today. Have you considered having a supervisord (or similar init) process managing the game server within the container? The supervisor would restart the game server process without restarting the entire container.
@rphillips That's not too different from https://github.com/googleforgames/agones/issues/2781#issuecomment-1295394173, which is a hacky bash loop. I think it's an okay approach to handle it outside the process but within the container. That said, we've had specific requests for exiting the container fully instead - specifically, users don't want to have to reason about whether the process shut down "cleanly" within the container. In the game server context, a lot of users are using a framework (Unity, Unreal) that wraps their simulation, and users have seen cases where e.g. the framework modified Windows registry keys and left the container in a funky state.
So a supervisor is feasible if you're willing to reason about the container state in-between restarts, but allowing the container to be restarted would be ideal.
This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale', please add the 'awaiting-maintainer' label or add a comment. Thank you for your contributions.
There is some traction on https://github.com/kubernetes/kubernetes/issues/57291, so keeping this from getting staled-out for now. If it can be changed upstream, we can do some pretty interesting patterns in Agones transparently.