[Feature Proposal] - Make the Agones sidecar send a signal to the gameserver container
Is your feature request related to a problem? Please describe.
In our runs we have seen that if the game server stalls for longer than the health check is configured for, the game server is terminated and the pod restarted. This is expected behavior from Kubernetes and Agones.
Our problem is that we have no idea why that specific game server container has stalled and therefore cannot do any investigation.
If we can make the sidecar container send a signal, something like `kill -4`, then we can use the generated core dump to inspect what was happening on the game server at that specific time and find the root cause.
Describe the solution you'd like
We would love to be able to configure the Agones sidecar container to send a signal to the game server container when the state of the game server is Unhealthy, in order to force a core dump to be generated.
Describe alternatives you've considered
We have tried adding a preStop hook to our container to force a crash by sending a signal.
The preStop hook connects to the sidecar and requests the GameServer spec so that we can check whether the state is Unhealthy.
From our investigations it is clear that the preStop hook isn't working: either the preStop hook is not being called, or the Agones sidecar container has shut down before the preStop hook can fetch the necessary GameServer spec.
From the logs we can see that the sidecar does not wait for the container to terminate before closing.
See this thread for the exact logs: https://agones.slack.com/archives/C9DGM5DS8/p1739901809862829
Discussion Link (if any)
Some investigation done on this is captured in Slack: https://agones.slack.com/archives/C9DGM5DS8/p1739901809862829
Are you using Agones' current sidecar, or the new sidecar feature flag #3642?
We’ve only tried the current sidecar; I didn’t know the new sidecar feature was released. Do you think that would help in this specific case? The new feature, that is.
The new sidecar feature should help with the lifecycle. In particular for your use case the sdkserver should be alive for the entire lifecycle of the pod's main container. This wouldn't necessarily solve your use case, but it's possible it could work with the pre-stop hook you'd mentioned as an alternative.
That being said, the new sidecar was just released in alpha and requires improvements (#4188). Eventually we will be moving everything over to SidecarContainers, so if this feature needs changes in the main code base we should take that into account.
🤔 How could a sidecar even send a Linux signal to a running process in a different container? I don't think we have that kind of control within a Pod.
Do you have suggestions on how this would be possible? Because I don't think this actually can be done.
You can enable this with `shareProcessNamespace` on the Pod spec. But that has security ramifications.
https://kubernetes.io/docs/tasks/configure-pod-container/share-process-namespace/
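For reference, a minimal sketch (not from the thread; the name and image are placeholders) of what this looks like on a GameServer's pod template:

```yaml
# Hypothetical example: enable a shared PID namespace so one container can
# see and signal processes running in another container of the same pod.
apiVersion: agones.dev/v1
kind: GameServer
metadata:
  name: example-gameserver          # placeholder
spec:
  template:
    spec:
      shareProcessNamespace: true   # note the security ramifications mentioned above
      containers:
        - name: gameserver
          image: example/gameserver:latest   # placeholder image
```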
> But that has security ramifications.
> Our problem is that we have no idea why that specific game server container has stalled and therefore cannot do any investigation.
But on a serious note -- no log aggregation for your game server clusters?
We are running our pods with a shared PID namespace. All of our containers are built from scratch images, so there is not much you can do with them.
> Do you have suggestions on how this would be possible?
Our preStop hook is quite simple (a sketch follows the list):
- Connect to the sidecar
- Read the state of the game server
- If Unhealthy, iterate through all `/proc/<pid>/status` entries and search for the one that contains the gameserver container name
- If we find it, we send a regular `syscall.Kill(pid, syscall.SIGTRAP)`
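A minimal sketch of such a hook, assuming a shared PID namespace and the Agones Go SDK; the process name `gameserver-bin` is a placeholder, and it matches on `/proc/<pid>/comm` rather than parsing `/proc/<pid>/status`:

```go
// Hypothetical preStop hook sketch: ask the sidecar for the GameServer state
// and, if it is Unhealthy, send SIGTRAP to the game server process to force a
// core dump. Assumes a shared PID namespace within the pod.
package main

import (
	"log"
	"os"
	"strconv"
	"strings"
	"syscall"

	sdk "agones.dev/agones/sdks/go"
)

const gameServerProcessName = "gameserver-bin" // placeholder: the game server binary name

func main() {
	s, err := sdk.NewSDK() // connect to the sidecar SDK server
	if err != nil {
		log.Fatalf("could not connect to SDK server: %v", err)
	}
	gs, err := s.GameServer()
	if err != nil {
		log.Fatalf("could not fetch GameServer: %v", err)
	}
	if gs.Status == nil || gs.Status.State != "Unhealthy" {
		return // normal shutdown, nothing to do
	}

	// Walk /proc and signal the process whose name matches the game server binary.
	entries, err := os.ReadDir("/proc")
	if err != nil {
		log.Fatalf("could not read /proc: %v", err)
	}
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a PID directory
		}
		comm, err := os.ReadFile("/proc/" + e.Name() + "/comm")
		if err != nil {
			continue
		}
		if strings.TrimSpace(string(comm)) == gameServerProcessName {
			// SIGTRAP (like SIGILL from `kill -4`) makes the process dump core.
			if err := syscall.Kill(pid, syscall.SIGTRAP); err != nil {
				log.Printf("failed to signal pid %d: %v", pid, err)
			}
		}
	}
}
```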
> But on a serious note -- no log aggregation for your game server clusters?
We do have log aggregation. The problem is that when our gameserver container ends up in a deadlock it stops printing completely, and it happens at such random intervals that there are no log lines to help pinpoint this.
On a different note, I tried upgrading to Agones 1.49.0 and enabling SidecarContainers. To test whether this works, I added a preStop hook that does this:
- Connect to the Agones gameserver SDK container
- Read the `.Status` of the game server
- If Ready, then send a `kill -4`
When I do a normal `kubectl delete pod <gameserver_pod>`, absolutely nothing happens.
If I disable the SidecarContainers feature and re-run the steps above, then I do get a crash generated.
But here is the kicker: when the health checks fail, the sidecar terminates before we can read from it, so the preStop hook cannot do anything.
With the new SidecarContainers feature I cannot seem to trigger anything at all, which makes debugging harder.
What container is your preStop hook configured against?
The gameserver container
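For context, a minimal sketch (not from the thread; the image and hook binary path are placeholders) of how such a hook is wired onto the game server container:

```yaml
# Hypothetical example: run a preStop hook inside the game server container
# so it can query the sidecar and signal the game server process on Unhealthy.
spec:
  template:
    spec:
      shareProcessNamespace: true
      containers:
        - name: gameserver
          image: example/gameserver:latest   # placeholder image
          lifecycle:
            preStop:
              exec:
                command: ["/prestop-hook"]   # placeholder: the hook binary sketched earlier
```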
> But here is the kicker: when the health checks fail, the sidecar terminates before we can read from it, so the preStop hook cannot do anything.
That seems.. odd.
From: https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/#sidecar-containers-and-pod-lifecycle
> If an init container is created with its restartPolicy set to Always, it will start and remain running during the entire life of the Pod.
So it definitely should be running - unless the description is wrong.
> A call to the PreStop hook fails if the container is already in a terminated or completed state and the hook must complete before the TERM signal to stop the container can be sent.
Maybe? Is the container already terminated somehow? That would explain what happens above.
For another fun idea: https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/ - should be beta in 1.30+
Rather than explicitly doing a signal call -- let's be more generic and capture the entire container for restart somewhere else.
You can always run your own sidecar that watches its owning GameServer for state changes through the Kubernetes API and does whatever you want with your app when it changes to Unhealthy?
This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add the 'awaiting-maintainer' label or add a comment. Thank you for your contributions.
I found another way of dealing with this.
I've created a game server observer pod, which uses informers to watch Agones events. Basically it does the equivalent of `kubectl -n gs get events --field-selector type=Warning,involvedObject.kind=GameServer,reportingComponent=gameserver-sidecar`.
If an event is found, we proceed to crash the game server process by installing an ephemeral container whose sole job is to send a SIGILL to the gameserver process.
It may not be pretty, but it got us around the issue.
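Roughly, a minimal sketch of that observer, simplified to a plain watch instead of informers; the namespace `gs` and the `crashGameServer` reaction are placeholders:

```go
// Hypothetical observer sketch: watch for Warning events reported by the
// gameserver-sidecar and react to the affected GameServer's pod.
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Equivalent of:
	//   kubectl -n gs get events -w \
	//     --field-selector type=Warning,involvedObject.kind=GameServer,reportingComponent=gameserver-sidecar
	w, err := client.CoreV1().Events("gs").Watch(context.Background(), metav1.ListOptions{
		FieldSelector: "type=Warning,involvedObject.kind=GameServer,reportingComponent=gameserver-sidecar",
	})
	if err != nil {
		log.Fatal(err)
	}
	for ev := range w.ResultChan() {
		e, ok := ev.Object.(*corev1.Event)
		if !ok {
			continue
		}
		log.Printf("unhealthy GameServer %s: %s", e.InvolvedObject.Name, e.Message)
		// Agones names the backing pod after the GameServer, so the event's
		// involved object name can be used as the pod name.
		crashGameServer(client, e.InvolvedObject.Namespace, e.InvolvedObject.Name)
	}
}

// crashGameServer is a placeholder for attaching an ephemeral debug container
// (roughly what `kubectl debug <pod> --image=busybox --target=gameserver ...`
// does) that sends SIGILL to the game server process.
func crashGameServer(_ kubernetes.Interface, namespace, pod string) {
	log.Printf("would inject ephemeral container into %s/%s", namespace, pod)
}
```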
I'll close this issue then, unless anyone thinks this feature is still a good thing to have?