
[Feature Proposal] - Make agones sidecar send signal to gameserver container

Open swermin opened this issue 7 months ago • 10 comments

Is your feature request related to a problem? Please describe. In our runs we have seen that if the game server stalls for longer than the health check is configured to tolerate, the game server is terminated and the pod restarted. This is expected behavior from Kubernetes and Agones. Our problem is that we have no idea why that specific game server container stalled, so we cannot investigate. If we could make the sidecar container send a signal, something like `kill -4` (SIGILL), we could use the generated core dump to inspect what was happening on the game server at that specific time and find the root cause.

Describe the solution you'd like We would love to be able to configure the Agones sidecar container to send a signal to the game server container when the state of the game server is Unhealthy, in order to force a core dump to be generated.

Describe alternatives you've considered We have tried adding a preStop hook to our container to force a crash by sending a signal. The preStop hook connects to the sidecar and requests the GameServer spec so that we can check whether the state is Unhealthy. From our investigations it is clear that the preStop hook isn't working: either the preStop hook is not being called, or the Agones sidecar container has shut down before the preStop hook can fetch the necessary GameServer spec. From the logs we can see that the sidecar does not wait for the container to terminate before closing.
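For reference, a minimal sketch of such a check using the Agones Go SDK (the state comparison and error handling here are illustrative, not the exact hook described above):

```go
package main

import (
	"log"

	sdk "agones.dev/agones/sdks/go"
)

func main() {
	// Connect to the local Agones sidecar (the SDK server).
	s, err := sdk.NewSDK()
	if err != nil {
		log.Fatalf("could not connect to sidecar: %v", err)
	}

	// Fetch the GameServer configuration and status held by the sidecar.
	gs, err := s.GameServer()
	if err != nil {
		log.Fatalf("could not fetch GameServer: %v", err)
	}

	if gs.Status.State == "Unhealthy" {
		// Signal the game server process here to force a core dump
		// (see the /proc scan sketch further down the thread).
		log.Print("game server is Unhealthy")
	}
}
```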

See this thread for the exact logs: https://agones.slack.com/archives/C9DGM5DS8/p1739901809862829



Discussion Link (if any) Some investigation done on this is captured in Slack: https://agones.slack.com/archives/C9DGM5DS8/p1739901809862829

swermin avatar May 28 '25 08:05 swermin

Are you using Agones' current sidecar, or the new sidecar feature flag #3642?

igooch avatar May 28 '25 17:05 igooch

We've only tried the current sidecar; I didn't know the new sidecar feature was released. Do you think that would help in this specific case? The new feature, that is.

swermin avatar May 28 '25 20:05 swermin

The new sidecar feature should help with the lifecycle. In particular, for your use case the sdkserver should stay alive for the entire lifecycle of the pod's main container. This wouldn't necessarily solve your use case, but it's possible it could work with the preStop hook you'd mentioned as an alternative.

That being said, the sidecar feature was just released in alpha and requires improvements (#4188). Eventually we will be moving everything over to SidecarContainers, so if this feature needs changes in the main code base we should take that into account.

igooch avatar May 29 '25 21:05 igooch

🤔 How could a sidecar even send a Linux signal to a running process in a different container? I don't think we have that kind of control within a Pod.

Do you have suggestions on how this would be possible? Because I don't think this actually can be done.

markmandel avatar May 31 '25 15:05 markmandel

You can enable this with shareProcessNamespace on the Pod spec. But that has security ramifications.

https://kubernetes.io/docs/tasks/configure-pod-container/share-process-namespace/
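For reference, a minimal sketch of that field using the Kubernetes core/v1 Go types (the container name and image are placeholders; with Agones this would presumably go on the GameServer's pod template):

```go
package gsconfig

import (
	corev1 "k8s.io/api/core/v1"
)

// sharedPIDPodSpec returns a pod spec in which all containers share a single
// PID namespace, so a sidecar can see and signal the game server process.
func sharedPIDPodSpec() corev1.PodSpec {
	share := true
	return corev1.PodSpec{
		ShareProcessNamespace: &share,
		Containers: []corev1.Container{
			{Name: "gameserver", Image: "example/gameserver:latest"},
		},
	}
}
```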

towolf avatar Jun 04 '25 13:06 towolf

> But that has security ramifications.


> Our problem is that we have no idea why that specific game server container stalled, so we cannot investigate.

But on a serious note -- no log aggregation for your game server clusters?

markmandel avatar Jun 04 '25 19:06 markmandel

We are running our pods with a shared PID namespace. All of our containers are built from scratch, so there is not much you can do with them.

> Do you have suggestions on how this would be possible?

Our preStop hook is quite simple (a sketch of the last two steps follows the list):

  1. Connect to the sidecar
  2. Read the state of the game server
  3. If the state is Unhealthy:
  4. Iterate through every /proc/<pid>/status and search for the one that contains the game server container name
  5. If we find it, send a regular syscall.Kill(pid, syscall.SIGTRAP)
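A minimal sketch of steps 4 and 5, assuming a shared PID namespace and matching on the Name field of /proc/<pid>/status (the process name to match is a placeholder):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"
)

// findAndSignal scans /proc/<pid>/status for a process whose Name field
// matches procName and sends it SIGTRAP to force a core dump.
func findAndSignal(procName string) error {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		return err
	}
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a numeric PID directory
		}
		status, err := os.ReadFile(filepath.Join("/proc", e.Name(), "status"))
		if err != nil {
			continue // process may have exited; skip it
		}
		// The first line of /proc/<pid>/status is "Name:\t<command>".
		first := strings.SplitN(string(status), "\n", 2)[0]
		if strings.TrimSpace(strings.TrimPrefix(first, "Name:")) == procName {
			return syscall.Kill(pid, syscall.SIGTRAP)
		}
	}
	return fmt.Errorf("process %q not found", procName)
}
```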

> But on a serious note -- no log aggregation for your game server clusters?

We do have log aggregation. The problem is that when our game server container ends up in a deadlock it stops printing completely, and since it happens at random intervals there are no log lines to help pinpoint it.

On a different note, I tried upgrading to Agones 1.49.0 and enabling SidecarContainers. To test whether this works I added a preStop hook that does this:

  1. Connect to the Agones game server SDK container
  2. Read the .Status of the game server
  3. If Ready, then send a kill -4

When I do a normal kubectl delete pod <gameserver_pod>, absolutely nothing happens. If I disable the SidecarContainers feature and re-run the steps above, then I do get a crash generated.

But here is the kicker: when the health checks fail, the sidecar terminates before we can read from it, so the preStop hook cannot do anything.

With the new SidecarContainers feature I cannot seem to trigger anything at all, which makes debugging harder.

swermin avatar Jun 05 '25 18:06 swermin

What container is your preStop hook configured against?

markmandel avatar Jun 05 '25 22:06 markmandel

The gameserver container

swermin avatar Jun 06 '25 07:06 swermin

> But here is the kicker: when the health checks fail, the sidecar terminates before we can read from it, so the preStop hook cannot do anything.

That seems... odd.

From: https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/#sidecar-containers-and-pod-lifecycle

> If an init container is created with its restartPolicy set to Always, it will start and remain running during the entire life of the Pod.

So it definitely should be running - unless the description is wrong.

> A call to the PreStop hook fails if the container is already in a terminated or completed state and the hook must complete before the TERM signal to stop the container can be sent.

Maybe? Is the container already terminated somehow? That would explain what happens above.

For another fun idea: https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/ - should be beta in 1.30+

Rather than explicitly doing a signal call -- let's be more generic and capture the entire container for restart somewhere else.
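For reference, a minimal sketch of calling that kubelet checkpoint endpoint from Go. It assumes the ContainerCheckpoint feature gate is on, a CRI runtime that supports checkpointing, and cluster-specific kubelet credentials; the namespace, pod, and container names are placeholders:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Kubelet checkpoint endpoint: /checkpoint/{namespace}/{pod}/{container}.
	url := "https://localhost:10250/checkpoint/gs/my-gameserver-pod/gameserver"

	req, err := http.NewRequest(http.MethodPost, url, nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Kubelet authn/authz is cluster-specific; a bearer token is one option.
	req.Header.Set("Authorization", "Bearer "+os.Getenv("KUBELET_TOKEN"))

	client := &http.Client{Transport: &http.Transport{
		// Skipping cert verification only for a quick forensic experiment.
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Println("checkpoint response:", resp.Status)
}
```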

markmandel avatar Jun 06 '25 07:06 markmandel

You can always run your own sidecar that watches its owning GameServer for state changes through the Kubernetes API and does whatever you want with your app when it changes to Unhealthy?

unlightable avatar Jul 03 '25 11:07 unlightable

This issue is marked as stale due to inactivity for more than 30 days. To avoid being marked as stale, please add the 'awaiting-maintainer' label or add a comment. Thank you for your contributions.

github-actions[bot] avatar Aug 15 '25 10:08 github-actions[bot]

I found another way of dealing with this.

I've created a game server observer pod, which uses informers to watch Agones events, basically doing kubectl -n gs get events --field-selector type=Warning,involvedObject.kind=GameServer,reportingComponent=gameserver-sidecar. If a matching event is found, we crash the game server process by installing an ephemeral container whose sole job is to send a SIGILL to the game server process.
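A minimal sketch of that watch with client-go (a plain watch rather than a full informer, for brevity; the namespace and field selector are copied from the command above, and the ephemeral-container step is only indicated by a comment):

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Same filter as the kubectl command above.
	sel := "type=Warning,involvedObject.kind=GameServer,reportingComponent=gameserver-sidecar"
	w, err := cs.CoreV1().Events("gs").Watch(context.Background(),
		metav1.ListOptions{FieldSelector: sel})
	if err != nil {
		log.Fatal(err)
	}

	for ev := range w.ResultChan() {
		// On a matching event: attach an ephemeral debug container to the
		// pod and send SIGILL to the game server process from there.
		log.Printf("gameserver warning event: %v", ev.Object)
	}
}
```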

May not be pretty, but it got us around the issue.

swermin avatar Aug 17 '25 16:08 swermin

I'll close this issue then, unless anyone thinks this feature is still a good thing to have?

swermin avatar Aug 17 '25 16:08 swermin