
FleetAutoscaler keeps alive all TLS connections permanently causing memory leak on webhook server

Open craftyc0der opened this issue 4 years ago • 2 comments

What happened: Over time, the HTTPS server that hosts my FleetAutoscaler webhook runs out of memory (OOM). The cause is thousands of sockets on the server that never close. This does NOT happen when I call the endpoint with cURL or a browser. It only happens when Agones calls the endpoint.

/app $ lsof -p $PID | grep socket
...
1       /app/zeus-rest  socket:[289294771]
1       /app/zeus-rest  socket:[289294783]
1       /app/zeus-rest  socket:[289292336]
1       /app/zeus-rest  socket:[289291653]
1       /app/zeus-rest  socket:[289291654]
1       /app/zeus-rest  socket:[289293769]
1       /app/zeus-rest  socket:[289294780]
...

/app $ lsof -p $PID | grep socket | wc -l
6397
/app $ lsof -p $PID | grep socket | wc -l
6403
/app $ lsof -p $PID | grep socket | wc -l
6418

What you expected to happen:

I expect that when Agones calls the FleetAutoscaler webhook, it either reuses the TLS client or disconnects it. Keeping a connection alive and then opening a new one anyway seems naughty.

How to reproduce it (as minimally and precisely as possible): Create a TLS FleetAutoscaler endpoint with keepalive turned on and no timeout specified and watch the sockets multiply.

Anything else we need to know?: I suspect that this could be repaired by adding the following to pkg/fleetautoscalers/fleetautoscalers.go:

 var client = http.Client{
 	Timeout: 15 * time.Second,
+	Transport: &http.Transport{
+		DisableKeepAlives: true,
+	},
 }

I fixed it by disabling KeepAlive on the server side. But it took me several hours to figure out the problem because I could not reproduce it with any clients of my own.

Environment:

  • Agones version: 1.16
  • Kubernetes version (use kubectl version): 1.21
  • Cloud provider or hardware configuration: EKS and Minikube
  • Install method (yaml/helm): helm
  • Troubleshooting guide log(s):
  • Others:

craftyc0der, Sep 25 '21 15:09

Are you using a webhook autoscaler (as described in https://agones.dev/site/docs/getting-started/create-webhook-fleetautoscaler/)?

roberthbailey, Sep 25 '21 23:09

The change we should probably make is here: https://github.com/googleforgames/agones/blob/main/pkg/fleetautoscalers/fleetautoscalers.go#L118-L121

	client.Transport = &http.Transport{
		TLSClientConfig: &tls.Config{
			RootCAs: rootCAs,
		},
	}

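Rather than recreating the Transport, we should overwrite only the TLSClientConfig.RootCAs value, so that the cached TCP connections are kept. Each http.Transport owns its own pool of idle connections, so a new Transport can never reuse the connections that the previous one opened.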

More details: https://pkg.go.dev/net/http#Transport

markmandel, Sep 27 '21 16:09