FleetAutoscaler keeps alive all TLS connections permanently causing memory leak on webhook server
What happened:
Over time, my HTTPS server, which hosts the FleetAutoscaler webhook, runs out of memory (OOM). The cause is thousands of sockets on the server that are never closed. This does NOT happen when I call the endpoint with cURL or a browser; it only happens when Agones calls it.
/app $ lsof -p $PID | grep socket
...
1 /app/zeus-rest socket:[289294771]
1 /app/zeus-rest socket:[289294783]
1 /app/zeus-rest socket:[289292336]
1 /app/zeus-rest socket:[289291653]
1 /app/zeus-rest socket:[289291654]
1 /app/zeus-rest socket:[289293769]
1 /app/zeus-rest socket:[289294780]
...
/app $ lsof -p $PID | grep socket | wc -l
6397
/app $ lsof -p $PID | grep socket | wc -l
6403
/app $ lsof -p $PID | grep socket | wc -l
6418
What you expected to happen:
I expect that when Agones calls the FleetAutoscaler webhook, it either reuses the existing TLS connection or closes it. Keeping the old connection alive and then opening a new one seems naughty.
How to reproduce it (as minimally and precisely as possible): Create a TLS FleetAutoscaler endpoint with keepalive turned on and no timeout specified and watch the sockets multiply.
Anything else we need to know?:
I suspect this could be fixed by adding the lines marked +++ to
pkg/fleetautoscalers/fleetautoscalers.go:
var client = http.Client{
Timeout: 15 * time.Second,
+++ Transport: &http.Transport{
+++ DisableKeepAlives: true,
+++ },
}
I fixed it by disabling KeepAlive on the server side. But it took me several hours to figure out the problem because I could not reproduce it with any clients of my own.
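For anyone hitting the same thing, here is a minimal sketch of the server-side workaround, assuming the webhook is served by Go's net/http (the issue does not say what the server is written in). The function name newWebhookServer, the /scale path and the timeout values are hypothetical and only for illustration. It shows two knobs: turning keep-alives off entirely, and setting an IdleTimeout so that idle connections are eventually closed even when the client never reuses them.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newWebhookServer builds an *http.Server for a webhook endpoint.
// The name and the timeout values are illustrative assumptions.
func newWebhookServer(addr string, handler http.Handler, keepAlive bool) *http.Server {
	srv := &http.Server{
		Addr:    addr,
		Handler: handler,
		// Bound how long a client may take to send request headers.
		ReadHeaderTimeout: 10 * time.Second,
		// Close keep-alive connections that sit idle for this long, so
		// abandoned client connections do not pile up forever.
		IdleTimeout: 30 * time.Second,
	}
	// With keepAlive == false the server closes each connection after
	// one response, which is the workaround described above.
	srv.SetKeepAlivesEnabled(keepAlive)
	return srv
}

func main() {
	mux := http.NewServeMux()
	// "/scale" is a placeholder path for the webhook handler.
	mux.HandleFunc("/scale", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	srv := newWebhookServer(":8443", mux, false)
	fmt.Printf("addr=%s idleTimeout=%s\n", srv.Addr, srv.IdleTimeout)

	// To actually serve, supply a certificate and key, for example:
	//   log.Fatal(srv.ListenAndServeTLS("tls.crt", "tls.key"))
}
```

Setting only IdleTimeout keeps connection reuse for well-behaved clients while still bounding the number of idle sockets, so it may be preferable to disabling keep-alives outright.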
Environment:
- Agones version: 1.16
- Kubernetes version (use kubectl version): 1.21
- Cloud provider or hardware configuration: EKS and Minikube
- Install method (yaml/helm): helm
- Troubleshooting guide log(s):
- Others:
Are you using a webhook autoscaler (as described in https://agones.dev/site/docs/getting-started/create-webhook-fleetautoscaler/)?
The code we probably need to change is here: https://github.com/googleforgames/agones/blob/main/pkg/fleetautoscalers/fleetautoscalers.go#L118-L121
client.Transport = &http.Transport{
TLSClientConfig: &tls.Config{
RootCAs: rootCAs,
},
Rather than recreating the Transport on every call, we should overwrite only the TLSClientConfig.RootCAs value, so that the cached TCP connections are preserved and reused.
More details: https://pkg.go.dev/net/http#Transport
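To illustrate why recreating the Transport matters, here is a small self-contained sketch. It is not the Agones code. It starts a throwaway TLS server with net/http/httptest, counts the connections the server accepts, and compares a client that gets a new Transport before every request with one that keeps a single Transport. The helper name countConns is hypothetical.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// countConns sends n sequential GET requests to a temporary TLS server and
// returns how many TCP connections the server accepted. When shared is
// false, the client's Transport is replaced before every request.
func countConns(n int, shared bool) int64 {
	var opened int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			io.WriteString(w, "ok")
		}))
	// Count every newly accepted connection on the server side.
	srv.Config.ConnState = func(c net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&opened, 1)
		}
	}
	srv.StartTLS()
	defer srv.Close()

	// Trust the test server's self-signed certificate.
	rootCAs := x509.NewCertPool()
	rootCAs.AddCert(srv.Certificate())
	newTransport := func() *http.Transport {
		return &http.Transport{
			TLSClientConfig: &tls.Config{RootCAs: rootCAs},
		}
	}

	client := http.Client{Transport: newTransport()}
	for i := 0; i < n; i++ {
		if !shared {
			// A fresh Transport has an empty idle pool, so the previous
			// keep-alive connection is orphaned, not reused.
			client.Transport = newTransport()
		}
		resp, err := client.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		// Drain and close the body so the connection can be reused.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return atomic.LoadInt64(&opened)
}

func main() {
	fmt.Println("new Transport per request:", countConns(5, false))
	fmt.Println("one shared Transport:     ", countConns(5, true))
}
```

One caveat on the design: the crypto/tls documentation says a tls.Config must not be modified after it has been passed to a TLS function, so a safe variant of "overwrite RootCAs" may be to build a new Transport only when the CA bundle actually changes and call CloseIdleConnections on the old one.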