kiam-server failing health checks and getting restarted
We are seeing kiam-server version 3.2 intermittently fail its health checks. We have already relaxed the health check settings to the following, which helped somewhat, but we still get too many unexpected failures across our fleet.
Liveness: exec [/kiam health --cert=/etc/kiam/tls/server.pem --key=/etc/kiam/tls/server-key.pem --ca=/etc/kiam/tls/ca.pem --server-address=localhost:443 --gateway-timeout-creation=5s --timeout=20s] delay=30s timeout=10s period=10s #success=1 #failure=3
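For reference, here is roughly how that probe maps back into the server manifest; this is reconstructed from the describe output above, so treat it as a sketch rather than a copy of our actual spec:

```yaml
# Reconstructed from the kubectl describe output above; a sketch, not the exact manifest.
livenessProbe:
  exec:
    command:
    - /kiam
    - health
    - --cert=/etc/kiam/tls/server.pem
    - --key=/etc/kiam/tls/server-key.pem
    - --ca=/etc/kiam/tls/ca.pem
    - --server-address=localhost:443
    - --gateway-timeout-creation=5s
    - --timeout=20s
  initialDelaySeconds: 30
  timeoutSeconds: 10
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3
```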
Here is what it looks like when a pod gets killed for failing health checks:
The resources are set to:
Limits:
  cpu:     1
  memory:  1Gi
Requests:
  cpu:     200m
  memory:  100Mi
But usage remains well below our requests, so I don't think resources are the issue.
Where should I start looking to troubleshoot this health check issue? When kiam-server starts crashing, credential requests fail, and those failures cascade directly into our applications.
Thanks for any help you can provide!
What do you see in the logs? I'd check all the logs from the Server and Agent to understand whether the processes are starting successfully.
Ah, just seen that you're only noticing it for intermittent requests. Any other tips, @uswitch/cloud, on things to check?
The logs appear fine on startup. I don't have any on hand, but I've seen it start before and didn't notice any warnings.
Randomly, sometimes after months, we will see kiam-server get shut down, without being evicted, for failing health checks.
When this happens, our apps fail some requests, which causes customers to call our NOC, which then causes my team to be woken up at 2AM :-(. Looking right now, I can already see one restart on 2 of the 3 pods we have running in our test cluster after just 40 hours.
Does kiam-server properly start failing readiness checks when it gets a SIGINT from the OS during eviction? If we could make the shutdown and rolling update process seamless, that would also resolve the issue.
Well, the server process should catch either SIGINT or SIGTERM and trigger the gRPC server to stop:
https://github.com/uswitch/kiam/blob/master/cmd/kiam/server.go#L92
which in turn calls GracefulStop
https://github.com/uswitch/kiam/blob/master/pkg/server/server.go#L303.
Additionally, agents should be using the client-side gRPC load-balancer to resolve the currently running server processes and retry requests across them. Do you have any agent logs showing that this process is failing?
Could you confirm the flags you pass to the agent for the server's address please?
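For comparison, the agent normally points --server-address at a DNS name that resolves to all of the running server pods, so the client-side load balancer can spread and retry requests across them. A minimal sketch; the Service name, port and image tag below are assumptions, not values from your cluster:

```yaml
# Illustrative agent container spec; the Service name, port and image tag are
# assumptions, not taken from this issue. The key point is that --server-address
# should be a DNS name resolving to every running kiam-server pod.
containers:
- name: kiam-agent
  image: quay.io/uswitch/kiam:v3.2     # assumed tag
  command:
  - /kiam
  args:
  - agent
  - --server-address=kiam-server:443   # hypothetical Service fronting the server pods
  # ...plus whatever TLS and logging flags you already pass
```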
The Prometheus metrics documented in https://github.com/uswitch/kiam/blob/master/docs/METRICS.md should also help diagnose whether something is misbehaving within the gRPC subsystem, via grpc_server_handled_total and grpc_client_handled_total.
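As a rough illustration, a Prometheus alerting rule over the server-side metric could flag bursts of failed RPCs; the grpc_code and grpc_method labels follow the usual go-grpc-prometheus conventions, so treat them as an assumption rather than something confirmed here:

```yaml
# Sketch of a Prometheus alerting rule over kiam's gRPC server metrics.
# Label names (grpc_code, grpc_method) are assumed from go-grpc-prometheus defaults.
groups:
- name: kiam
  rules:
  - alert: KiamServerGRPCErrors
    expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) by (grpc_method) > 0
    for: 5m
    annotations:
      summary: "kiam-server returning non-OK gRPC responses for {{ $labels.grpc_method }}"
```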
Hi, there is only one master node in our Kubernetes architecture, so there is only one kiam-server pod running on the master node. Sometimes the kiam-server pod fails its health checks and keeps restarting, and during that time some of our pods go down because they can't get credentials from AWS. Can we run kiam-server on worker nodes as well? Is there any solution for this?
@vdoodala you could probably put kiam servers on a few dedicated nodes; you should not give all nodes such wide-scope IAM access.