
kiam-server failing health checks and getting restarted

Open integrii opened this issue 5 years ago • 6 comments

We are seeing kiam-server version 3.2 intermittently fail its health checks. We have relaxed the health check settings to the following, which helped somewhat, but we still get too many unexpected failures across our fleet.

    Liveness:   exec [/kiam health --cert=/etc/kiam/tls/server.pem --key=/etc/kiam/tls/server-key.pem --ca=/etc/kiam/tls/ca.pem --server-address=localhost:443 --gateway-timeout-creation=5s --timeout=20s] delay=30s timeout=10s period=10s #success=1 #failure=3
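For anyone reading along, that probe corresponds to roughly the following livenessProbe stanza in the kiam-server container spec (a sketch reconstructed from the kubectl describe output above, so double-check it against your actual manifest):

    # Sketch of the livenessProbe implied by the probe output above.
    livenessProbe:
      exec:
        command:
        - /kiam
        - health
        - --cert=/etc/kiam/tls/server.pem
        - --key=/etc/kiam/tls/server-key.pem
        - --ca=/etc/kiam/tls/ca.pem
        - --server-address=localhost:443
        - --gateway-timeout-creation=5s
        - --timeout=20s
      initialDelaySeconds: 30
      timeoutSeconds: 10
      periodSeconds: 10
      successThreshold: 1
      failureThreshold: 3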

Here is what it looks like when it's turned off because of failing health checks:

[screenshot: pod status showing kiam-server restarted after failing health checks]

The resources are set to:

    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      200m
      memory:   100Mi

But usage remains well below our requests, so I don't think that's the issue:

[screenshot: resource usage graphs showing CPU and memory well below requests]

Where should I start looking to troubleshoot this health check issue? When kiam-server starts crashing, we have KIAM failures, which cascade directly into application failures.

Thanks for any help you can provide!

integrii avatar Apr 25 '19 22:04 integrii

What do you see in the logs? I'd check all the logs from the Server and Agent to understand whether the processes are starting successfully.

pingles avatar May 01 '19 10:05 pingles

Ah, just seen that you're only noticing it on intermittent requests. Any other tips, @uswitch/cloud, on things to check?

pingles avatar May 01 '19 10:05 pingles

The logs appear fine on startup. I don't have any on hand, but I've seen it start before and didn't notice any warnings.

Randomly, sometimes after months, we will see kiam-server get shut down, without being evicted, for failing health checks.

When this happens, our apps fail some requests, which causes customers to call our NOC, which then causes my team to be woken up at 2AM :-(. If I look right now, you can already see one restart on 2 of the 3 pods we have running in our test cluster after just 40 hours.

[screenshot: pod list showing one restart on 2 of the 3 kiam-server pods]

Does kiam-server properly start failing readiness checks when it gets a SIGINT from the OS during eviction? If we could make the shutdown and rolling update process seamless, that would also resolve the issue.

integrii avatar May 01 '19 18:05 integrii

Well, the server process should catch either SIGINT or SIGTERM and trigger the gRPC server to stop:

https://github.com/uswitch/kiam/blob/master/cmd/kiam/server.go#L92

which in turn calls GracefulStop https://github.com/uswitch/kiam/blob/master/pkg/server/server.go#L303.
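If you want extra protection during rolling updates specifically, one common Kubernetes-side mitigation (not kiam-specific; whether the server image ships a shell is an assumption you'd need to verify) is a short preStop sleep, so the pod is removed from endpoints before SIGTERM arrives:

    # Container spec fragment: delay SIGTERM briefly during shutdown.
    lifecycle:
      preStop:
        exec:
          # Assumes the image contains /bin/sh; the 10s delay is illustrative.
          command: ["/bin/sh", "-c", "sleep 10"]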

Additionally, agents should be using the client-side gRPC load-balancer to resolve currently running server processes and retry requests across those. Do you have any agent log data showing that this process is failing?

Could you confirm the flags you pass to the agent for the server's address please?

The Prometheus metrics documented in https://github.com/uswitch/kiam/blob/master/docs/METRICS.md should also help diagnose whether something is misbehaving within the gRPC subsystem, in particular grpc_server_handled_total and grpc_client_handled_total.
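As a sketch, you could wire those metrics into an alert along these lines; the grpc_code label follows the usual go-grpc-prometheus conventions, so verify the label names against your actual metrics first:

    # Hypothetical Prometheus alerting rule over the gRPC metrics above.
    groups:
    - name: kiam
      rules:
      - alert: KiamGrpcServerErrors
        # Fires when the kiam gRPC server keeps returning non-OK responses.
        expr: 'sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) > 0'
        for: 10m
        annotations:
          summary: kiam gRPC server is returning non-OK responses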

pingles avatar May 14 '19 12:05 pingles

Hi, there is only one master node in our Kubernetes architecture, so there is only one kiam-server pod, running on the master node. Sometimes the kiam-server pod fails its health checks and keeps restarting; during that time some of our pods go down because they can't get credentials from AWS. Can we run kiam-server on worker nodes as well? Is there any solution for this?

vdoodala avatar Jun 11 '19 21:06 vdoodala

@vdoodala you could probably put kiam servers on a few dedicated nodes; you should not give all nodes such wide-scope IAM access.
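A sketch of how that pinning could look, assuming a hypothetical label/taint pair (the key names below are made up; pick ones that suit your cluster):

    # Pod spec fragment: pin kiam-server to dedicated, tainted nodes.
    spec:
      nodeSelector:
        node-role.example.com/kiam-server: "true"
      tolerations:
      - key: node-role.example.com/kiam-server
        operator: Exists
        effect: NoSchedule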

spinus avatar Apr 14 '20 09:04 spinus