cluster liveness check fails

Open nicks opened this issue 2 years ago • 13 comments

Current Behavior

ahy in the slack channel reports that when they connect tilt to their remote cluster, it fails with:

Cluster status error: cluster did not pass liveness check

If they try to run the liveness check manually, they get

$ kubectl get --raw='/livez?verbose'
Error from server (NotFound): the server could not find the requested resource

It appears that livez was added in Kubernetes 1.16 and is not supported on their Rancher distro.

They confirmed that the /healthz check works, though.

Possible Solutions

Maybe we should only use /healthz? not sure what the additional benefit of using /livez is.

Alternatively, if we get a 404 from /livez, we could ignore it.

nicks avatar Nov 17 '22 16:11 nicks

@milas any chance you remember what the reasoning was behind the different health checks?

alternatively, maybe we just skip the health checks on older versions of kubernetes... https://kubernetes.io/docs/reference/using-api/health-checks/
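As a rough illustration of the version-gating idea, the Go sketch below decides from the server's reported major/minor version whether /livez is expected to exist. The helper name `hasLivez` is hypothetical, and the sketch assumes the version arrives as two strings where the minor part may carry a trailing "+", as some distributions report it.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// hasLivez (hypothetical helper) reports whether a server with the given
// major/minor version is expected to serve /livez, which this issue says was
// added in Kubernetes 1.16.
func hasLivez(major, minor string) (bool, error) {
	maj, err := strconv.Atoi(major)
	if err != nil {
		return false, fmt.Errorf("parsing major version %q: %w", major, err)
	}
	// Assumption: some distributions report the minor version as e.g. "20+".
	mnr, err := strconv.Atoi(strings.TrimSuffix(minor, "+"))
	if err != nil {
		return false, fmt.Errorf("parsing minor version %q: %w", minor, err)
	}
	return maj > 1 || (maj == 1 && mnr >= 16), nil
}

func main() {
	fmt.Println(hasLivez("1", "11"))  // prints: false <nil>
	fmt.Println(hasLivez("1", "20+")) // prints: true <nil>
}
```

A version check alone may not be enough: later in this thread, a k8s 1.20 cluster accessed through Rancher still returns 404 for /livez.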

nicks avatar Nov 17 '22 16:11 nicks

Used /livez because of this note from the health checks doc:

The healthz endpoint is deprecated (since Kubernetes v1.16), and you should use the more specific livez and readyz endpoints instead.


Alternatively, if we get a 404 from /livez, we could ignore it.

This seems reasonable. We could also try falling back to /readyz in this case.

milas avatar Nov 18 '22 18:11 milas

Running into this issue as well from our Rancher environment in:

❯ tilt version
v0.31.2, built 2023-02-10

We downgraded to the following version to continue to use tilt.

v0.28.1, built 2022-05-01

Browsing through the codebase, I believe a check for the 404 case could be implemented here. Additionally, can we fall back to /healthz as well? https://github.com/tilt-dev/tilt/blob/95a35874112c38057685a3342c4924c83e9d1b7b/internal/k8s/client.go#L765-L786

On the other hand, Rancher could be updated to include /livez or /readyz, because the Kubernetes documentation says:

Machines that check the healthz/livez/readyz of the API server should rely on the HTTP status code.

I believe this is where Rancher generates the listener for /healthz. https://github.com/rancher/rancher/blob/e2410e02494a5b4bd43c50d8d45ed7df5a3ad0a8/pkg/api/steve/health/health.go#L10-L19

atsai1220 avatar Mar 08 '23 03:03 atsai1220

@atsai1220 how do you downgrade tilt? currently facing the same issue

lewis-kori avatar Mar 10 '23 07:03 lewis-kori

@atsai1220 how do you downgrade tilt? currently facing the same issue

Navigate to the Release page of this repository and download from the Assets menu of your desired version.

Copy the URL for your operating system and retrieve the package:

wget https://github.com/tilt-dev/tilt/releases/download/v0.28.1/tilt.0.28.1.linux.x86_64.tar.gz

atsai1220 avatar Mar 16 '23 04:03 atsai1220

Any plan to add "/healthz" to the cluster API health checks? I'm working on k8s 1.20.15 via Rancher and am currently blocked from using the latest tilt version :(

MatanAmoyal1 avatar Sep 18 '23 14:09 MatanAmoyal1

@MatanAmoyal1 hmmm... /livez should work fine in k8s 1.20, are you sure you're not hitting some other issue / blocking it some other way?

nicks avatar Sep 18 '23 16:09 nicks

@nicks it looks like the same issue (k8s 1.20 via Rancher): /healthz works, but /livez does not.

➜ ~ kubectl proxy &
[1] 37020
➜ ~ Starting to serve on 127.0.0.1:8001

➜ ~ curl 127.0.0.1:8001/healthz
ok%
➜ ~ curl 127.0.0.1:8001/livez
404 page not found

MatanAmoyal1 avatar Sep 19 '23 05:09 MatanAmoyal1

@nicks any plan to merge this PR https://github.com/tilt-dev/tilt/pull/6065 ?

MatanAmoyal1 avatar Oct 03 '23 06:10 MatanAmoyal1

fwiw, i have been unable to reproduce this problem:

k3d cluster create -i rancher/k3s:v1.20.15-k3s1
kubectl get --raw='/livez?verbose'

seems to produce a valid healthcheck for me. is it possible that your devops team is blocking the kubernetes healthcheck routes?

nicks avatar Oct 26 '23 01:10 nicks

Unfortunately, I'm stuck on a version of OpenShift 3 (k8s v1.11), so I'm unable to use the current version of Tilt because the /livez endpoint is not present. Are there any plans to fix this issue? So far I've been using v0.28.1, and it works.

samuellvicente avatar Oct 26 '23 14:10 samuellvicente

We are using the k3s Kubernetes included with Rancher, and its /livez check is behind authentication: https://github.com/k3s-io/k3s/issues/3576#issuecomment-875041119

So, we are sadly also forced to fall back to an older tilt version...

Richie24 avatar Jan 02 '24 17:01 Richie24

@Richie24 the issue you pointed to is a 401 rather than the 404 reported in other comments, so it sounds like you're hitting a different problem. fwiw, tilt uses your kubectl credentials, so auth shouldn't affect things.

nicks avatar Jan 02 '24 19:01 nicks