`hcloud-csi-driver` container is coming up, but then failing the healthz check

Open maggie44 opened this issue 6 months ago • 1 comments

TL;DR

The hcloud-csi-driver container comes up, but then fails the healthz check. The logs suggest that everything is healthy:

time=2025-06-16T08:43:48.785Z level=DEBUG source=/home/runner/work/csi-driver/csi-driver/internal/metrics/metrics.go:36 msg="registering metrics with registry"
time=2025-06-16T08:43:48.785Z level=DEBUG source=/home/runner/work/csi-driver/csi-driver/internal/metrics/metrics.go:43 msg="registered metrics"
--- Request:
GET /v1/servers?name=nes1-zpj HTTP/1.1
Host: api.hetzner.cloud
User-Agent: csi-driver/2.15.0 hcloud-go/2.21.1
Authorization: REDACTED
Accept-Encoding: gzip

--- Response:
HTTP/2.0 200 OK
Access-Control-Allow-Credentials: true
Access-Control-Allow-Headers: X-Requested-With,Authorization,Content-Type
Access-Control-Allow-Methods: OPTIONS,GET,POST,PUT,PATCH,DELETE
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Link,X-Correlation-ID
Access-Control-Max-Age: 86400
Content-Type: application/json
Date: Mon, 16 Jun 2025 08:43:49 GMT
Link: <https://api.hetzner.cloud/v1/servers?name=nes1-zpj&page=1>; rel=last
Ratelimit-Limit: 3600
Ratelimit-Remaining: 3599
Ratelimit-Reset: 1750063429
Strict-Transport-Security: max-age=31536000; includeSubDomains
Vary: Accept-Encoding
X-Correlation-Id: f8cbc0953ef1ffea

...

time=2025-06-16T08:43:49.183Z level=DEBUG source=/home/runner/work/csi-driver/csi-driver/internal/app/app.go:257 msg="fetched server via server name from KUBE_NODE_NAME env var" server-id=65738927
time=2025-06-16T08:43:49.183Z level=DEBUG source=/home/runner/work/csi-driver/csi-driver/cmd/controller/main.go:56 msg="evaluated default location for volumes" location=fsn1

Yet from inside the container, the health check fails. Subsequently, Kubernetes keeps restarting the container:

curl -v http://localhost:9808/
* Host localhost:9808 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:9808...
* Connected to localhost (::1) port 9808
* using HTTP/1.x
> GET / HTTP/1.1
> Host: localhost:9808
> User-Agent: curl/8.14.1
> Accept: */*
> 
* Request completely sent off
* Recv failure: Connection reset by peer
* closing connection #0

curl -v http://localhost:9808/metrics
* Host localhost:9808 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:9808...
* Connected to localhost (::1) port 9808
* using HTTP/1.x
> GET /metrics HTTP/1.1
> Host: localhost:9808
> User-Agent: curl/8.14.1
> Accept: */*
> 
* Request completely sent off
* Recv failure: Connection reset by peer
* closing connection #0

How can I see what the healthz endpoint is checking, and get accurate logs for why it is failing?

Expected behavior

A healthy status, or at least an error reporting why it is not healthy.

Observed behavior

Kubernetes triggering a restart loop.

Minimal working example

No response

Additional information

No response

maggie44, Jun 16 '25 10:06

Hi, could you please provide the steps to reproduce the issue? Details such as the CSI driver configuration, the volume setup, the output of kubectl -n kube-system describe deployments.apps -l app.kubernetes.io/name=hcloud-csi, and the complete log output from the hcloud-csi-driver pod would be helpful.

Port 9808 is the port used by the livenessProbe of the hcloud-csi-driver container in the hcloud-csi-controller Pod. The probe queries the /healthz endpoint:

curl http://localhost:9808/healthz

This endpoint is provided by the kubernetes-csi/livenessprobe sidecar, which checks whether the CSI driver's Unix socket answers the Probe() gRPC call.
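To tell "the probe answered unhealthy" apart from "the connection was dropped before any HTTP response", as in the curl output earlier in this thread, something like the following might help. It is only a rough sketch under assumptions: it needs Python 3 running somewhere that shares the Pod's network namespace (for example a debug container, since the driver image may not ship Python), and the function name `probe` is made up for illustration. It requests /healthz on port 9808 over both IPv4 and IPv6 loopback and prints either the HTTP status or the kind of connection failure.

```python
import http.client
import socket


def probe(host: str, port: int = 9808, path: str = "/healthz",
          timeout: float = 3.0) -> str:
    """Request `path` once and describe the outcome in one line.

    `probe` is a hypothetical helper, not part of the CSI driver or
    the livenessprobe sidecar.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        # Read only a short prefix of the body; health endpoints reply briefly.
        body = resp.read(200).decode("utf-8", errors="replace").strip()
        return f"HTTP {resp.status} {body!r}"
    except ConnectionRefusedError:
        return "connection refused (nothing is listening on this address)"
    except ConnectionResetError:
        # Also covers http.client.RemoteDisconnected, raised when the peer
        # closes the connection without sending any response.
        return "connection reset or closed before an HTTP response arrived"
    except (socket.timeout, TimeoutError):
        return "timed out"
    except (OSError, http.client.HTTPException) as exc:
        return f"error: {exc!r}"
    finally:
        conn.close()


if __name__ == "__main__":
    # curl in the report connected via ::1, so check both address families.
    for host in ("127.0.0.1", "::1"):
        print(f"{host}: {probe(host)}")
```

If the two address families give different results, that would suggest checking which address the listener on 9808 is actually bound to. A reset on both would point more toward whatever process is holding the port than toward the result of the Probe() call itself.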

lukasmetzner, Jun 18 '25 05:06

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

github-actions[bot], Sep 22 '25 12:09