DaemonSet in CrashLoopBackOff on OpenShift

Open itmwiw opened this issue 2 years ago • 9 comments

Hello, I have an OpenShift cluster and I am trying to use the hetznercloud csi-driver. However, all of the DaemonSet's pods are in CrashLoopBackOff state. Here are the logs:

[pod/hcloud-csi-node-45xqq/hcloud-csi-driver] level=error ts=2023-04-11T14:33:12.085976239Z msg="failed to fetch server ID from metadata service" err="Get \"http://169.254.169.254/hetzner/v1/metadata/instance-id\": dial tcp 169.254.169.254:80: connect: connection refused"

I guess this is related to what is described in https://github.com/hetznercloud/csi-driver/issues/143. That issue was closed because version 1.6.0 attempts to use the environment variable HCLOUD_SERVER_ID or KUBE_NODE_NAME with a call to HCloudClient before falling back to the MetadataClient. However, v2.2.0 doesn't do that anymore, so I guess the issue is back. Can you help me with this? Regards, Tarik

itmwiw avatar Apr 11 '23 17:04 itmwiw

Hey, this was changed in #269, so we can remove access to the Hetzner Cloud API from the daemon set. We would prefer to keep the daemon set ("node" binary) as small as possible, so adding back access to the API is not what we want.

@samcday Do you have an idea how we can solve this for OpenShift where access to the metadata service is blocked?

apricote avatar Apr 12 '23 06:04 apricote

Oh, forgot to mention. The Server ID and Location, which are the two fields retrieved from the Metadata Service are used in the response to NodeGetInfo: https://github.com/hetznercloud/csi-driver/blob/cbb7750af17224e256fcb62da5358a9743080a9f/driver/node.go#L194-L205

apricote avatar Apr 12 '23 06:04 apricote

Hm. Tricky one. My original hope was to use k8s Node metadata as source of truth for this, thus tying csi-driver to hccm. But of course that violates the CSI abstraction and won't work for other container orchestrators.

Ultimately, if we want to determine this information from a particular node without assuming any access to a control plane / orchestrator API of any kind, we can only fetch it from the metadata service, or fall back to statically provided information.

... Or we just add back the HCLOUD_TOKEN requirement for the node binary, so that it can fetch this info from the API. That would be a bummer from a purist technical point of view, but maybe it's the only way we can keep the CSI driver running reliably (and reasonably ergonomically!) across multiple orchestrators.

samcday avatar Apr 12 '23 09:04 samcday

One other somewhat hacky idea: we could do the metadata API lookup in a small initContainer that uses hostNetwork: true and then pass that information along to the main (not host-networking) process.

samcday avatar Apr 12 '23 09:04 samcday

> One other somewhat hacky idea: we could do the metadata API lookup in a small initContainer that uses hostNetwork: true and then pass that information along to the main (not host-networking) process.

Perhaps this is something that can be done only for Openshift through the Helm Chart?

apricote avatar Apr 12 '23 09:04 apricote

> Perhaps this is something that can be done only for Openshift through the Helm Chart?

Yes, that sounds good :+1: Or even more generally: just a thing that you can opt into through values.yaml: helm install csi-driver --set initMetadataLookup=true or somesuch.


That said, it might just be better to always do it that way and keep the number of different deployment modes to a minimum. With such an approach, the node binary could remove all notion of HC API or metadata service, and require that all necessary metadata/topology info is injected through env. Some of this env comes from downward API, the rest comes from this proposed init container.
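
As a sketch of that env-only variant (again an illustration, not what the driver does today): the node binary would refuse to start unless everything it needs is present in the environment. HCLOUD_SERVER_ID is the name mentioned earlier in this thread; HCLOUD_SERVER_LOCATION and the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// nodeInfo holds the two fields that NodeGetInfo needs, per the comment above.
type nodeInfo struct {
	ServerID int64
	Location string
}

// loadNodeInfo (hypothetical helper) reads all node metadata from the
// environment and fails if anything is missing or malformed.
func loadNodeInfo(getenv func(string) string) (nodeInfo, error) {
	rawID := getenv("HCLOUD_SERVER_ID")
	if rawID == "" {
		return nodeInfo{}, fmt.Errorf("HCLOUD_SERVER_ID is not set")
	}
	id, err := strconv.ParseInt(rawID, 10, 64)
	if err != nil {
		return nodeInfo{}, fmt.Errorf("invalid HCLOUD_SERVER_ID %q: %w", rawID, err)
	}
	// HCLOUD_SERVER_LOCATION is a made-up variable name for this sketch.
	loc := getenv("HCLOUD_SERVER_LOCATION")
	if loc == "" {
		return nodeInfo{}, fmt.Errorf("HCLOUD_SERVER_LOCATION is not set")
	}
	return nodeInfo{ServerID: id, Location: loc}, nil
}

func main() {
	info, err := loadNodeInfo(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("server_id=%d location=%s\n", info.ServerID, info.Location)
}
```

With this shape the node binary has a single code path, and whoever deploys it (init container, downward API, or a hand-written manifest) decides where the values come from.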

samcday avatar Apr 12 '23 12:04 samcday

I have the same issue in Openshift.

alrf avatar Apr 19 '23 11:04 alrf

I solved it in v2.3.2 by setting Topology=false here: https://github.com/hetznercloud/csi-driver/blob/dfe6183f4d0fddeefdff8069b1c09eeb38113b33/deploy/kubernetes/hcloud-csi.yml#L225 and adding hostNetwork: true to the DaemonSet at line 298: https://github.com/hetznercloud/csi-driver/blob/dfe6183f4d0fddeefdff8069b1c09eeb38113b33/deploy/kubernetes/hcloud-csi.yml#L298

alrf avatar May 05 '23 13:05 alrf

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

github-actions[bot] avatar Aug 04 '23 12:08 github-actions[bot]