
Error in IPv6 environment

Open zejar opened this issue 6 years ago • 8 comments

The hcloud CSI driver does not seem to work in my IPv6 Kubernetes cluster. The cluster was set up with kubeadm, and Calico is configured as the CNI with an IPv6 pool. Routing works; containers can reach the IPv4 internet via a NAT64 gateway. The only remaining issue is persistent storage, for which I wanted to use the hcloud CSI driver. I first configured the driver with the hcloud API secret as described in the README, then applied https://raw.githubusercontent.com/kubernetes/csi-api/release-1.14/pkg/crd/manifests/csidriver.yaml and https://raw.githubusercontent.com/kubernetes/csi-api/release-1.14/pkg/crd/manifests/csinodeinfo.yaml. The setup was finished off with https://raw.githubusercontent.com/hetznercloud/csi-driver/v1.2.2/deploy/kubernetes/hcloud-csi.yml (I set the metrics host to ":::9189" instead of the default "0.0.0.0:9189", but this does not seem to matter; deploying with the default metrics host fails the same way). When starting the hcloud CSI driver I get hit with this error:

Notebook:~ zejar$ kubectl -n kube-system describe pods hcloud-csi-controller-0
Name:               hcloud-csi-controller-0
Namespace:          kube-system
Priority:           0
PriorityClassName:  <none>
Node:               worker-2/fd86:ea04:1111::12
Start Time:         Thu, 09 Jan 2020 07:37:13 +0100
Labels:             app=hcloud-csi-controller
                    controller-revision-hash=hcloud-csi-controller-7869579c4d
                    statefulset.kubernetes.io/pod-name=hcloud-csi-controller-0
Annotations:        cni.projectcalico.org/podIP: fd00:1234::3:88f1/128
                    cni.projectcalico.org/podIPs: fd00:1234::3:88f1/128
Status:             Running
IP:                 fd00:1234::3:88f1
Controlled By:      StatefulSet/hcloud-csi-controller
Containers:
  csi-attacher:
    Container ID:  docker://a87310584ca60d2813cec0f4c93af10002d16580cc6a448d2a050937be3d99f4
    Image:         quay.io/k8scsi/csi-attacher:v1.2.1
    Image ID:      docker-pullable://quay.io/k8scsi/csi-attacher@sha256:9125ce3c5c2ecfb5e17631190a3c839694b08cec172dd3da40d098a1b5eed89e
    Port:          <none>
    Host Port:     <none>
    Args:
      --csi-address=/var/lib/csi/sockets/pluginproxy/csi.sock
      --v=5
    State:          Running
      Started:      Thu, 09 Jan 2020 07:37:15 +0100
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from hcloud-csi-token-fzs9x (ro)
  csi-resizer:
    Container ID:  docker://3bece9a04d0e0be2ec11ae161053499828a735c5a2fd3522867d4fde5c03a105
    Image:         quay.io/k8scsi/csi-resizer:v0.3.0
    Image ID:      docker-pullable://quay.io/k8scsi/csi-resizer@sha256:eff2d6a215efd9450d90796265fc4d8832a54a3a098df06edae6ff3a5072b08f
    Port:          <none>
    Host Port:     <none>
    Args:
      --csi-address=/var/lib/csi/sockets/pluginproxy/csi.sock
      --v=5
    State:          Running
      Started:      Thu, 09 Jan 2020 07:37:15 +0100
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from hcloud-csi-token-fzs9x (ro)
  csi-provisioner:
    Container ID:  docker://7062801818abe94c5ae217df07d77a0627abb72059d9b1ce2a2a71155f90a4c6
    Image:         quay.io/k8scsi/csi-provisioner:v1.3.1
    Image ID:      docker-pullable://quay.io/k8scsi/csi-provisioner@sha256:d657c839dce87324fe2b677302913f9386f885f8746be7bea0ced5b0844e3433
    Port:          <none>
    Host Port:     <none>
    Args:
      --provisioner=csi.hetzner.cloud
      --csi-address=/var/lib/csi/sockets/pluginproxy/csi.sock
      --feature-gates=Topology=true
      --v=5
    State:          Running
      Started:      Thu, 09 Jan 2020 07:37:47 +0100
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Thu, 09 Jan 2020 07:37:16 +0100
      Finished:     Thu, 09 Jan 2020 07:37:46 +0100
    Ready:          True
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from hcloud-csi-token-fzs9x (ro)
  hcloud-csi-driver:
    Container ID:   docker://07e30448705c4607541ffe385eb4d958d2f6c0dfd74fcfc6eb67b70d08196c78
    Image:          hetznercloud/hcloud-csi-driver:1.2.2
    Image ID:       docker-pullable://hetznercloud/hcloud-csi-driver@sha256:c17cd36fbc4223d76824e164f0238393cd21e0cc9f8710d807b532fbd7f0f480
    Port:           9189/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 09 Jan 2020 07:58:47 +0100
      Finished:     Thu, 09 Jan 2020 07:58:47 +0100
    Ready:          False
    Restart Count:  9
    Environment:
      CSI_ENDPOINT:      unix:///var/lib/csi/sockets/pluginproxy/csi.sock
      METRICS_ENDPOINT:  :::9189
      HCLOUD_TOKEN:      <set to the key 'token' in secret 'hcloud-csi'>  Optional: false
    Mounts:
      /var/lib/csi/sockets/pluginproxy/ from socket-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from hcloud-csi-token-fzs9x (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  socket-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  hcloud-csi-token-fzs9x:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hcloud-csi-token-fzs9x
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  24m                   default-scheduler  Successfully assigned kube-system/hcloud-csi-controller-0 to worker-2
  Normal   Started    24m                   kubelet, worker-2  Started container csi-resizer
  Normal   Created    24m                   kubelet, worker-2  Created container csi-attacher
  Normal   Started    24m                   kubelet, worker-2  Started container csi-attacher
  Normal   Pulled     24m                   kubelet, worker-2  Container image "quay.io/k8scsi/csi-resizer:v0.3.0" already present on machine
  Normal   Created    24m                   kubelet, worker-2  Created container csi-resizer
  Normal   Pulled     24m                   kubelet, worker-2  Container image "quay.io/k8scsi/csi-attacher:v1.2.1" already present on machine
  Normal   Created    24m                   kubelet, worker-2  Created container csi-provisioner
  Normal   Started    24m                   kubelet, worker-2  Started container csi-provisioner
  Normal   Pulling    24m (x3 over 24m)     kubelet, worker-2  Pulling image "hetznercloud/hcloud-csi-driver:1.2.2"
  Normal   Started    24m (x3 over 24m)     kubelet, worker-2  Started container hcloud-csi-driver
  Normal   Pulled     24m (x3 over 24m)     kubelet, worker-2  Successfully pulled image "hetznercloud/hcloud-csi-driver:1.2.2"
  Normal   Created    24m (x3 over 24m)     kubelet, worker-2  Created container hcloud-csi-driver
  Normal   Pulled     24m (x2 over 24m)     kubelet, worker-2  Container image "quay.io/k8scsi/csi-provisioner:v1.3.1" already present on machine
  Warning  BackOff    4m54s (x96 over 24m)  kubelet, worker-2  Back-off restarting failed container

and

Notebook:~ zejar$ kubectl -n kube-system logs -f hcloud-csi-controller-0 hcloud-csi-driver
level=debug ts=2020-01-09T06:53:36.846178303Z msg="getting instance id from metadata service"
level=error ts=2020-01-09T06:53:36.846729779Z msg="failed to get instance id from metadata service" err="Get http://169.254.169.254/2009-04-04/meta-data/instance-id: dial tcp 169.254.169.254:80: connect: network is unreachable"
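For context, the failing call is just an HTTP GET against the link-local IPv4 metadata address from the log line above. A rough Python equivalent (illustrative only, not the driver's actual Go code):

```python
import urllib.request

# Metadata URL taken verbatim from the error log above; the address
# 169.254.169.254 is link-local IPv4 and has no IPv6 route on an
# IPv6-only node, hence "network is unreachable".
METADATA_URL = "http://169.254.169.254/2009-04-04/meta-data/instance-id"

def instance_id(timeout: float = 5.0) -> str:
    # On an IPv6-only node this connect() fails before any HTTP
    # exchange happens, matching the reported error.
    with urllib.request.urlopen(METADATA_URL, timeout=timeout) as resp:
        return resp.read().decode().strip()
```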

My Environment

  • kubectl get nodes:
Notebook:~ zejar$ kubectl get nodes -o wide
NAME       STATUS   ROLES    AGE   VERSION   INTERNAL-IP          EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION   CONTAINER-RUNTIME
master-1   Ready    master   43h   v1.17.0   fd86:ea04:1111::1    <none>        Debian GNU/Linux 10 (buster)   4.19.0-6-amd64   docker://19.3.5
worker-1   Ready    <none>   43h   v1.17.0   fd86:ea04:1111::11   <none>        Debian GNU/Linux 10 (buster)   4.19.0-6-amd64   docker://19.3.5
worker-2   Ready    <none>   18h   v1.17.0   fd86:ea04:1111::12   <none>        Debian GNU/Linux 10 (buster)   4.19.0-6-amd64   docker://19.3.5
worker-3   Ready    <none>   18h   v1.17.0   fd86:ea04:1111::13   <none>        Debian GNU/Linux 10 (buster)   4.19.0-6-amd64   docker://19.3.5
  • kubectl get pods --all-namespaces:
Notebook:~ zejar$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                       READY   STATUS             RESTARTS   AGE
default       test-shell-66f858c55f-8nt2z                1/1     Running            0          19h
kube-system   calico-kube-controllers-648f4868b8-pkm6c   1/1     Running            28         43h
kube-system   calico-node-99mvf                          1/1     Running            15         43h
kube-system   calico-node-m5mmn                          1/1     Running            4          19h
kube-system   calico-node-p4b8t                          1/1     Running            0          18h
kube-system   calico-node-p8bp7                          1/1     Running            0          18h
kube-system   coredns-6955765f44-mrrzj                   1/1     Running            14         43h
kube-system   coredns-6955765f44-x4pqt                   1/1     Running            14         43h
kube-system   etcd-master-1                              1/1     Running            429        43h
kube-system   hcloud-csi-controller-0                    3/4     CrashLoopBackOff   11         31m
kube-system   hcloud-csi-node-9wsdp                      1/2     CrashLoopBackOff   10         31m
kube-system   hcloud-csi-node-kjzpj                      1/2     CrashLoopBackOff   10         31m
kube-system   hcloud-csi-node-vqjnh                      1/2     CrashLoopBackOff   10         31m
kube-system   kube-apiserver-master-1                    1/1     Running            383        43h
kube-system   kube-controller-manager-master-1           1/1     Running            23         43h
kube-system   kube-proxy-7smnt                           1/1     Running            0          18h
kube-system   kube-proxy-b8gz8                           1/1     Running            15         43h
kube-system   kube-proxy-vs2qk                           1/1     Running            13         43h
kube-system   kube-proxy-xnfw2                           1/1     Running            0          18h
kube-system   kube-scheduler-master-1                    1/1     Running            23         43h
  • kubectl get services:
Notebook:~ zejar$ kubectl get services --all-namespaces
NAMESPACE     NAME                            TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)                  AGE
default       kubernetes                      ClusterIP   fd00:1234::1      <none>        443/TCP                  43h
kube-system   hcloud-csi-controller-metrics   ClusterIP   fd00:1234::e722   <none>        9189/TCP                 32m
kube-system   hcloud-csi-node-metrics         ClusterIP   fd00:1234::bca9   <none>        9189/TCP                 31m
kube-system   kube-dns                        ClusterIP   fd00:1234::a      <none>        53/UDP,53/TCP,9153/TCP   43h
  • kubectl get sc:
Notebook:~ zejar$ kubectl get sc
NAME                        PROVISIONER                                                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
hcloud-volumes (default)    csi.hetzner.cloud                                          Delete          WaitForFirstConsumer   true                   33m
  • OS (from /etc/os-release):
root@worker-1:~# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel (from uname -a):
root@worker-1:~# uname -a
Linux worker-1 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux

zejar avatar Jan 09 '20 07:01 zejar

This works as intended: the metadata service (which is needed to identify the node) is only available over IPv4.

LKaemmerling avatar Jan 09 '20 09:01 LKaemmerling

Hello there,

I'm hitting the same issue on my IPv6-only cluster.

@LKaemmerling I don't quite get why the metadata service is IPv4-only. Couldn't it listen on IPv6 as well?

lel-amri avatar May 27 '24 11:05 lel-amri

@lel-amri We might have a workaround that bypasses the metadata server altogether. It will be implemented in the near future.

jooola avatar May 27 '24 14:05 jooola

Hello @jooola,

Okay, thanks for the feedback on this. I'm eager to see a solution for this.

In the meantime, I'm using the following workaround:

My cluster is running K3s v1.29.2+k3s1 with Cilium 0.15.2 as the CNI. I patched the hcloud CSI driver to make the metadata endpoint configurable, then set up a DaemonSet that runs a socat instance in the host network namespace, listening on a ULA address and relaying connections to 169.254.169.254:80.

Here are the details in case they are of interest to someone.

Build the patched hcloud-csi Docker image:

  • Clone https://github.com/hetznercloud/csi-driver/commit/8538bfec8a750d18788356bb61e15e66c5e4a7ec

  • Build the image:

    mkdir docker-build-context
    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o ./docker-build-context/controller.bin ./cmd/controller/main.go
    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o ./docker-build-context/node.bin ./cmd/node/main.go
    podman build -f ./Dockerfile -t devnull.superlel.me/hetznercloud/hcloud-csi-driver:v2.7.0-with-custom-metadata-endpoint ./docker-build-context/
    
  • Load the image onto your nodes. I'm using podman image save --format oci-archive -o hcloud-csi.tar devnull.superlel.me/hetznercloud/hcloud-csi-driver:v2.7.0-with-custom-metadata-endpoint, followed by an scp of hcloud-csi.tar to every node, then a ctr image import hcloud-csi.tar on each node.

I used the following helm chart values:

controller:
  extraEnvVars:
    - name: "HCLOUD_METADATA_ENDPOINT"
      value: "http://[fd96:7b7a:e945:3:6d65:7461:6461:7461]:13752/hetzner/v1/metadata"
  image:
    hcloudCSIDriver:
      name: "devnull.superlel.me/hetznercloud/hcloud-csi-driver:v2.7.0-with-custom-metadata-endpoint"
node:
  extraEnvVars:
    - name: "HCLOUD_METADATA_ENDPOINT"
      value: "http://[fd96:7b7a:e945:3:6d65:7461:6461:7461]:13752/hetzner/v1/metadata"
  image:
    hcloudCSIDriver:
      name: "devnull.superlel.me/hetznercloud/hcloud-csi-driver:v2.7.0-with-custom-metadata-endpoint"
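The override these values rely on amounts to reading HCLOUD_METADATA_ENDPOINT from the environment, presumably falling back to the stock IPv4 endpoint when unset. A sketch of that resolution logic (the constant and function names are illustrative, not the patch's actual code):

```python
import os

# Stock Hetzner metadata base URL used when no override is configured
# (assumption: this is what the unpatched driver targets).
DEFAULT_METADATA_ENDPOINT = "http://169.254.169.254/hetzner/v1/metadata"

def metadata_endpoint() -> str:
    # The patched build prefers HCLOUD_METADATA_ENDPOINT when set,
    # which is what the helm values above configure.
    return os.environ.get("HCLOUD_METADATA_ENDPOINT", DEFAULT_METADATA_ENDPOINT)
```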

Finally, I've made a Docker image that embeds socat and a script to set up the "6 -> 4 proxy" for the Hetzner metadata service:

Dockerfile:

FROM docker.io/library/alpine:3.20

RUN apk add --no-cache socat
COPY --chmod=755 init.sh /init.sh
COPY --chmod=755 main.sh /main.sh
ENTRYPOINT ["/main.sh"]

init.sh:

#!/bin/sh
set -u
{ err=$(ip address add "$HETZNER_METADATA_SERVICE_PROXY64_LISTEN_ADDRESS"/128 dev cilium_host 2>&1 >&3 3>&-); } 3>&1
ret=$?
printf "%s\n" "$err" >&2
if [ $ret != 0 ] ; then
    case "$err" in
        *"RTNETLINK answers: File exists")
            exit 0
            ;;
    esac
fi
exit $ret

main.sh:

#!/bin/sh
exec socat -dd TCP6-LISTEN:"$HETZNER_METADATA_SERVICE_PROXY64_LISTEN_PORT",bind="$HETZNER_METADATA_SERVICE_PROXY64_LISTEN_ADDRESS",ipv6only=1,fork TCP4:169.254.169.254:80
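Conceptually the socat one-liner is a dumb TCP relay: accept on an IPv6 address, dial the IPv4 metadata address, and pump bytes in both directions. A minimal Python sketch of the same idea (illustrative only, not a replacement for socat):

```python
import socket
import threading

def pump(src: socket.socket, dst: socket.socket) -> None:
    # Copy bytes one way until EOF; one pump runs per direction.
    while True:
        data = src.recv(65536)
        if not data:
            break
        dst.sendall(data)
    dst.shutdown(socket.SHUT_WR)

def serve(listen_addr: str, listen_port: int) -> None:
    # Accept IPv6-only connections and forward each one to the
    # IPv4-only metadata service, like the socat command above.
    srv = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    srv.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 1)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((listen_addr, listen_port))
    srv.listen()
    while True:
        client, _ = srv.accept()
        upstream = socket.create_connection(("169.254.169.254", 80))
        threading.Thread(target=pump, args=(client, upstream), daemon=True).start()
        threading.Thread(target=pump, args=(upstream, client), daemon=True).start()
```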

I then built the image and pushed it to the nodes. Finally, I used the following DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hms-p64
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: hms-p64
  template:
    metadata:
      labels:
        name: hms-p64
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      initContainers:
        - name: ip-address-add
          env:
            - name: "HETZNER_METADATA_SERVICE_PROXY64_LISTEN_ADDRESS"
              value: "fd96:7b7a:e945:3:6d65:7461:6461:7461"
            - name: "HETZNER_METADATA_SERVICE_PROXY64_LISTEN_PORT"
              value: "13752"
          image: devnull.superlel.me/hetzner-metadata-service-proxy64:latest
          imagePullPolicy: Never
          command:
            - /init.sh
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 200Mi
          securityContext:
            capabilities:
              add:
                - NET_ADMIN
      containers:
        - name: hms-p64
          env:
            - name: "HETZNER_METADATA_SERVICE_PROXY64_LISTEN_ADDRESS"
              value: "fd96:7b7a:e945:3:6d65:7461:6461:7461"
            - name: "HETZNER_METADATA_SERVICE_PROXY64_LISTEN_PORT"
              value: "13752"
          image: devnull.superlel.me/hetzner-metadata-service-proxy64:latest
          imagePullPolicy: Never
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 200Mi
      hostNetwork: true

lel-amri avatar Jun 01 '24 17:06 lel-amri

Any news?

I deployed a single-stack IPv6 cluster (Talos) today and got the same "failed to get instance id from metadata service" error.

It seems dual-stack is the only way at the moment, with a private IPv4 network (required) and (at least) a public IPv6 interface; in this case metadata is retrieved via 100.64.0.0/10 (which comes with eth0).

Having something like fe80::a9fe:a9fe/128 as an IPv6 counterpart to 169.254.169.254 would solve so many issues! It would probably also eliminate the need for this CGNAT hack.

Edit: I forgot to mention the workaround at the time of posting. Setting node.hostNetwork to true "fixes" the issue. Curiously, the controller works on the pod network out of the box, but if you use an invalid token it prints the same misleading error as above.
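If it helps anyone, the corresponding helm values fragment for that workaround would look something like this (assuming the chart exposes a node.hostNetwork key, as described above):

```yaml
node:
  # Run the node plugin in the host network namespace so it can reach
  # 169.254.169.254 directly instead of via the pod network.
  hostNetwork: true
```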

miran248 avatar Oct 06 '24 15:10 miran248

Hello there, it looks like a4c985b9ca7180383723e9d514ad9b8f46006f15 could fix the issue.

Third edit: never mind, this does not fix the issue; I thought it was buried deeper in the business logic. Could we re-open this issue though?

lel-amri avatar Apr 26 '25 16:04 lel-amri

What about https://github.com/hetznercloud/csi-driver/compare/main...lel-amri:hcloud-csi:metadata-service-as-a-fallback?

lel-amri avatar Apr 27 '25 07:04 lel-amri

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.

github-actions[bot] avatar Aug 08 '25 13:08 github-actions[bot]