
[Release-1.27] - k3s etcd-snapshot save fails on host with IPv6 only

Open · brandond opened this issue · 3 comments

Backport of "Fix on-demand snapshots on ipv6-only nodes":

  • #9214

brandond · Feb 07 '24

Validated on Version:

  • v1.27.11+k3s-11b31c28 (11b31c28)
Environment Details

Infrastructure Cloud EC2 instance

Node(s) CPU architecture, OS, and Version: SUSE Linux Enterprise Server 15 SP4

Cluster Configuration: 1 server node

Steps to validate the fix

  1. Install k3s on an IPv6-only node, passing the server args via config.yaml rather than on the CLI:
     k3s.io/node-args: '["server","--cluster-cidr","2001:cafe:42::/56","--service-cidr","2001:cafe:43::/108","--cluster-init","true","--node-ip","2600:1f1c:ab4:ee32:c44c:a8b3:4319:dad7","--write-kubeconfig-mode","644"]'
  2. Validate that saving an etcd snapshot works
  3. Validate nodes and pods
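
The config.yaml implied by the node-args annotation above might look like the following. This is a sketch reconstructed from the annotation, not a file taken from the test environment; the node-ip value is specific to that EC2 instance:

```yaml
# /etc/rancher/k3s/config.yaml (sketch based on the node-args annotation above)
cluster-cidr: "2001:cafe:42::/56"
service-cidr: "2001:cafe:43::/108"
cluster-init: true
node-ip: "2600:1f1c:ab4:ee32:c44c:a8b3:4319:dad7"
write-kubeconfig-mode: "644"
```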

Reproduction Issue:

```
$ k3s -v
k3s version v1.29.1+k3s-8224a3a7 (8224a3a7)
go version go1.21.6

$ kubectl get node -o yaml | grep node-args
      k3s.io/node-args: '["server","--cluster-cidr","2001:cafe:42::/56","--service-cidr","2001:cafe:43::/108","--cluster-init","true","--node-ip","2600:1f1c:ab4:ee32:c44c:a8b3:4319:dad7","--write-kubeconfig-mode","644"]'

$ sudo k3s etcd-snapshot save
WARN[0000] Unknown flag --cluster-cidr found in config.yaml, skipping
WARN[0000] Unknown flag --service-cidr found in config.yaml, skipping
WARN[0000] Unknown flag --cluster-init found in config.yaml, skipping
WARN[0000] Unknown flag --write-kubeconfig-mode found in config.yaml, skipping
^C{"level":"warn","ts":"2024-02-15T19:39:34.996891Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00136e000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Canceled desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
{"level":"warn","ts":"2024-02-15T19:39:34.996862Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00136e000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Canceled desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
```
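
The failure above is the snapshot command's etcd client dialing the IPv4 loopback (127.0.0.1:2379) on a node where etcd only listens over IPv6 ([::1]:2379), so the connection is refused. A minimal Python sketch of family-aware loopback endpoint selection, including the brackets that IPv6 literals require in URLs. This is illustrative only, not the actual k3s code from the linked PR:

```python
import ipaddress

def loopback_endpoint(node_ip: str, port: int = 2379) -> str:
    """Return a loopback etcd endpoint matching the node IP's address family.

    Sketch only: it illustrates why an IPv6-only node must dial [::1]
    rather than 127.0.0.1, as the backported fix arranges.
    """
    if ipaddress.ip_address(node_ip).version == 6:
        # IPv6 literals must be bracketed when combined with a port in a URL
        return f"https://[::1]:{port}"
    return f"https://127.0.0.1:{port}"

# The node-ip from the report above selects the IPv6 loopback:
print(loopback_endpoint("2600:1f1c:ab4:ee32:c44c:a8b3:4319:dad7"))  # https://[::1]:2379
print(loopback_endpoint("10.0.0.5"))                                # https://127.0.0.1:2379
```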

 
 

Validation Results:

```
$ k3s -v
k3s version v1.27.11+k3s-11b31c28 (11b31c28)
go version go1.21.7
```

```
$ kubectl get nodes,pods -A
NAME                       STATUS   ROLES                       AGE   VERSION
node/i    Ready    control-plane,etcd,master   27s   v1.27.11+k3s-11b31c28

NAMESPACE     NAME                                          READY   STATUS    RESTARTS   AGE
kube-system   pod/coredns-77ccd57875-ddmj4                  1/1     Running   0          12s
kube-system   pod/helm-install-traefik-crd-bf757            1/1     Running   0          12s
kube-system   pod/helm-install-traefik-n9dsv                1/1     Running   0          12s
kube-system   pod/local-path-provisioner-79ffd768b5-vrj6t   1/1     Running   0          12s
kube-system   pod/metrics-server-648b5df564-vhmkf           0/1     Running   0          12s

$ kubectl get node -o yaml | grep node-args
      k3s.io/node-args: '["server","--cluster-cidr","2001:cafe:42::/56","--service-cidr","2001:cafe:43::/108","--cluster-init","true","--node-ip","2600:1f1c:ab4:ee32:c44c:a8b3:4319:dad7","--write-kubeconfig-mode","644"]'
```

```
$ sudo k3s etcd-snapshot save
WARN[0000] Unknown flag --cluster-cidr found in config.yaml, skipping
WARN[0000] Unknown flag --service-cidr found in config.yaml, skipping
WARN[0000] Unknown flag --cluster-init found in config.yaml, skipping
WARN[0000] Unknown flag --node-ip found in config.yaml, skipping
WARN[0000] Unknown flag --write-kubeconfig-mode found in config.yaml, skipping
INFO[0000] Saving etcd snapshot to /var/lib/rancher/k3s/server/db/snapshots/on-demand-i-041ae49edb4c36e85-1708100498
{"level":"info","ts":"2024-02-16T16:21:37.859227Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-i-041ae49edb4c36e85-1708100498.part"}
{"level":"info","ts":"2024-02-16T16:21:37.861482Z","logger":"client","caller":"[email protected]/maintenance.go:212","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2024-02-16T16:21:37.861588Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://[::1]:2379"}
{"level":"info","ts":"2024-02-16T16:21:37.926281Z","logger":"client","caller":"[email protected]/maintenance.go:220","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2024-02-16T16:21:37.936413Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://[::1]:2379","size":"3.0 MB","took":"now"}
{"level":"info","ts":"2024-02-16T16:21:37.936512Z","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-i-041ae49edb4c36e85-1708100498"}
INFO[0000] Reconciling ETCDSnapshotFile resources
INFO[0000] Reconciliation of ETCDSnapshotFile resources complete
```




fmoral2 · Feb 16 '24

Working as expected when the args are set via config, but not when they are passed on the CLI. After discussing with @brandond, we are setting that case aside for now so the rest of the fix can be released.

fmoral2 · Feb 16 '24

Moving this out to the next release so the fix can be extended to CLI args, not just config.

brandond · Feb 17 '24

Validated on Version:

  • v1.27.12+k3s-2d48b196 (2d48b196)

Environment Details

Infrastructure Cloud EC2 instance

Node(s) CPU architecture, OS, and Version: SUSE Linux Enterprise Server 15 SP4

Cluster Configuration:

  • Split roles:
  • 1 server
  • 2 control-plane only
  • 2 etcd only
  • 2 workers

Steps to validate the fix

  1. Create a cluster
  2. Take an etcd snapshot
  3. Validate the new outputs
  4. Restore from the snapshot
  5. Validate the restore
  6. Validate nodes
  7. Validate pods


Validation Results:

```
$ sudo k3s etcd-snapshot save --etcd-s3
FATA[0000] see server log for details: s3 bucket name was not set

$ sudo k3s etcd-snapshot save \
    --s3 \
    --s3-bucket=" " \
    --s3-access-key=" " \
    --s3-secret-key=" + + " \
    --s3-region="us-east-2" \
    --s3-timeout=90s
INFO[0002] Snapshot on-demand-ip-172-31-7-127.us-east-2.compute.internal-1713184771 saved.
```


```
$ sudo k3s server \
    --cluster-reset \
    --etcd-s3 \
    --cluster-reset-restore-path=" " \
    --etcd-s3-bucket=" " \
    --etcd-s3-region=us-east-2 \
    --etcd-s3-access-key=" " \
    --etcd-s3-secret-key=" "

INFO[0014] Managed etcd cluster membership has been reset, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes
```

       
```
$ kubectl get nodes
NAME                                          STATUS   ROLES                       AGE   VERSION
ip- .us-east-2.compute.internal               Ready    etcd                        45m   v1.27.12+k3s-2d48b196
ip- .us-east-2.compute.internal               Ready    control-plane,master        44m   v1.27.12+k3s-2d48b196
ip- .us-east-2.compute.internal               Ready    <none>                      43m   v1.27.12+k3s-2d48b196
ip- .us-east-2.compute.internal               Ready    <none>                      43m   v1.27.12+k3s-2d48b196
ip- .us-east-2.compute.internal               Ready    control-plane,master        44m   v1.27.12+k3s-2d48b196
ip- .us-east-2.compute.internal               Ready    <none>                      42m   v1.27.12+k3s-2d48b196
ip- .us-east-2.compute.internal               Ready    etcd                        44m   v1.27.12+k3s-2d48b196
ip- .us-east-2.compute.internal               Ready    control-plane,etcd,master   47m   v1.27.12+k3s-2d48b196
ip-1                                          Ready    control-plane,etcd,master   18s   v1.27.12+k3s-2d48b196
```
```
$ kubectl get pods -A
NAMESPACE     NAME                                      READY   STATUS      RESTARTS   AGE
kube-system   coredns-77ccd57875-g69xw                  1/1     Running     0          47m
kube-system   helm-install-traefik-56hr2                0/1     Completed   1          47m
kube-system   helm-install-traefik-crd-kbb4h            0/1     Completed   0          47m
kube-system   local-path-provisioner-79ffd768b5-dpv4z   1/1     Running     0          47m
kube-system   metrics-server-c44988498-ssdqv            1/1     Running     0          47m
kube-system   svclb-traefik-737736ff-6vb2t              2/2     Running     0          43m
kube-system   svclb-traefik-737736ff-d5bxl              2/2     Running     0          43m
kube-system   svclb-traefik-737736ff-ff2jq              2/2     Running     0          47m
kube-system   svclb-traefik-737736ff-n8qs7              2/2     Running     0          42m
kube-system   svclb-traefik-737736ff-pr7pq              2/2     Running     0          44m
kube-system   svclb-traefik-737736ff-wfpgq              2/2     Running     0          45m
kube-system   traefik-7d5c94d587-4ns9b                  1/1     Running     0          47m
```

fmoral2 · Apr 15 '24