helm-charts

Issue with formatting drive after node restart in documented Google Kubernetes Engine deployment env

Open jason-da-redpanda opened this issue 1 year ago • 0 comments

What happened?

I followed "Deploy a Redpanda Cluster in Google Kubernetes Engine" with local NVMe storage, plus the [controller decommission operator](https://docs.redpanda.com/current/manage/kubernetes/k-decommission-brokers/#Automated), because I wanted to test how the cluster behaves when a node is bounced.

However, I have problems with the volume when the node (and the Redpanda broker on it) restarts.

Pods:

NAME                                            READY   STATUS             RESTARTS          AGE   IP          NODE                                     NOMINATED NODE   READINESS GATES
redpanda-0                                      1/2     Running            0                 26h   10.20.5.5   gke-jbarlow-default-pool-f442d692-cksn   <none>           <none>
redpanda-1                                      1/2     Running            0                 26h   10.20.3.9   gke-jbarlow-default-pool-96d13c61-rtib   <none>           <none>
redpanda-2                                      1/2     Running            0                 26h   10.20.2.8   gke-jbarlow-default-pool-5c14fe3d-cq4v   <none>           <none>
redpanda-console-54f5b46997-27d5v               1/1     Running            5 (26h ago)       26h   10.20.9.9   gke-jbarlow-default-pool-96d13c61-19cx   <none>           <none>
redpanda-controller-operator-66bd557695-k9l6n   2/2     Running            0                 18d   10.20.2.5   gke-jbarlow-default-pool-96d13c61-rtib   <none>           <none>

The PVCs are bound with the expected StorageClass:


kubectl get persistentvolumeclaim \
 --namespace jason-rp \
 -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,STORAGECLASS:.spec.storageClassName
NAME                 STATUS   STORAGECLASS
datadir-redpanda-0   Bound    csi-driver-lvm-striped-xfs
datadir-redpanda-1   Bound    csi-driver-lvm-striped-xfs
datadir-redpanda-2   Bound    csi-driver-lvm-striped-xfs

When I delete the GKE node, the node is restarted (as expected). The Redpanda broker tries to restart but stays in PodInitializing, because it fails when trying to format the disk (wrong device name):


node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints: topology.kubernetes.io/zone:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/component=redpanda-statefulset,app.kubernetes.io/instance=redpanda,app.kubernetes.io/name=redpanda
Events:
 Type   Reason       Age          From        Message
 ----   ------       ----          ----        -------
 Warning FailedScheduling  3m22s         default-scheduler  0/9 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}, 8 node(s) had volume node affinity conflict. preemption: 0/9 nodes are available: 9 Preemption is not helpful for scheduling..
 Normal  NotTriggerScaleUp 3m20s         cluster-autoscaler pod didn't trigger scale-up:
 Warning FailedScheduling  3m14s (x3 over 3m20s) default-scheduler  0/9 nodes are available: 9 node(s) had volume node affinity conflict. preemption: 0/9 nodes are available: 9 Preemption is not helpful for scheduling..
 Normal  Scheduled     3m5s          default-scheduler  Successfully assigned jason-rp/redpanda-2 to gke-jbarlow-default-pool-5c14fe3d-cq4v
 Warning FailedMount    2m1s (x8 over 3m5s)  kubelet       MountVolume.SetUp failed for volume "pvc-1c82c325-0351-4c60-a219-2868faa348a5" : rpc error: code = Unknown desc = unable to mount lv: unable to format lv:pvc-1c82c325-0351-4c60-a219-2868faa348a5 err:exit status 1 output:Error accessing specified device /dev/csi-lvm/pvc-1c82c325-0351-4c60-a219-2868faa348a5: No such file or directory
Usage: mkfs.xfs
/* blocksize */     [-b size=num]
/* config file */   [-c options=xxx]
/* metadata */      [-m crc=0|1,finobt=0|1,uuid=xxx,rmapbt=0|1,reflink=0|1,
                inobtcount=0|1,bigtime=0|1]
/* data subvol */   [-d agcount=n,agsize=n,file,name=xxx,size=num,
                (sunit=value,swidth=value|su=num,sw=num|noalign),
                sectsize=num
/* force overwrite */ [-f]
/* inode size */    [-i perblock=n|size=num,maxpct=n,attr=0|1|2,
                projid32bit=0|1,sparse=0|1]
/* no discard */    [-K]
/* log subvol */    [-l agnum=n,internal,size=num,logdev=xxx,version=n
                sunit=value|su=num,sectsize=num,lazy-count=0|1]
/* label */       [-L label (maximum 12 characters)]
/* naming */       [-n size=num,version=2|ci,ftype=0|1]
/* no-op info only */ [-N]
/* prototype file */  [-p fname]
/* quiet */       [-q]
/* realtime subvol */ [-r extsize=num,size=num,rtdev=xxx]
/* sectorsize */    [-s size=num]
/* version */      [-V]
              devicename
<devicename> is required unless -d name=xxx is given.
<num> is xxx (bytes), xxxs (sectors), xxxb (fs blocks), xxxk (xxx KiB),
   xxxm (xxx MiB), xxxg (xxx GiB), xxxt (xxx TiB) or xxxp (xxx PiB).
<value> is xxx (512 byte blocks).

I can get around it by deleting the PVC and the pod, but that rather defeats the object: based on the docs, the StorageClass should handle the formatting.
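
The workaround described above could look roughly like the sketch below. It is only an illustration, not the documented procedure: the function name `recreate_broker_storage`, the `run` helper, and the `DRY_RUN` switch are hypothetical, and it assumes the PVC follows the `datadir-<pod>` naming shown in the PVC listing. Deleting the PVC discards whatever data that broker held on the volume.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Delete the broker's PVC and pod so the StatefulSet recreates both.
# $1 = namespace, $2 = pod name (the PVC is assumed to be datadir-<pod>).
recreate_broker_storage() {
  # --wait=false: the PVC stays in Terminating until the pod that
  # references it is gone, so do not block on the first delete.
  run kubectl delete pvc "datadir-$2" --namespace "$1" --wait=false
  run kubectl delete pod "$2" --namespace "$1"
}
```

For the broker in the events above this would be called as `recreate_broker_storage jason-rp redpanda-2`. Setting `DRY_RUN=1` first prints the two `kubectl` commands without running them.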

What did you expect to happen?

Formatting should be handled correctly after a node restart.

(Note that this may be a CSI driver issue, but the procedure is in our documentation.)
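
One way to start narrowing down whether the CSI driver is involved is to compare the node that the PersistentVolume is pinned to with the labels of the recreated node, since the scheduler events report a volume node affinity conflict. The sketch below is an assumption about how to inspect this, not a confirmed diagnosis; `run`, `pv_node_affinity`, `node_labels`, and `DRY_RUN` are hypothetical names.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Print the node affinity values recorded on a PersistentVolume.
# $1 = PV name, for example the pvc-... name from the FailedMount event.
pv_node_affinity() {
  run kubectl get pv "$1" \
    -o "jsonpath={.spec.nodeAffinity.required.nodeSelectorTerms[*].matchExpressions[*].values[*]}"
}

# Print the labels of a node so they can be compared with the PV's affinity.
# $1 = node name.
node_labels() {
  run kubectl get node "$1" --show-labels
}
```

With the names from the output above, this would be `pv_node_affinity pvc-1c82c325-0351-4c60-a219-2868faa348a5` followed by `node_labels gke-jbarlow-default-pool-5c14fe3d-cq4v`.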

How can we reproduce it (as minimally and precisely as possible)? Please include values file.

Follow "Deploy a Redpanda Cluster in Google Kubernetes Engine" with local NVMe storage.

Then restart one of the nodes that a Redpanda broker runs on.
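
The restart step could be scripted roughly as follows. This is a sketch under assumptions: the issue does not say how the node was deleted, so deleting the underlying Compute Engine instance with `gcloud` is a guess, and `run`, `node_of_pod`, `zone_of_node`, `delete_node_vm`, and `DRY_RUN` are hypothetical names. It relies on the GKE node name matching the instance name.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Print the name of the node a pod is scheduled on. $1 = namespace, $2 = pod.
node_of_pod() {
  run kubectl get pod "$2" --namespace "$1" -o 'jsonpath={.spec.nodeName}'
}

# Print the zone label of a node. $1 = node name.
zone_of_node() {
  run kubectl get node "$1" \
    -o 'jsonpath={.metadata.labels.topology\.kubernetes\.io/zone}'
}

# Delete the VM behind a node so the node pool recreates it.
# $1 = node (instance) name, $2 = zone.
delete_node_vm() {
  run gcloud compute instances delete "$1" --zone "$2" --quiet
}
```

A possible sequence is `node=$(node_of_pod jason-rp redpanda-2)`, then `zone=$(zone_of_node "$node")`, then `delete_node_vm "$node" "$zone"`, followed by `kubectl get pods --namespace jason-rp --watch` to observe the broker pod.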

Anything else we need to know?

No response

Which are the affected charts?

No response

Chart Version(s)

$ helm -n <redpanda-release-namespace> list 
# paste output here

Cloud provider

google

JIRA Link: K8S-105

jason-da-redpanda, Feb 21 '24 11:02