Cannot add a new node to the Valkey cluster
Name and Version
bitnami/valkey-cluster:8.0.2-debian-12-r0
What architecture are you using?
amd64
What steps will reproduce the bug?
- On an RKE2 cluster, after setting up the Helm repo, I am trying to run the Valkey chart (valkey-cluster-2.1.1) following the README.
- Command used to install the Helm chart:
helm install sessiondata oci://registry-1.docker.io/bitnamicharts/valkey-cluster
- It spawned 6 pods.
The pods look like this:
$ kubectl --namespace redis get pods
NAME                           READY   STATUS    RESTARTS      AGE
sessiondata-valkey-cluster-0   2/2     Running   1 (11m ago)   12m
sessiondata-valkey-cluster-1   2/2     Running   1 (11m ago)   12m
sessiondata-valkey-cluster-2   2/2     Running   1 (11m ago)   12m
sessiondata-valkey-cluster-3   2/2     Running   1 (11m ago)   12m
sessiondata-valkey-cluster-4   2/2     Running   1 (11m ago)   12m
sessiondata-valkey-cluster-5   2/2     Running   1 (11m ago)   12m
When I tried to scale out the cluster by issuing helm upgrade --timeout 600s sessiondata --set "password=${VALKEY_PASSWORD},cluster.nodes=8,cluster.update.addNodes=true,cluster.update.currentNumberOfNodes=6" oci://REGISTRY_NAME/REPOSITORY_NAME/valkey-cluster, I expected the new pods to become ready, followed by the post-upgrade Helm hook that joins them to the cluster. Instead, the upgrade never finishes because the readiness probe keeps failing.
$ kubectl --namespace redis get pods
NAME                           READY   STATUS    RESTARTS      AGE
sessiondata-valkey-cluster-0   2/2     Running   1 (11m ago)   12m
sessiondata-valkey-cluster-1   2/2     Running   1 (11m ago)   12m
sessiondata-valkey-cluster-2   2/2     Running   1 (11m ago)   12m
sessiondata-valkey-cluster-3   2/2     Running   1 (11m ago)   12m
sessiondata-valkey-cluster-4   2/2     Running   1 (11m ago)   12m
sessiondata-valkey-cluster-5   2/2     Running   1 (11m ago)   12m
sessiondata-valkey-cluster-6   1/2     Running   0             10m
sessiondata-valkey-cluster-7   1/2     Running   0             10m
$ kubectl --namespace redis describe pod sessiondata-valkey-cluster-6
...
Events:
Type     Reason                  Age                    From                     Message
----     ------                  ----                   ----                     -------
Warning  FailedScheduling        9m21s                  default-scheduler        0/11 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/11 nodes are available: 11 Preemption is not helpful for scheduling..
Warning  FailedScheduling        9m19s                  default-scheduler        0/11 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/11 nodes are available: 11 Preemption is not helpful for scheduling..
Normal   Scheduled               9m17s                  default-scheduler        Successfully assigned redis/sessiondata-valkey-cluster-6 to review-istio-vm-tryout-worker5.local
Normal   SuccessfulAttachVolume  9m7s                   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-d432896c-3902-4fd4-9ea1-96be58c2f0da"
Normal   Pulled                  9m6s                   kubelet                  Container image "docker.io/bitnami/valkey-cluster:8.0.2-debian-12-r0" already present on machine
Normal   Created                 9m6s                   kubelet                  Created container sessiondata-valkey-cluster
Normal   Started                 9m5s                   kubelet                  Started container sessiondata-valkey-cluster
Normal   Pulled                  9m5s                   kubelet                  Container image "docker.io/bitnami/redis-exporter:1.67.0-debian-12-r0" already present on machine
Normal   Created                 9m5s                   kubelet                  Created container metrics
Normal   Started                 9m5s                   kubelet                  Started container metrics
Warning  Unhealthy               4m1s (x67 over 8m56s)  kubelet                  Readiness probe failed: cluster_state:fail
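For reference, the cluster state the probe is checking can be inspected from inside one of the new pods (a diagnostic sketch using this deployment's names; REDISCLI_AUTH is already set in the container, so valkey-cli needs no password flag):
$ kubectl --namespace redis exec sessiondata-valkey-cluster-6 -c sessiondata-valkey-cluster -- valkey-cli cluster info
$ kubectl --namespace redis exec sessiondata-valkey-cluster-6 -c sessiondata-valkey-cluster -- valkey-cli cluster nodes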
Are you using any custom parameters or values?
Initial deployment:
cluster:
  init: true
  nodes: 6
  replicas: 1
existingSecret: sessiondata-valkey-cluster-password
existingSecretPasswordKey: password
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
networkPolicy:
  enabled: false
pdb:
  create: true
  maxUnavailable: 1
persistence:
  size: 100Mi
  storageClass: local-performance-noreplica-best-effort
valkey:
  resources:
    requests:
      memory: "128Mi"
      cpu: "100m"
    limits:
      cpu: "10"
      memory: "200Mi"
Values for the scale-out:
cluster:
  init: true
  nodes: 8
  replicas: 1
  update:
    currentNumberOfNodes: 6
    addNodes: true
existingSecret: sessiondata-valkey-cluster-password
existingSecretPasswordKey: password
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
networkPolicy:
  enabled: false
pdb:
  create: true
  maxUnavailable: 1
persistence:
  size: 100Mi
  storageClass: local-performance-noreplica-best-effort
valkey:
  resources:
    requests:
      memory: "128Mi"
      cpu: "100m"
    limits:
      cpu: "10"
      memory: "200Mi"
What is the expected behavior?
The new pods should join the existing Valkey cluster.
What do you see instead?
The new pods never become ready and show this output:
COPYING FILE
valkey-cluster 15:07:31.07 INFO ==>
valkey-cluster 15:07:31.08 INFO ==> Welcome to the Bitnami valkey-cluster container
valkey-cluster 15:07:31.08 INFO ==> Subscribe to project updates by watching https://github.com/bitnami/containers
valkey-cluster 15:07:31.08 INFO ==> Did you know there are enterprise versions of the Bitnami catalog? For enhanced secure software supply chain features, unlimited pulls from Docker, LTS support, or application customization, see Bitnami Premium or Tanzu Application Catalog. See https://www.arrow.com/globalecs/na/vendors/bitnami/ for more information.
valkey-cluster 15:07:31.08 INFO ==>
valkey-cluster 15:07:31.09 INFO ==> ** Starting Valkey setup **
valkey-cluster 15:07:31.11 INFO ==> Initializing Valkey
valkey-cluster 15:07:31.13 INFO ==> Setting Valkey config file
Storing map with hostnames and IPs
valkey-cluster 15:07:36.32 INFO ==> ** Valkey setup finished! **
1:C 04 Feb 2025 15:07:36.360 # WARNING: Changing databases number from 16 to 1 since we are in cluster mode
1:C 04 Feb 2025 15:07:36.361 # WARNING Your system is configured to use the 'xen' clocksource which might lead to degraded performance. Check the result of the [slow-clocksource] system check: run 'valkey-server --check-system' to check if the system's clocksource isn't degrading performance.
1:C 04 Feb 2025 15:07:36.361 * oO0OoO0OoO0Oo Valkey is starting oO0OoO0OoO0Oo
1:C 04 Feb 2025 15:07:36.361 * Valkey version=8.0.2, bits=64, commit=00000000, modified=1, pid=1, just started
1:C 04 Feb 2025 15:07:36.361 * Configuration loaded
1:M 04 Feb 2025 15:07:36.361 * monotonic clock: POSIX clock_gettime
.+^+.
.+#########+.
.+########+########+. Valkey 8.0.2 (00000000/1) 64 bit
.+########+' '+########+.
.########+' .+. '+########. Running in cluster mode
|####+' .+#######+. '+####| Port: 6379
|###| .+###############+. |###| PID: 1
|###| |#####*'' ''*#####| |###|
|###| |####' .-. '####| |###|
|###| |###( (@@@) )###| |###| https://valkey.io
|###| |####. '-' .####| |###|
|###| |#####*. .*#####| |###|
|###| '+#####| |#####+' |###|
|####+. +##| |#+' .+####|
'#######+ |##| .+########'
'+###| |##| .+########+'
'| |####+########+'
+#########+'
'+v+'
1:M 04 Feb 2025 15:07:36.362 * No cluster configuration found, I'm b047488e04a554340168578e461f4577998a34c9
1:M 04 Feb 2025 15:07:36.377 * Server initialized
1:M 04 Feb 2025 15:07:36.386 * Creating AOF base file appendonly.aof.1.base.rdb on server start
1:M 04 Feb 2025 15:07:36.401 * Creating AOF incr file appendonly.aof.1.incr.aof on server start
1:M 04 Feb 2025 15:07:36.401 * Ready to accept connections tcp
Hi @ilyapelyovin,
I could not reproduce the issue on GKE (without any custom storageClass). Could you please try to reproduce it without setting a custom storageClass?
PS: I detected an issue where the job in charge of updating the cluster checks old IPs for already-existing nodes. However, the cluster is updated properly. I have created a task to solve it.
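(If it helps with debugging: the logs of that update job can be fetched roughly as follows; replace the placeholder with the job name that the first command lists.)
kubectl --namespace redis get jobs
kubectl --namespace redis logs job/<cluster-update-job-name>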
Hi @dgomezleon, thank you for your prompt reply!
I tried to redeploy the chart using the default StorageClass (which is Longhorn in my case, since this is an on-prem Kubernetes cluster), but the result doesn't seem to change.
Please find the results here:
Storage classes:
> kubectl --namespace redis get storageclasses.storage.k8s.io
NAME                                      PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
local-performance                         driver.longhorn.io   Delete          Immediate           true                   30d
local-performance-best-effort             driver.longhorn.io   Delete          Immediate           true                   26d
local-performance-best-effort-noreplica   driver.longhorn.io   Delete          Immediate           true                   26d
local-performance-non-raid                driver.longhorn.io   Delete          Immediate           true                   30d
local-performance-noreplica               driver.longhorn.io   Delete          Immediate           true                   28d
local-performance-noreplica-best-effort   driver.longhorn.io   Delete          Immediate           true                   3d3h
local-standard                            driver.longhorn.io   Delete          Immediate           true                   30d
longhorn (default)                        driver.longhorn.io   Delete          Immediate           true                   30d
longhorn-static                           driver.longhorn.io   Delete          Immediate           true                   30d
shared-performance                        driver.longhorn.io   Delete          Immediate           true                   30d
shared-standard                           driver.longhorn.io   Delete          Immediate           true                   30d
Initial Helm values:
cluster:
  init: true
  nodes: 6
  replicas: 1
  #update:
  #  currentNumberOfNodes: 6
  #  addNodes: true
existingSecret: sessiondata-valkey-cluster-password
existingSecretPasswordKey: password
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
networkPolicy:
  enabled: false
pdb:
  create: true
  maxUnavailable: 1
persistence:
  size: 100Mi
  #storageClass: local-performance-noreplica-best-effort
valkey:
  resources:
    requests:
      memory: "128Mi"
      cpu: "100m"
    limits:
      cpu: "10"
      memory: "200Mi"
Initial deployment:
> kubectl --namespace redis get pods
NAME                           READY   STATUS    RESTARTS       AGE
sessiondata-valkey-cluster-0   2/2     Running   1 (118s ago)   2m23s
sessiondata-valkey-cluster-1   2/2     Running   2 (101s ago)   2m23s
sessiondata-valkey-cluster-2   2/2     Running   0              2m23s
sessiondata-valkey-cluster-3   2/2     Running   0              2m23s
sessiondata-valkey-cluster-4   2/2     Running   2 (115s ago)   2m23s
sessiondata-valkey-cluster-5   2/2     Running   1 (2m1s ago)   2m23s
Persistent volume claims (default StorageClass is used):
> kubectl --namespace redis get pvc
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
valkey-data-sessiondata-valkey-cluster-0   Bound    pvc-d772c278-826e-4d24-a2d1-75b0c96f35c0   100Mi      RWO            longhorn       3m56s
valkey-data-sessiondata-valkey-cluster-1   Bound    pvc-c9230845-b8c3-4bfd-9e91-4763270e34ec   100Mi      RWO            longhorn       3m56s
valkey-data-sessiondata-valkey-cluster-2   Bound    pvc-a7b81e24-7652-428e-95b8-80cecde4151c   100Mi      RWO            longhorn       3m56s
valkey-data-sessiondata-valkey-cluster-3   Bound    pvc-928a8275-dd1c-42de-b2b4-392a360cf294   100Mi      RWO            longhorn       3m56s
valkey-data-sessiondata-valkey-cluster-4   Bound    pvc-9d5a241d-287f-4be8-94a5-8957afa05735   100Mi      RWO            longhorn       3m56s
valkey-data-sessiondata-valkey-cluster-5   Bound    pvc-a0a3bc7f-352a-465d-9f22-08310f920319   100Mi      RWO            longhorn       3m56s
Helm values for the cluster scale-out:
cluster:
  init: true
  nodes: 8
  replicas: 1
  update:
    currentNumberOfNodes: 6
    addNodes: true
existingSecret: sessiondata-valkey-cluster-password
existingSecretPasswordKey: password
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
networkPolicy:
  enabled: false
pdb:
  create: true
  maxUnavailable: 1
persistence:
  size: 100Mi
  #storageClass: local-performance-noreplica-best-effort
valkey:
  resources:
    requests:
      memory: "128Mi"
      cpu: "100m"
    limits:
      cpu: "10"
      memory: "200Mi"
Scaled deployment:
> kubectl --namespace redis get pods
NAME                           READY   STATUS    RESTARTS        AGE
sessiondata-valkey-cluster-0   2/2     Running   1 (9m13s ago)   9m38s
sessiondata-valkey-cluster-1   2/2     Running   2 (8m56s ago)   9m38s
sessiondata-valkey-cluster-2   2/2     Running   0               9m38s
sessiondata-valkey-cluster-3   2/2     Running   0               9m38s
sessiondata-valkey-cluster-4   2/2     Running   2 (9m10s ago)   9m38s
sessiondata-valkey-cluster-5   2/2     Running   1 (9m16s ago)   9m38s
sessiondata-valkey-cluster-6   1/2     Running   1 (2m22s ago)   2m53s
sessiondata-valkey-cluster-7   1/2     Running   0               2m53s
Describe pod 6:
> kubectl --namespace redis describe pod sessiondata-valkey-cluster-6
Name: sessiondata-valkey-cluster-6
Namespace: redis
Priority: 0
Service Account: sessiondata-valkey-cluster
Node: review-istio-vm-tryout-worker4.local/10.0.8.126
Start Time: Fri, 07 Feb 2025 12:53:43 +0000
Labels: app.kubernetes.io/instance=sessiondata
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=valkey-cluster
app.kubernetes.io/version=8.0.2
controller-revision-hash=sessiondata-valkey-cluster-56f997fbdb
helm.sh/chart=valkey-cluster-2.1.1_27f116c09962
statefulset.kubernetes.io/pod-name=sessiondata-valkey-cluster-6
Annotations: checksum/config: 6cb547e0d05a427ee7873bea46f7e048d473560cefde499d256e0a7cbe170680
checksum/scripts: 8af2bd04e554a88c9947511115212060578b2a020e472e8caff12e8d4e392b46
cni.projectcalico.org/containerID: 381d48f94bc0270144ba58d80b1c579ebe4b1206ca8bdff98605685981d05e2c
cni.projectcalico.org/podIP: 10.42.10.68/32
cni.projectcalico.org/podIPs: 10.42.10.68/32
prometheus.io/port: 9121
prometheus.io/scrape: true
Status: Running
IP: 10.42.10.68
IPs:
IP: 10.42.10.68
Controlled By: StatefulSet/sessiondata-valkey-cluster
Containers:
sessiondata-valkey-cluster:
Container ID: containerd://65530e762dc042d042172c23f42fc6e9ba4f5a14fa2a0df847ac75c339797582
Image: docker.io/bitnami/valkey-cluster:8.0.2-debian-12-r0
Image ID: docker.io/bitnami/valkey-cluster@sha256:6a8d18eb957fbae040d0351651e1cb24309eb6ee95fc6c8e98bdc30836fa3ae1
Ports: 6379/TCP, 16379/TCP
Host Ports: 0/TCP, 0/TCP
SeccompProfile: RuntimeDefault
Command:
/bin/bash
-c
Args:
# Backwards compatibility change
if ! [[ -f /opt/bitnami/valkey/etc/valkey.conf ]]; then
echo COPYING FILE
cp /opt/bitnami/valkey/etc/valkey-default.conf /opt/bitnami/valkey/etc/valkey.conf
fi
pod_index=($(echo "$POD_NAME" | tr "-" "\n"))
pod_index="${pod_index[-1]}"
if [[ "$pod_index" == "0" ]]; then
export VALKEY_CLUSTER_CREATOR="yes"
export VALKEY_CLUSTER_REPLICAS="1"
fi
/opt/bitnami/scripts/valkey-cluster/entrypoint.sh /opt/bitnami/scripts/valkey-cluster/run.sh
State: Running
Started: Fri, 07 Feb 2025 12:54:11 +0000
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Fri, 07 Feb 2025 12:53:55 +0000
Finished: Fri, 07 Feb 2025 12:54:10 +0000
Ready: False
Restart Count: 1
Limits:
cpu: 10
memory: 200Mi
Requests:
cpu: 100m
memory: 128Mi
Liveness: exec [sh -c /scripts/ping_liveness_local.sh 5] delay=5s timeout=6s period=5s #success=1 #failure=5
Readiness: exec [sh -c /scripts/ping_readiness_local.sh 1] delay=5s timeout=2s period=5s #success=1 #failure=5
Environment:
POD_NAME: sessiondata-valkey-cluster-6 (v1:metadata.name)
VALKEY_NODES: sessiondata-valkey-cluster-0.sessiondata-valkey-cluster-headless sessiondata-valkey-cluster-1.sessiondata-valkey-cluster-headless sessiondata-valkey-cluster-2.sessiondata-valkey-cluster-headless sessiondata-valkey-cluster-3.sessiondata-valkey-cluster-headless sessiondata-valkey-cluster-4.sessiondata-valkey-cluster-headless sessiondata-valkey-cluster-5.sessiondata-valkey-cluster-headless sessiondata-valkey-cluster-6.sessiondata-valkey-cluster-headless sessiondata-valkey-cluster-7.sessiondata-valkey-cluster-headless
REDISCLI_AUTH: <set to the key 'password' in secret 'sessiondata-valkey-cluster-password'> Optional: false
VALKEY_PASSWORD: <set to the key 'password' in secret 'sessiondata-valkey-cluster-password'> Optional: false
VALKEY_AOF_ENABLED: yes
VALKEY_TLS_ENABLED: no
VALKEY_PORT_NUMBER: 6379
Mounts:
/bitnami/valkey/data from valkey-data (rw)
/opt/bitnami/valkey/etc/ from empty-dir (rw,path="app-conf-dir")
/opt/bitnami/valkey/etc/valkey-default.conf from default-config (rw,path="valkey-default.conf")
/opt/bitnami/valkey/logs from empty-dir (rw,path="app-logs-dir")
/opt/bitnami/valkey/tmp from empty-dir (rw,path="app-tmp-dir")
/scripts from scripts (rw)
/tmp from empty-dir (rw,path="tmp-dir")
metrics:
Container ID: containerd://3bef8600accfb515daf1aee8ad7725e97830dfd4ffd5e6d187a7aabb201dd071
Image: docker.io/bitnami/redis-exporter:1.67.0-debian-12-r0
Image ID: docker.io/bitnami/redis-exporter@sha256:5bc3229b94f62b593600ee74d0cd16c7a74df31852eb576bdc0f5e663c8e1337
Port: 9121/TCP
Host Port: 0/TCP
SeccompProfile: RuntimeDefault
Command:
/bin/bash
-c
redis_exporter
State: Running
Started: Fri, 07 Feb 2025 12:53:55 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 150m
ephemeral-storage: 2Gi
memory: 192Mi
Requests:
cpu: 100m
ephemeral-storage: 50Mi
memory: 128Mi
Environment:
BITNAMI_DEBUG: false
REDIS_ALIAS: sessiondata-valkey-cluster
REDIS_ADDR: redis://127.0.0.1:6379
REDIS_PASSWORD: <set to the key 'password' in secret 'sessiondata-valkey-cluster-password'> Optional: false
REDIS_EXPORTER_WEB_LISTEN_ADDRESS: :9121
Mounts: <none>
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
valkey-data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: valkey-data-sessiondata-valkey-cluster-6
ReadOnly: false
scripts:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: sessiondata-valkey-cluster-scripts
Optional: false
default-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: sessiondata-valkey-cluster-default
Optional: false
empty-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type     Reason                  Age                    From                     Message
----     ------                  ----                   ----                     -------
Warning  FailedScheduling        3m23s                  default-scheduler        0/11 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/11 nodes are available: 11 Preemption is not helpful for scheduling..
Warning  FailedScheduling        3m21s                  default-scheduler        0/11 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/11 nodes are available: 11 Preemption is not helpful for scheduling..
Normal   Scheduled               3m19s                  default-scheduler        Successfully assigned redis/sessiondata-valkey-cluster-6 to review-istio-vm-tryout-worker4.local
Normal   SuccessfulAttachVolume  3m9s                   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-c2b1acdb-fe0a-4f51-952d-26dc5602e804"
Normal   Started                 3m7s                   kubelet                  Started container metrics
Normal   Pulled                  3m7s                   kubelet                  Container image "docker.io/bitnami/redis-exporter:1.67.0-debian-12-r0" already present on machine
Normal   Created                 3m7s                   kubelet                  Created container metrics
Normal   Created                 2m51s (x2 over 3m8s)   kubelet                  Created container sessiondata-valkey-cluster
Normal   Started                 2m51s (x2 over 3m7s)   kubelet                  Started container sessiondata-valkey-cluster
Normal   Pulled                  2m51s (x2 over 3m8s)   kubelet                  Container image "docker.io/bitnami/valkey-cluster:8.0.2-debian-12-r0" already present on machine
Warning  Unhealthy               2m43s (x3 over 2m58s)  kubelet                  Readiness probe failed: Could not connect to Valkey at localhost:6379: Connection refused
Warning  Unhealthy               2m43s (x3 over 2m58s)  kubelet                  Liveness probe failed: Could not connect to Valkey at localhost:6379: Connection refused
Warning  Unhealthy               118s (x10 over 2m38s)  kubelet                  Readiness probe failed: cluster_state:fail
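The cluster_state:fail message suggests the readiness script gates on CLUSTER INFO; a minimal approximation of that check (not the chart's exact /scripts/ping_readiness_local.sh) would be:
#!/bin/bash
# Approximate readiness check: the node only counts as ready once it
# reports cluster_state:ok; REDISCLI_AUTH supplies the password in-pod.
state="$(valkey-cli -h localhost -p "${VALKEY_PORT_NUMBER:-6379}" cluster info | grep cluster_state)"
[[ "$state" =~ "cluster_state:ok" ]] && exit 0
echo "$state"
exit 1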
Thanks for sharing this info.
I was able to reproduce it with 12 nodes, so it may be related to the cluster size. I will add it to the task.
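In the meantime, a possible manual workaround (a sketch, not an official procedure) is to join the new nodes and rebalance from any ready pod:
# Run inside a ready pod, e.g. sessiondata-valkey-cluster-0; REDISCLI_AUTH handles auth.
# Add pod 6 as a new node (repeat add-node for pod 7), then move slots onto the empty primaries.
valkey-cli --cluster add-node sessiondata-valkey-cluster-6.sessiondata-valkey-cluster-headless:6379 sessiondata-valkey-cluster-0.sessiondata-valkey-cluster-headless:6379
valkey-cli --cluster rebalance sessiondata-valkey-cluster-0.sessiondata-valkey-cluster-headless:6379 --cluster-use-empty-masters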
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
Hi @dgomezleon, has this been erroneously closed by the stale bot?