
K8SSAND-863 ⁃ Replacing nodes

github-vincent-miszczak opened this issue • 23 comments

What did you do? Running a 3-node cluster (1 node per AZ) on EKS with K8ssandra 1.3.1, I terminated a Kubernetes node in AZ a that was running a k8ssandra-managed Cassandra pod using a local volume. A new node was added to the Kubernetes cluster after the old node died.

Because the volume is local, the pod became unschedulable.

I added the annotation volumehealth.storage.kubernetes.io/health: inaccessible to the PVC, following https://github.com/datastax/cass-operator/pull/224. Nothing happened.

I removed the PVC manually. A new PVC and a corresponding volume were automatically created on the new node, and the pod was rescheduled there. The pod stays Unhealthy. Logs show that it tries to contact the old pod but gets a timeout, as the old pod is dead:

io.netty.channel.ConnectTimeoutException: connection timed out: /10.220.35.109:7000

I edited the cassandradatacenters.cassandra.datastax.com resource, setting .spec.replaceNodes: ["k8ssandra-test-eu-west-1a-sts-0"] following https://github.com/k8ssandra/cass-operator/blob/5ed8c3733b665524636e8526d9cb2d017e802b16/pkg/reconciliation/reconcile_racks.go#L1059 and https://github.com/k8ssandra/cass-operator/issues/78. Looking at the updated resource, I see that the value from .spec.replaceNodes moved to .status.nodeReplacements as expected. Nothing more happened (no log in the operator).

Querying the cluster status using nodetool from a healthy pod, I can see:

Datacenter: test
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load        Tokens  Owns (effective)  Host ID                               Rack      
UJ  10.220.33.223  1.89 MiB    16      ?                 8619c8c9-6a6b-4d36-90a6-d3e9108823e5  eu-west-1a
UN  10.220.41.145  196.86 KiB  16      100.0%            2520f181-bd45-485e-a2ae-7fba73a5f3de  eu-west-1c
DN  10.220.35.109  177.62 KiB  16      100.0%            d082a3f7-29b8-4e5c-a204-e4fd99dbd497  eu-west-1a
UN  10.220.37.169  252.57 KiB  16      100.0%            ee679f79-2975-4517-90d6-8f0e2935c765  eu-west-1b

The new pod stays in this state, with the following event:

 Warning  Unhealthy  46s (x177 over 30m)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500

Running exec in the new Cassandra pod, ps auxwww never shows an attempt to start with --replace_address.

Issuing nodetool removenode d082a3f7-29b8-4e5c-a204-e4fd99dbd497 makes the new pod READY. But the CassandraDatacenter resource is then wrong: nodeReplacements is empty, yet the status has kept the old host ID, which is inconsistent.

  nodeReplacements: []
  nodeStatuses:
    k8ssandra-test-eu-west-1a-sts-0:
      hostID: d082a3f7-29b8-4e5c-a204-e4fd99dbd497

I also tried some variations around this; nothing triggered a replacement.

Given that I got no logs and there is no documentation, I'm a bit out of ideas, and help would be welcome.

Did you expect to see something different? I expected some documentation about replacing nodes. I expected replacing dead nodes to be one of the most straightforward features of an operator, as this is something you have to do at some point. I expected error messages or progress messages. I got none.

Environment: EKS 1.21, K8ssandra 1.3.1, cass-operator 1.7.1, local storage using rancher/local-path-provisioner:v0.0.20

  • Manifests:
  cassandra:
    cassandraLibDirVolume:
      storageClass: local
    datacenters:
      - name: test
        size: 3
        racks:
          - name: eu-west-1a
            affinityLabels:
              topology.kubernetes.io/zone: eu-west-1a
          - name: eu-west-1b
            affinityLabels:
              topology.kubernetes.io/zone: eu-west-1b
          - name: eu-west-1c
            affinityLabels:
              topology.kubernetes.io/zone: eu-west-1c
  stargate:
    enabled: false
  medusa:
    enabled: false
  • Cass Operator Logs:
N/A


github-vincent-miszczak avatar Sep 02 '21 14:09 github-vincent-miszczak

Is this the same as #110? That one was fixed in #141.

burmanm avatar Sep 02 '21 15:09 burmanm

@burmanm, no, because I'm not supposed to issue nodetool removenode in the first place; but if I do, the result is inconsistent. So it's somewhat related, but not the same at all.

github-vincent-miszczak avatar Sep 02 '21 15:09 github-vincent-miszczak

I added the annotation volumehealth.storage.kubernetes.io/health: inaccessible to the PVC, following datastax/cass-operator#224. Nothing happened

That annotation is not part of standard Kubernetes; it's a feature only of the vSphere Kubernetes distribution. replaceNodes should have been the only thing needed in this case.

burmanm avatar Sep 02 '21 15:09 burmanm

@github-vincent-miszczak can you share your cass-operator logs?

jsanda avatar Sep 02 '21 15:09 jsanda

I recreated a cluster from scratch to start clean. From the running cluster, I killed the Kubernetes node in AZ a. The state was:

k8ssandra-test-eu-west-1a-sts-0                     0/2     Pending   0          5m19s
k8ssandra-test-eu-west-1b-sts-0                     2/2     Running   0          15m
k8ssandra-test-eu-west-1c-sts-0                     2/2     Running   0          15m

The pod was Pending because of the local storage, as expected:

Warning  FailedScheduling   57s (x6 over 6m56s)   default-scheduler   0/3 nodes are available: 1 node(s) had volume node affinity conflict, 2 node(s) didn't match Pod's node affinity/selector.

Then I ran kubectl edit cassandradatacenters.cassandra.datastax.com test and added .spec.replaceNodes: ["k8ssandra-test-eu-west-1a-sts-0"]. Checking the resource afterwards, I got:

 nodeReplacements:
  - k8ssandra-test-eu-west-1a-sts-0
  nodeStatuses:
    k8ssandra-test-eu-west-1a-sts-0:
      hostID: f41a0a54-9822-4bae-bfa0-94fec9c8a6b4
    k8ssandra-test-eu-west-1b-sts-0:
      hostID: 1f161ac4-8586-4972-be28-5819895379dd
    k8ssandra-test-eu-west-1c-sts-0:
      hostID: a78bd14a-52f8-4254-acc5-854cffa828d6

Nothing happened.

You'll find the operator logs attached. They begin after the operator restarted (it was running in AZ a, which I killed, so it was rescheduled) and run up to after I checked that the replacement was in place in the CassandraDatacenter and waited some time (nothing more had happened by the time I wrote this comment).

Log: replace.log

github-vincent-miszczak avatar Sep 02 '21 16:09 github-vincent-miszczak

@burmanm, I understand I won't be able to use it at the moment, but the annotation system looked nice for my use case: I will have nodes dying regularly, I don't want to have to care about them, and I expect a program to handle this for me. It would have been easy to make a controller that puts annotations on unrecoverable volumes so that the corresponding nodes would be replaced automatically.

I guess I can do the same with the current system: get the CassandraDatacenter resource, deduce which pods are unschedulable because of a lost volume/node, and patch the resource if necessary. WDYT?

Do you guys have code/experience to share with replacing nodes automatically?

github-vincent-miszczak avatar Sep 02 '21 16:09 github-vincent-miszczak

How would you detect an unrecoverable PV/PVC? We could certainly add such detection to cass-operator to replace nodes that will never come up again.

burmanm avatar Sep 02 '21 18:09 burmanm

It would have been easy to make a controller that puts annotations on unrecoverable volumes

What criteria would you use to determine that a volume is unrecoverable?

I suppose that if you are using local storage and the k8s worker is completely removed, then we would know that the volume is gone. Aside from that, I am not sure how we could safely and reliably make that determination. Certainly open to ideas though.

jsanda avatar Sep 03 '21 02:09 jsanda

My very initial thoughts on determining unrecoverable storage:

  • get the list of volumes grouped by node. All my local volumes are annotated with the corresponding node for scheduling; this could eventually be refined with other metadata if the Kube cluster hosts other workloads
  • get the list of nodes in the Kube cluster
  • for each list of volumes by node, if the node does not exist in the list of Kube nodes, then the corresponding volumes are unrecoverable

Example of PV node affinity:

Node Affinity:     
  Required Terms:  
    Term 0:        kubernetes.io/hostname in [XXXX]

This does not have to run inside the operator (but it may). Using volume annotations allows splitting the logic, and users could have something that better fits their needs simply by choosing whether or not to put the annotation; a minimal sketch of such a detection loop is below.
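
For illustration only, a rough sketch of that detection controller, assuming local PVs pin themselves with a kubernetes.io/hostname node-affinity term as in the example above (the annotation key example.com/volume-unrecoverable is a made-up placeholder):

package volumewatch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// hostFromPV extracts the kubernetes.io/hostname the PV is pinned to, if any.
func hostFromPV(pv *corev1.PersistentVolume) string {
	if pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
		return ""
	}
	for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
		for _, expr := range term.MatchExpressions {
			if expr.Key == "kubernetes.io/hostname" && len(expr.Values) > 0 {
				return expr.Values[0]
			}
		}
	}
	return ""
}

// markUnrecoverableVolumes annotates local PVs whose backing node no longer exists.
func markUnrecoverableVolumes(ctx context.Context, kube kubernetes.Interface) error {
	nodes, err := kube.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	existing := map[string]bool{}
	for _, n := range nodes.Items {
		existing[n.Name] = true
	}

	pvs, err := kube.CoreV1().PersistentVolumes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pv := range pvs.Items {
		host := hostFromPV(&pv)
		if host == "" || existing[host] {
			continue // not node-pinned, or the node is still there
		}
		// The node backing this local volume is gone: flag the PV.
		patch := []byte(`{"metadata":{"annotations":{"example.com/volume-unrecoverable":"true"}}}`)
		_, err := kube.CoreV1().PersistentVolumes().Patch(ctx, pv.Name, types.MergePatchType, patch, metav1.PatchOptions{})
		if err != nil {
			return err
		}
	}
	return nil
}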

From there, pushing an annotation onto unrecoverable volumes sounds like the easiest way to go for me. The operator then has an opportunity to run a playbook (see the sketch after this list):

  • look for volumes marked unrecoverable
  • find the corresponding pods
  • push the corresponding pods into the CassandraDatacenter nodeReplacements list
  • remove the corresponding PVC, if any
  • remove the PV
  • run the replaceNodes logic as usual
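
Outside the operator, the same playbook could be driven with a small client. A minimal sketch, assuming the recent cass-operator API module path and that the CassandraDatacenterSpec field for this is ReplaceNodes (names and the helper itself are illustrative, not an existing function):

package volumewatch

import (
	"context"

	cassdcapi "github.com/k8ssandra/cass-operator/apis/cassandra/v1beta1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// requestReplacement asks cass-operator to replace a pod whose local volume is gone,
// then deletes the stale PVC so the StatefulSet can recreate it on a schedulable node.
func requestReplacement(ctx context.Context, c client.Client, dc *cassdcapi.CassandraDatacenter, podName, pvcName string) error {
	// 1. Add the pod to spec.replaceNodes so the operator runs its replace logic.
	patch := client.MergeFrom(dc.DeepCopy())
	dc.Spec.ReplaceNodes = append(dc.Spec.ReplaceNodes, podName)
	if err := c.Patch(ctx, dc, patch); err != nil {
		return err
	}

	// 2. Remove the PVC; for local storage the orphaned PV is cleaned up separately.
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: pvcName, Namespace: dc.Namespace},
	}
	return client.IgnoreNotFound(c.Delete(ctx, pvc))
}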

I'm new to the code base, but from a quick look at it, this part looks to be already done: https://github.com/k8ssandra/cass-operator/blob/master/pkg/reconciliation/check_nodes.go#L168

Then standard Kube scheduling does the rest of the job:

  • create a new PV/PVC on a new node
  • schedule the replacement pod on the new node

And the job should be done.

I'm confused because all of the logic looks to already be present in the code, but according to @burmanm, using the volume annotation is

a vSphere Kubernetes distribution only feature

and I'm being advised to re-implement this logic on my side, manipulating the CassandraDatacenter replaceNodes myself and also managing PV/PVC removal.

Is there something wrong with the existing code? Why keep it vSphere-only? Is the pattern wrong, or worse than doing everything "manually"? Is there a simple way to trick the system into thinking I'm running vSphere? What would the side effects be?

github-vincent-miszczak avatar Sep 03 '21 08:09 github-vincent-miszczak

I made a mistake when sharing my logs yesterday. I had not removed the PVC manually after declaring .spec.replaceNodes (though I did do this during all my previous tests), so the pod was not rescheduled and some logs are missing. After removing the PVC, the logs show:

{"level":"info","ts":1630661961.1201,"logger":"reconciliation_handler","msg":"calling Management API start node - POST /api/v0/lifecycle/start","requestNamespace":"default","requestName":"test","loopID":"15485a37-afcd-4bda-985e-0a80bde6ec82","namespace":"default","datacenterName":"test","clusterName":"k8ssandra","pod":"k8ssandra-test-eu-west-1a-sts-0","podIP":"10.220.33.138","replaceIP":""}

I did not pay attention during my first tests, but I think the cause of the issue is there: replaceIP is for some reason empty, so the new pod never tries to replace the old one in the ring.

So it seems that for some reason https://github.com/k8ssandra/cass-operator/blob/master/pkg/reconciliation/reconcile_racks.go#L1757 returns an empty IP address.

I've been reading https://github.com/jetstack/navigator/issues/319 and it suggests that the host ID can be used. If that's the case, there is no need to try to get an IP address; the old hostID is already known.

I'll dig into this.

github-vincent-miszczak avatar Sep 03 '21 10:09 github-vincent-miszczak

for each list of volumes by node, if the node does not exist in the list of nodes, then the corresponding volumes are unrecoverable

Are you referring to Cassandra nodes or k8s worker nodes? I assume the latter but want to be sure. If it is the latter, am I correct in thinking that the criterion for determining that a volume is unrecoverable is the worker node being gone?

jsanda avatar Sep 03 '21 14:09 jsanda

Are you referring to Cassandra nodes or k8s worker nodes?

Kube nodes, per the previous statement. I edited it to make this clear. In my case, with ephemeral storage at least, the node being gone means the data is also gone. By gone, I mean the node is no longer in the list returned by kubectl get nodes, for instance.

There are cases where the node is not ready, but that does not mean it is gone.

github-vincent-miszczak avatar Sep 03 '21 14:09 github-vincent-miszczak

Thanks for the clarification. The operator could certainly handle this scenario.

If it is common for k8s nodes to be removed, would network storage be better/easier?

jsanda avatar Sep 03 '21 14:09 jsanda

We run high-IOPS, ultra-low-latency workloads. For those, network storage is not an option. We also have other workloads that can accommodate network storage, but that's not the point of this issue.

github-vincent-miszczak avatar Sep 03 '21 15:09 github-vincent-miszczak

Digging into the empty replacement IP address issue, I ran a patched version of management-api that removes the check for a valid IPv4 address and a patched version of cass-operator that directly uses the hostID instead of its (empty) IP address. Cassandra does not accept a hostID:

INFO  [main] 2021-09-03 13:48:45,784 CassandraDaemon.java:640 - JVM Arguments: [-XX:+UnlockDiagnosticVMOptions, -XX:+AlwaysPreTouch, -Dcassandra.disable_auth_caches_remote_configuration=false, -Dcassandra.force_default_indexing_page_size=false, -Dcassandra.join_ring=true, -Dcassandra.load_ring_state=true, -Dcassandra.write_survey=false, -XX:+DebugNonSafepoints, -ea, -XX:GuaranteedSafepointInterval=300000, -XX:+HeapDumpOnOutOfMemoryError, -Dio.netty.eventLoop.maxPendingTasks=65536, -Djava.net.preferIPv4Stack=true, -Djdk.nio.maxCachedBufferSize=1048576, -Dsun.nio.PageAlignDirectMemory=true, -Xss256k, -XX:+PerfDisableSharedMem, -XX:+PreserveFramePointer, -Dcassandra.printHeapHistogramOnOutOfMemoryError=false, -XX:+ResizeTLAB, -XX:-RestrictContended, -XX:StringTableSize=1000003, -XX:-UseBiasedLocking, -XX:+UseNUMA, -XX:+UseThreadPriorities, -XX:+UseTLAB, -Dcom.sun.management.jmxremote.authenticate=false, -Dcassandra.jmx.local.port=7199, -Dcassandra.system_distributed_replication_dc_names=test, -Dcassandra.system_distributed_replication_per_dc=3, -XX:G1RSetUpdatingPauseTimePercent=5, -XX:MaxGCPauseMillis=500, -XX:+UseG1GC, -XX:+ParallelRefProcEnabled, -Djdk.attach.allowAttachSelf=true, --add-exports=java.base/jdk.internal.misc=ALL-UNNAMED, --add-opens=java.base/jdk.internal.module=ALL-UNNAMED, --add-exports=java.base/jdk.internal.ref=ALL-UNNAMED, --add-exports=java.base/jdk.internal.perf=ALL-UNNAMED, --add-exports=java.base/sun.nio.ch=ALL-UNNAMED, --add-exports=java.management.rmi/com.sun.jmx.remote.internal.rmi=ALL-UNNAMED, --add-exports=java.rmi/sun.rmi.registry=ALL-UNNAMED, --add-exports=java.rmi/sun.rmi.server=ALL-UNNAMED, --add-opens=jdk.management/com.sun.management.internal=ALL-UNNAMED, -Dio.netty.tryReflectionSetAccessible=true, -Xlog:gc=info,heap*=trace,age*=debug,safepoint=info,promotion*=trace:file=/opt/cassandra/logs/gc.log:time,uptime,pid,tid,level:filecount=10,filesize=10485760, -Xms1902M, -Xmx1902M, -XX:CompileCommandFile=/opt/cassandra/conf/hotspot_compiler, -javaagent:/opt/cassandra/lib/jamm-0.3.2.jar, -Dcassandra.jmx.remote.port=7199, -Dcom.sun.management.jmxremote.rmi.port=7199, -Dcom.sun.management.jmxremote.authenticate=true, -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password, -Djava.library.path=/opt/cassandra/lib/sigar-bin, -javaagent:/opt/metrics-collector/lib/datastax-mcac-agent.jar, -javaagent:/opt/management-api/datastax-mgmtapi-agent-0.1.0-SNAPSHOT.jar, -Dcassandra.libjemalloc=/usr/local/lib/libjemalloc.so, -XX:OnOutOfMemoryError=kill -9 %p, -Dlogback.configurationFile=logback.xml, -Dcassandra.logdir=/opt/cassandra/logs, -Dcassandra.storagedir=/opt/cassandra/data, -Dcassandra.server_process, -Dcassandra.skip_default_role_setup=true, -Ddb.unix_socket_file=/tmp/cassandra.sock, -Dcassandra.replace_address_first_boot=f59da36c-b3bc-46ad-a83e-d79a466cdb1e]
java.lang.RuntimeException: Replacement host name could not be resolved or scope_id was specified for a global IPv6 address
	at org.apache.cassandra.config.DatabaseDescriptor.getReplaceAddress(DatabaseDescriptor.java:1563)
	at org.apache.cassandra.service.StorageService.isReplacing(StorageService.java:859)
	at org.apache.cassandra.config.DatabaseDescriptor.getReplaceAddress(DatabaseDescriptor.java:1558)

So retrieving the IPv4 address needs to work. I called api/v0/metadata/endpoints manually to check the data: metadata_endpoints.json.txt

Edit: it looks like the code expects to read NATIVE_TRANSPORT_ADDRESS, but there is no such field. There is NATIVE_ADDRESS_AND_PORT; I will make some tries with it.

github-vincent-miszczak avatar Sep 03 '21 15:09 github-vincent-miszczak

I made a POC updating the metadata field used to get the IP. It's working fine :) https://github.com/github-vincent-miszczak/cass-operator/pull/1

Don't take this as a definitive patch; I have no idea which field is the correct one to use and still need to study it. You probably know better than me which one is correct.

This issue started as a question, but it should now also be labeled as a bug or equivalent. I hope it will be fixed soon.

Also, my questions remain about having an automated workflow based on volume annotations, as it looks to be already coded but limited to vSphere. I'd like to use it.

github-vincent-miszczak avatar Sep 03 '21 17:09 github-vincent-miszczak

Well, thinking a bit more about this, there's an upcoming Kubernetes feature called Volume Health Monitoring, and if you have a proper CSI driver, it should be able to automatically notify the pod that the volume has died:

https://kubernetes.io/docs/concepts/storage/volume-health-monitoring/

I think building something around this feature would make more sense - assuming there's anything even needed on the cass-operator side.

As for your fix, it does not quite work: NATIVE_ADDRESS_AND_PORT was only introduced in 4.0.0 and is not available in 3.11.*, for example, so there need to be some additional version checks.
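
For illustration, the lookup would have to be version-aware, something along these lines (field names other than NATIVE_TRANSPORT_ADDRESS and NATIVE_ADDRESS_AND_PORT, such as RPC_ADDRESS, are assumptions about the endpoint metadata and would need checking against what each Cassandra version actually reports):

import "net"

// replaceAddress picks a usable replacement address from an endpoint-metadata map,
// trying newer field names first and stripping an optional :port suffix.
func replaceAddress(endpoint map[string]string) string {
	for _, key := range []string{"NATIVE_TRANSPORT_ADDRESS", "NATIVE_ADDRESS_AND_PORT", "RPC_ADDRESS"} {
		v, ok := endpoint[key]
		if !ok || v == "" {
			continue
		}
		if host, _, err := net.SplitHostPort(v); err == nil {
			return host // the value carried a port, e.g. "10.220.35.109:9042"
		}
		return v
	}
	return ""
}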

burmanm avatar Sep 09 '21 06:09 burmanm

Thank you for pointing out this feature. It should help with dealing with failed hosts/volumes. I still need to figure out how to handle the health change events that pods will receive. I'll try to test this; I need to ensure the CSI driver I use is compliant (I don't use the more standard local provisioner because it does not play nicely with the Cluster Autoscaler). I agree this makes more sense in the long term if there is no need for a change in the operator.

Having comments on why https://github.com/k8ssandra/cass-operator/blob/master/pkg/reconciliation/check_nodes.go#L168 is not used would still be interesting, IMO.

I didn't make a fix for the current issue; I made a POC, meaning it illustrates the issue and gives clues for a fix. I renamed the draft to make this even clearer.

I'll report back on my results, hoping I'll be able to activate what's needed on EKS.

github-vincent-miszczak avatar Sep 10 '21 09:09 github-vincent-miszczak

I've been looking at Kubernetes Volume Health Monitoring, and it does not look like it can help with local storage:

  • it works with CSI drivers; local storage, including hostPath and local, is not CSI-managed, so the controller does not apply
  • what it does is send health events on PVs/PVCs; those events would still need to be handled by something to trigger a CassandraDatacenter node replacement

github-vincent-miszczak avatar Sep 27 '21 14:09 github-vincent-miszczak

Reading those events is not necessarily problematic. It's hopefully a standard feature in later versions of Kubernetes, so building a feature around it is not a bad idea. The question is then whether hostPath/local can replicate/emulate this behaviour, so we wouldn't need to create multiple implementations, and how we avoid rescheduling onto the same degraded PV.

burmanm avatar Sep 27 '21 15:09 burmanm

The volume condition subsystem looks to be exclusively reserved to the CSI world, but I agree related events can be mimicked for other storage: https://github.com/container-storage-interface/spec/blob/master/csi.proto

The current external controller is the one handling events on PVCs: https://github.com/kubernetes-csi/external-health-monitor/blob/master/pkg/csi-handler/pv_checker.go#L117. At the moment it looks to use the constant reason VolumeConditionAbnormal in the event.

A custom controller can generate those events.
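
For instance, with client-go's EventRecorder such a controller could post the event on the PVC directly (a minimal sketch; the reason string mirrors the external-health-monitor linked above, everything else is illustrative):

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// reportAbnormalVolume emits a warning event on a PVC, mimicking the CSI health monitor.
func reportAbnormalVolume(recorder record.EventRecorder, pvc *corev1.PersistentVolumeClaim, msg string) {
	recorder.Event(pvc, corev1.EventTypeWarning, "VolumeConditionAbnormal", msg)
}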

How would you see those events being handled? Another standalone controller responding to them by patching the CassandraDatacenter, something built into cass-operator directly, or something else?

github-vincent-miszczak avatar Sep 27 '21 16:09 github-vincent-miszczak

Well, reading this implementation: https://github.com/kubernetes-csi/external-health-monitor, I think it only adds an event to a pod with a not-so-structured message. So if you had your own controller for local/hostPath, it would not be an issue to create a similarly simple process. I'm not convinced this will live past the Alpha stage, though; it seems very lacking to me.

As for catching these, client-go provides methods to listen for corev1.Events. We only record them in cass-operator at the moment, but there's no reason we couldn't also listen to them for objects that we manage (such as pods, or even PVCs).

https://github.com/kubernetes/client-go/blob/7cbd2d5c7a8ca1e218942af67abe8b808603c4fd/tools/record/event_test.go

Instead of setting the replaceNodes parameter, the alternative is to listen to those events, create a reconcile request for the owner of that pod (the CassandraDatacenter), and then, in the reconcile phase, check whether there are any events that should be taken care of.
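
A rough controller-runtime sketch of that wiring (builder and handler signatures vary between controller-runtime versions; this follows the v0.7-era API, and lookupDatacenterForObject is a hypothetical helper that maps the involved pod/PVC back to its datacenter, e.g. via a datacenter label):

import (
	cassdcapi "github.com/k8ssandra/cass-operator/apis/cassandra/v1beta1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

// SetupWithManager watches warning Events in addition to the CassandraDatacenter itself.
func (r *CassandraDatacenterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&cassdcapi.CassandraDatacenter{}).
		Watches(&source.Kind{Type: &corev1.Event{}},
			handler.EnqueueRequestsFromMapFunc(func(obj client.Object) []reconcile.Request {
				ev, ok := obj.(*corev1.Event)
				if !ok || ev.Type != corev1.EventTypeWarning {
					return nil
				}
				// Map the involved pod/PVC back to its datacenter (hypothetical helper).
				dcName, found := lookupDatacenterForObject(ev.InvolvedObject)
				if !found {
					return nil
				}
				return []reconcile.Request{{NamespacedName: types.NamespacedName{
					Namespace: ev.InvolvedObject.Namespace,
					Name:      dcName,
				}}}
			})).
		Complete(r)
}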

burmanm avatar Sep 28 '21 10:09 burmanm

We can investigate this after #338 is also done. We have to parse pod events in cass-operator to understand all these processes.

burmanm avatar Jul 26 '22 09:07 burmanm

Closing this, since the behavior of replace has changed since this ticket was created. The new way to do it is through a CassandraTask: https://github.com/k8ssandra/cass-operator/blob/master/tests/testdata/tasks/replace_node_task.yaml

That removes the PVCs and takes care of all the necessary hoops.
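
For reference, a replace task along the lines of that test file looks roughly like this (names, namespace, and the target pod are placeholders; the linked YAML in the repo is the authoritative example):

apiVersion: control.k8ssandra.io/v1alpha1
kind: CassandraTask
metadata:
  name: replace-node
spec:
  datacenter:
    name: dc1
    namespace: my-namespace
  jobs:
    - name: replace-pod
      command: replacenode
      args:
        pod_name: cluster1-dc1-default-sts-2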

burmanm avatar Feb 03 '23 14:02 burmanm