
Node availability in k8s

george2asenov opened this issue 3 years ago

Hello,

We are running a RabbitMQ cluster of 3 nodes with the RabbitMQ cluster operator on an on-premises Kubernetes cluster (rook-ceph storage, calico-vxlan networking). From time to time we hit a problem where everything in the cluster appears to work as expected and clients connect successfully, but the RabbitMQ cluster (or a single node) does not confirm receiving messages. After we restart the node causing the issue, everything works properly again.

As far as I can see there are no liveness/readiness checks on the nodes, which means that if a node does not respond, the Kubernetes service in front of it continues to send traffic to it.

On a test installation we currently have a network partition where 2 of the 3 nodes are paused (according to the documentation, the one that is left alone should be ???). Apart from the web interface there is no evidence that there is an issue. The nodes also appear to be working in "split brain" mode despite the setting cluster_partition_handling = pause_minority: both partitions have active connections and queues and keep receiving messages.

We have been trying to find a reliable way to set up health checks, but none of the checks report a problem. Apart from the web management UI, only two places show that there is an issue, and one of them is deprecated while the other is not suitable for a health check.

The deprecated check:

rabbitmq@rabbitmq-server-1:/$ rabbitmq-diagnostics node_health_check
This command is DEPRECATED and will be removed in a future version. It performs intrusive, opinionated health checks and requires a fully booted node. Use one of the options covered in https://www.rabbitmq.com/monitoring.html#health-checks instead.
Timeout: 70 seconds ...
Checking health of node [email protected] ...
Error:
Error: health check failed. Message: cluster partition in effect: ['[email protected]', '[email protected]']

The other one that indicates an issue is:

curl  --silent  -u secret:secret http://192.168.161.235:15672/api/nodes/[email protected]|jq .running
false
 curl  --silent  -u secret:secret http://192.168.161.235:15672/api/nodes/[email protected]|jq .running
true
 curl  --silent  -u secret:secret http://192.168.161.235:15672/api/nodes/[email protected]|jq .running
false
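
For reference, the same /api/nodes endpoint also reports which peers a node considers itself partitioned from, so a scripted check could look at both fields together. A minimal sketch (same placeholder credentials and address as above; the partitions field is part of the standard node object returned by the management API):

curl --silent -u secret:secret http://192.168.161.235:15672/api/nodes/[email protected] \
  | jq '{running: .running, partitions: .partitions}'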

Here are the results of the recommended health checks:

rabbitmq@rabbitmq-server-1:/$ rabbitmq-diagnostics check_running
Checking if RabbitMQ is running on node [email protected] ...
RabbitMQ on node [email protected] is fully booted and running
rabbitmq@rabbitmq-server-1:/$
rabbitmq@rabbitmq-server-1:/$ rabbitmq-diagnostics ping
Will ping [email protected]. This only checks if the OS process is running and registered with epmd. Timeout: 60000 ms.
Ping succeeded
rabbitmq@rabbitmq-server-1:/$
rabbitmq@rabbitmq-server-1:/$ rabbitmq-plugins -q list --enabled --minimal
rabbitmq_management
rabbitmq_peer_discovery_k8s
rabbitmq_prometheus
rabbitmq_shovel
rabbitmq_shovel_management
rabbitmq@rabbitmq-server-1:/

So there is no way to have a reliable readiness check that would let Kubernetes at least stop sending client traffic to the non-working nodes.
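
For reference, the operator does allow the pod template to be patched through spec.override.statefulSet, so in principle the default probe could be replaced with a stricter exec-based one. The following is an untested sketch only: the override layout follows the RabbitmqCluster API, the checks shown (check_running, check_local_alarms) are taken from the suggestions in the monitoring guide, and whether any of them would actually catch the state described above is exactly the open question.

# fragment of a RabbitmqCluster spec; merged by the operator with its defaults
spec:
    override:
        statefulSet:
            spec:
                template:
                    spec:
                        containers:
                        - name: rabbitmq
                          readinessProbe:
                              exec:
                                  command:
                                  - /bin/sh
                                  - -c
                                  - rabbitmq-diagnostics -q check_running && rabbitmq-diagnostics -q check_local_alarms
                              initialDelaySeconds: 10
                              periodSeconds: 10
                              timeoutSeconds: 5
                              failureThreshold: 3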

Here is the configuration we use:

enabled_plugins

[rabbitmq_peer_discovery_k8s,rabbitmq_prometheus,rabbitmq_management,rabbitmq_shovel,rabbitmq_shovel_management].
operatorDefaults.conf
cluster_formation.peer_discovery_backend             = rabbit_peer_discovery_k8s
cluster_formation.k8s.host                           = kubernetes.default
cluster_formation.k8s.address_type                   = hostname
cluster_partition_handling                           = pause_minority
queue_master_locator                                 = min-masters
disk_free_limit.absolute                             = 2GB
cluster_formation.randomized_startup_delay_range.min = 0
cluster_formation.randomized_startup_delay_range.max = 60
cluster_name                                         = rabbitmq
userDefinedConfiguration.conf
total_memory_available_override_value = 858993460
cluster_partition_handling            = pause_minority
vm_memory_high_watermark_paging_ratio = 0.99
disk_free_limit.relative              = 1.0
vm_memory_high_watermark.relative     = 0.9
policies:
expires: 3600000
ha-mode: exactly
ha-params: 2
ha-promote-on-failure: when-synced
ha-promote-on-shutdown: when-synced
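
For completeness, these keys correspond to a classic queue mirroring policy; declared by hand it would look roughly like this (the policy name ha-two and the queue pattern "^" are placeholders, not the values actually in use):

rabbitmqctl set_policy ha-two "^" \
    '{"ha-mode":"exactly","ha-params":2,"ha-promote-on-failure":"when-synced","ha-promote-on-shutdown":"when-synced","expires":3600000}' \
    --apply-to queues

And the RabbitmqCluster resource: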
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
    name: rabbitmq
    namespace: staging
spec:
    replicas: 3
    image: rabbitmq:3.8.25-management
    service:
        type: LoadBalancer
    override:
        service:
            spec:
                loadBalancerIP: XXX.XXX.XXX.XXX
    resources:
        requests:
            cpu: "100m"
            memory: 1Gi
        limits:
            cpu: "1000m"
            memory: 1Gi
    affinity:
        podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: app.kubernetes.io/part-of
                  operator: In
                  values:
                  - rabbitmq
              topologyKey: kubernetes.io/hostname
    rabbitmq:
        additionalConfig: |
            cluster_partition_handling = pause_minority
            vm_memory_high_watermark_paging_ratio = 0.99
            disk_free_limit.relative = 1.0
            vm_memory_high_watermark.relative = 0.9
        additionalPlugins:
            - rabbitmq_shovel
            - rabbitmq_shovel_management
    persistence:
        storage: "10Gi"

Versions: rabbitmq:3.8.25-management, rabbitmqoperator/cluster-operator:1.9.0

Any suggestions, recommendations, questions appreciated.

george2asenov · Jul 12 '22 09:07

@george2asenov The operator does define a readiness probe but does not define a liveness probe. We have a TCP readiness probe that checks the AMQP port (https://github.com/rabbitmq/cluster-operator/blob/main/internal/resource/statefulset.go#L586-L603):

      readinessProbe:
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: amqp
          timeoutSeconds: 5

So if a node does not respond, the k8s service won't send traffic to it.

Regarding a liveness probe, we decided not to define one because k8s will restart your container when the container's main process (in our case the rabbitmq server process) crashes. We think that's enough for most use cases.

You said that even though cluster_partition_handling is set to pause_minority, both partitions have active connections and queues and receive messages. This shouldn't happen: pause_minority means that the paused node will no longer listen on any ports (see https://www.rabbitmq.com/partitions.html#:~:text=In%20pause%2Dminority%20mode%20RabbitMQ,availability%20from%20the%20CAP%20theorem and https://www.rabbitmq.com/partitions.html#pause-minority). If this is not what you observed and you still have access to the logs, could you please attach the logs of the minority node from when the partition was happening so the team can look into it further?

ChunyiLyu · Jul 14 '22 12:07

This issue has been marked as stale due to 60 days of inactivity. Stale issues will be closed after a further 30 days of inactivity; please remove the stale label in order to prevent this occurring.

github-actions[bot] · Sep 13 '22 00:09

Closing stale issue due to further inactivity.

github-actions[bot] · Oct 13 '22 00:10