Scylla cluster gets broken if the prefer_local=true option is set in the cassandra-rackdc.properties file
Describe the bug
If we set the "prefer_local" option to "true" in the "scylla-config" ConfigMap, like here:
apiVersion: v1
kind: ConfigMap
metadata:
  name: scylla-config
  namespace: scylla
data:
  cassandra-rackdc.properties: |-
    prefer_local = true
  scylla.yaml: |
    alternator_enforce_authorization: false
    auto_bootstrap: true
    client_encryption_options:
      enabled: false
    experimental: true
and then roll out the Scylla cluster, we get the following errors:
WARN 2021-09-24 18:30:54,318 [shard 0] cdc - Could not retrieve CDC streams with timestamp 2021/09/24 15:30:30 upon gossip event. Reason: "Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1". Action: continuing to retry in the background.
...
21:32:27 DEBUG sdcm.cluster:cluster.py:1345 INFO 2021-09-24 17:01:25,314 [shard 0] gossip - InetAddress 10.96.232.54 is now UP, status = NORMAL
21:32:27 DEBUG sdcm.cluster:cluster.py:1345 INFO 2021-09-24 17:01:25,444 [shard 0] gossip - InetAddress 10.96.232.54 is now DOWN, status = NORMAL
21:32:27 DEBUG sdcm.cluster:cluster.py:1345 INFO 2021-09-24 17:01:25,444 [shard 0] gossip - InetAddress 10.96.77.136 is now DOWN, status = NORMAL
"nodetool status" command shows following:
Datacenter: dc-1
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns  Host ID                               Rack
DN  10.96.232.54  ?          256     ?     20eb5a74-f8b9-47f8-b0df-7dfbca268b05  kind
UN  10.96.77.136  247.63 KB  256     ?     7b5cbd28-2754-4662-b6cf-1abc9debe9bc  kind
DN  10.96.49.105  ?          256     ?     29eefa5b-233d-4b40-8db7-1d568aaf3288  kind
The output above differs depending on which node we query.
Setting the option back to "false" doesn't help.
To Reproduce
Steps to reproduce the behavior:
- Deploy operator
- Deploy Scylla
- Create a ConfigMap called "scylla-config"
- Add the "prefer_local=true" option to the ConfigMap as part of the "cassandra-rackdc.properties" file
- Roll out Scylla
- See error
Expected behavior
Scylla should keep working as usual.
Logs
kubernetes-dafbd317.tar.gz
Environment:
- Platform: any
- Kubernetes version: v1.21.1
- Scylla version: 4.4.4
- Scylla-operator version: v1.5.0
Additional context
It looks like Scylla tries to connect to other nodes using the preferred_ip address from system.peers, which, when prefer_local is true, is equal to 0.0.0.0:
connect(48, {sa_family=AF_INET, sin_port=htons(7000), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EINPROGRESS (Operation now in progress)
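You can check this on an affected node with cqlsh; the query below is plain CQL against the standard system.peers table (a minimal check, the actual output depends on your cluster):

cqlsh> SELECT peer, preferred_ip, rpc_address FROM system.peers;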
Some time ago we changed the listen address to 0.0.0.0 to allow running Scylla in environments with service meshes - #529.
Hi, I've had the same problem without using the operator. I changed replication from SimpleStrategy to NetworkTopologyStrategy and the snitch to GossipingPropertyFileSnitch, with a rackdc.properties file containing prefer_local=true. On the first shutdown and rollout with the new options things were OK, but after a full cluster stop and start I am getting the same error as you: cdc - Could not retrieve CDC streams with timestamp xxxxxxxxx upon gossip event. Reason: "Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1". I have the pod IP set as the listen address instead of 0.0.0.0. Is there a way to fix this without creating a new cluster?
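For reference, the keyspace change described above would look roughly like this (a sketch only: the keyspace name, datacenter name, and replication factor are placeholders, and the snitch itself is configured in scylla.yaml rather than via CQL):

-- run via cqlsh against the keyspace being migrated
ALTER KEYSPACE my_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc-1': 3};
-- plus, in scylla.yaml on each node:
--   endpoint_snitch: GossipingPropertyFileSnitch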
Just for the record, my problem was that the preferred_ip field in the system.peers table didn't get updated with each pod's new IP after rollout.
I'm using a ClusterIP for each Scylla instance. With SimpleStrategy the preferred_ip field was updated with the new pod IP.
I manually updated the table to use the ClusterIP in preferred_ip, restarted Scylla on each node, and the cluster is up again.
My config sets that ClusterIP in --broadcast-address and --broadcast-rpc-address, and the pod IP in --listen-address.
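For anyone in the same state, the manual workaround described here amounts to something like the following on each node (illustrative only: the IPs are placeholders, and direct writes to system.peers may not be permitted on every Scylla version, so treat this as a sketch of the commenter's procedure rather than a supported one):

-- point preferred_ip for each peer at its stable ClusterIP, then restart Scylla on that node
UPDATE system.peers SET preferred_ip = '10.96.0.10' WHERE peer = '10.96.232.54';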
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 30d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out
/lifecycle stale