Scylla cluster gets broken if the prefer_local=true option is set in the cassandra-rackdc.properties file
Describe the bug
If we set the "prefer_local" option to "true" in the "scylla-config" ConfigMap, like here:
apiVersion: v1
kind: ConfigMap
metadata:
  name: scylla-config
  namespace: scylla
data:
  cassandra-rackdc.properties: |-
    prefer_local = true
  scylla.yaml: |
    alternator_enforce_authorization: false
    auto_bootstrap: true
    client_encryption_options:
      enabled: false
    experimental: true
and then roll out the Scylla cluster, we get the following errors:
WARN 2021-09-24 18:30:54,318 [shard 0] cdc - Could not retrieve CDC streams with timestamp 2021/09/24 15:30:30 upon gossip event. Reason: "Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1". Action: continuing to retry in the background.
...
21:32:27 DEBUG sdcm.cluster:cluster.py:1345 INFO 2021-09-24 17:01:25,314 [shard 0] gossip - InetAddress 10.96.232.54 is now UP, status = NORMAL
21:32:27 DEBUG sdcm.cluster:cluster.py:1345 INFO 2021-09-24 17:01:25,444 [shard 0] gossip - InetAddress 10.96.232.54 is now DOWN, status = NORMAL
21:32:27 DEBUG sdcm.cluster:cluster.py:1345 INFO 2021-09-24 17:01:25,444 [shard 0] gossip - InetAddress 10.96.77.136 is now DOWN, status = NORMAL
"nodetool status" command shows following:
Datacenter: dc-1
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns  Host ID                               Rack
DN  10.96.232.54  ?          256     ?     20eb5a74-f8b9-47f8-b0df-7dfbca268b05  kind
UN  10.96.77.136  247.63 KB  256     ?     7b5cbd28-2754-4662-b6cf-1abc9debe9bc  kind
DN  10.96.49.105  ?          256     ?     29eefa5b-233d-4b40-8db7-1d568aaf3288  kind
The output above differs depending on which node we query.
Setting the option back to "false" doesn't help.
To Reproduce
Steps to reproduce the behavior:
- Deploy operator
- Deploy Scylla
- Create a ConfigMap called "scylla-config"
- Add the "prefer_local=true" option to the ConfigMap as part of the "cassandra-rackdc.properties" file
- Roll out Scylla
- See error
Expected behavior
Scylla should keep working as usual.
Logs
kubernetes-dafbd317.tar.gz
Environment:
- Platform: any
- Kubernetes version: v1.21.1
- Scylla version: 4.4.4
- Scylla-operator version: v1.5.0
Additional context
It looks like Scylla tries to connect to other nodes using the preferred_ip address from system.peers, which, when prefer_local is true, is equal to 0.0.0.0:
connect(48, {sa_family=AF_INET, sin_port=htons(7000), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EINPROGRESS (Operation now in progress)
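You can check this on an affected node with cqlsh; the query below is plain CQL against the standard system.peers table (a minimal check, the actual output depends on your cluster):

cqlsh> SELECT peer, preferred_ip, rpc_address FROM system.peers;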
Some time ago we changed the listen address to 0.0.0.0 to allow running Scylla in environments with service meshes - #529.
Hi, I've had the same problem without using the operator. I changed replication from SimpleStrategy to NetworkTopologyStrategy and the snitch to GossipingPropertyFileSnitch, with a rackdc.properties file containing prefer_local=true. On the first shutdown and rollout with the new options things were OK, but after a full cluster stop and start I am getting the same error as you: cdc - Could not retrieve CDC streams with timestamp xxxxxxxxx upon gossip event. Reason: "Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1". I have the pod IP set as the listen address instead of 0.0.0.0. Is there a way to fix this without creating a new cluster?
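For reference, the keyspace change described above would look roughly like this (a sketch only: the keyspace name, datacenter name, and replication factor are placeholders, and the snitch itself is configured in scylla.yaml rather than via CQL):

-- run via cqlsh against the keyspace being migrated
ALTER KEYSPACE my_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc-1': 3};
-- plus, in scylla.yaml on each node:
--   endpoint_snitch: GossipingPropertyFileSnitch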
Just for the record, my problem was that the preferred_ip field in the system.peers table didn't get updated with each pod's new IP after rollout.
I'm using a ClusterIP for each Scylla instance. With SimpleStrategy the preferred_ip field was updated with the new pod IP.
I manually updated the table to use the ClusterIP in preferred_ip, restarted Scylla on each node, and the cluster is up again.
My config sets that ClusterIP in --broadcast-address and --broadcast-rpc-address, and the pod IP in --listen-address.
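For anyone in the same state, the manual workaround described here amounts to something like the following on each node (illustrative only: the IPs are placeholders, and direct writes to system.peers may not be permitted on every Scylla version, so treat this as a sketch of the commenter's procedure rather than a supported one):

-- point preferred_ip for each peer at its stable ClusterIP, then restart Scylla on that node
UPDATE system.peers SET preferred_ip = '10.96.0.10' WHERE peer = '10.96.232.54';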
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 30d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out
/lifecycle stale