
Scylla cluster gets broken if prefer_local=true option is set in c* rackdc properties file

Open vponomaryov opened this issue 4 years ago • 4 comments

Describe the bug
If we set the "prefer_local" option to "true" in the "scylla-config" ConfigMap, like here:

apiVersion: v1
kind: ConfigMap
metadata:
  name: scylla-config
  namespace: scylla
data:
  cassandra-rackdc.properties: |-
    prefer_local = true
  scylla.yaml: |
    alternator_enforce_authorization: false
    auto_bootstrap: true
    client_encryption_options:
      enabled: false
    experimental: true

and then roll out the Scylla cluster, we get the following errors:

WARN  2021-09-24 18:30:54,318 [shard 0] cdc - Could not retrieve CDC streams with timestamp 2021/09/24 15:30:30 upon gossip event. Reason: "Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1". Action: continuing to retry in the background.

...

21:32:27  DEBUG    sdcm.cluster:cluster.py:1345 INFO  2021-09-24 17:01:25,314 [shard 0] gossip - InetAddress 10.96.232.54 is now UP, status = NORMAL
21:32:27  DEBUG    sdcm.cluster:cluster.py:1345 INFO  2021-09-24 17:01:25,444 [shard 0] gossip - InetAddress 10.96.232.54 is now DOWN, status = NORMAL
21:32:27  DEBUG    sdcm.cluster:cluster.py:1345 INFO  2021-09-24 17:01:25,444 [shard 0] gossip - InetAddress 10.96.77.136 is now DOWN, status = NORMAL

"nodetool status" command shows following:

 Datacenter: dc-1
 ================
 Status=Up/Down
 |/ State=Normal/Leaving/Joining/Moving
 --  Address       Load       Tokens       Owns    Host ID                               Rack
 DN  10.96.232.54  ?          256          ?       20eb5a74-f8b9-47f8-b0df-7dfbca268b05  kind
 UN  10.96.77.136  247.63 KB  256          ?       7b5cbd28-2754-4662-b6cf-1abc9debe9bc  kind
 DN  10.96.49.105  ?          256          ?       29eefa5b-233d-4b40-8db7-1d568aaf3288  kind

The output above differs depending on which node we query.

Setting the option back to "false" doesn't help.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the operator
  2. Deploy Scylla
  3. Create a ConfigMap called "scylla-config"
  4. Add the "prefer_local=true" option to the ConfigMap as part of the "cassandra-rackdc.properties" file
  5. Roll out Scylla
  6. See the error
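As a rough sketch of step 4, the "cassandra-rackdc.properties" key of the ConfigMap ends up as a properties file that Scylla reads. The snippet below only writes the equivalent file locally to show the exact content involved; the path is illustrative, not the operator's actual mount point:

```shell
# Write the properties content carried by the ConfigMap to a local file
# (illustrative path; the operator mounts it into Scylla's config directory).
cat > /tmp/cassandra-rackdc.properties <<'EOF'
prefer_local=true
EOF

# Confirm the option that triggers the breakage is present.
grep '^prefer_local=true' /tmp/cassandra-rackdc.properties && echo "prefer_local enabled"
```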

Expected behavior
Scylla should continue working as usual.

Logs kubernetes-dafbd317.tar.gz

Environment:

  • Platform: any
  • Kubernetes version: v1.21.1
  • Scylla version: 4.4.4
  • Scylla-operator version: v1.5.0


vponomaryov avatar Sep 24 '21 18:09 vponomaryov

Looks like it tries to connect to other nodes using the preferred_ip address from system.peers, which is equal to 0.0.0.0 when prefer_local is true:

connect(48, {sa_family=AF_INET, sin_port=htons(7000), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EINPROGRESS (Operation now in progress)

Some time ago we changed the listen address to 0.0.0.0 to allow running Scylla in environments with service meshes - #529.
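To illustrate the failure mode, here is a minimal, hypothetical sketch (not Scylla's actual code) of how a node might pick the address to dial for inter-node traffic: when listen_address is 0.0.0.0, that sentinel gets persisted into system.peers.preferred_ip, and with prefer_local=true nodes then dial 0.0.0.0:7000, as seen in the connect() trace above.

```python
# Hypothetical model of peer-address selection under prefer_local.
# This is NOT Scylla's implementation; it only demonstrates the reported
# symptom: a stale 0.0.0.0 preferred_ip wins over the reachable peer IP.
from typing import Optional


def pick_peer_address(peer_ip: str, preferred_ip: Optional[str], prefer_local: bool) -> str:
    """Return the address a node would dial for inter-node traffic."""
    if prefer_local and preferred_ip is not None:
        return preferred_ip  # taken verbatim from system.peers
    return peer_ip


# With prefer_local=false the peer's broadcast address is used and things work.
assert pick_peer_address("10.96.232.54", "0.0.0.0", prefer_local=False) == "10.96.232.54"

# With prefer_local=true the stale 0.0.0.0 preferred_ip wins -- matching the
# connect() to 0.0.0.0:7000 in the strace output above.
assert pick_peer_address("10.96.232.54", "0.0.0.0", prefer_local=True) == "0.0.0.0"
```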

zimnx avatar Sep 28 '21 18:09 zimnx

Hi, I've had the same problem without using the operator. I changed replication from SimpleStrategy to NetworkTopologyStrategy and the snitch to GossipingPropertyFileSnitch, with a rackdc.properties file containing prefer_local=true. The first shutdown and rollout with the new options went fine, but after stopping and starting the whole cluster I am hitting the same error as you:

cdc - Could not retrieve CDC streams with timestamp xxxxxxxxx upon gossip event. Reason: "Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1".

I have the pod IP set as the listen address instead of 0.0.0.0. Is there a way to fix this without creating a new cluster?

juanramb avatar Feb 14 '22 15:02 juanramb

Just for the record, my problem was that the preferred_ip field in system.peers didn't get updated with each pod's new IP after the rollout.

I'm using a ClusterIP for each Scylla instance. With SimpleStrategy the preferred_ip field was updated with the new pod IP.

I manually updated the table to set preferred_ip to the ClusterIP, restarted Scylla on each node, and the cluster is up again.

My config sets that ClusterIP in --broadcast-address and --broadcast-rpc-address, and the pod IP in --listen-address.
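The repair described above can be sketched as a small helper that generates the per-peer CQL UPDATE statements to run before restarting Scylla on each node. The table and column names (system.peers, preferred_ip, peer) are real; the helper itself and the IP mapping are illustrative assumptions, not a tested procedure:

```python
# Illustrative sketch of the manual repair described above: point each peer's
# preferred_ip at its stable per-instance ClusterIP instead of the stale
# pod IP / 0.0.0.0. The peer -> ClusterIP mapping below is an invented example.
from typing import Dict, List


def repair_statements(peer_to_cluster_ip: Dict[str, str]) -> List[str]:
    """Generate the CQL UPDATEs to run on each node (then restart Scylla)."""
    return [
        f"UPDATE system.peers SET preferred_ip = '{cluster_ip}' WHERE peer = '{peer}';"
        for peer, cluster_ip in sorted(peer_to_cluster_ip.items())
    ]


stmts = repair_statements({"10.96.232.54": "10.100.0.11"})
assert stmts == [
    "UPDATE system.peers SET preferred_ip = '10.100.0.11' WHERE peer = '10.96.232.54';"
]
```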

juanramb avatar Feb 15 '22 09:02 juanramb

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out

/lifecycle stale