
ScyllaClusters joined across namespaces in e2e

Open tnozicka opened this issue 3 years ago • 4 comments

In one of our e2e runs I saw Scylla report live hosts from 2 distinct namespaces:

  Expected
      <[]string | len:7, cap:8>: ["10.101.211.39", "10.103.182.113", "10.104.198.68", "10.104.212.153", "10.105.244.156", "10.107.86.7", "10.109.55.176"]
  to have length 3
  In [It] at: github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/verify.go:127
e2e-namespaces/e2e-test-scyllacluster-kx44s-b4mlz/core_v1/services/basic-us-east-1-rack-0-1.yaml:27:    clusterIP: 10.109.55.176
e2e-namespaces/e2e-test-scyllacluster-kx44s-b4mlz/core_v1/services/basic-us-east-1-rack-0-2.yaml:27:    clusterIP: 10.101.211.39
e2e-namespaces/e2e-test-scyllacluster-kx44s-b4mlz/core_v1/services/basic-us-east-1-rack-0-0.yaml:27:    clusterIP: 10.104.198.68
e2e-namespaces/e2e-test-scyllacluster-9t28c-jdmfm/core_v1/services/basic-us-east-1-rack-1-0.yaml:27:    clusterIP: 10.104.212.153
e2e-namespaces/e2e-test-scyllacluster-9t28c-jdmfm/core_v1/services/basic-us-east-1-rack-0-1.yaml:27:    clusterIP: 10.105.244.156
e2e-namespaces/e2e-test-scyllacluster-9t28c-jdmfm/core_v1/services/basic-us-east-1-rack-0-2.yaml:27:    clusterIP: 10.107.86.7
e2e-namespaces/e2e-test-scyllacluster-9t28c-jdmfm/core_v1/services/basic-us-east-1-rack-0-0.yaml:27:    clusterIP: 10.103.182.113

https://github.com/scylladb/scylla-operator/runs/4819154541?check_suite_focus=true#step:12:816 https://github.com/scylladb/scylla-operator/suites/4940241257/artifacts/143046650
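To make the mismatch above easier to see, here is a minimal sketch that groups the reported live hosts by the namespace owning a Service with that clusterIP. The helper `groupByNamespace` is hypothetical (it is not part of the e2e suite); the IPs and namespaces are copied from the output above.

```go
package main

import (
	"fmt"
	"sort"
)

// groupByNamespace is a hypothetical helper (not part of the e2e suite).
// It groups each reported host IP under the namespace that owns a Service
// with that clusterIP. IPs with no known owner are grouped under "".
func groupByNamespace(liveHosts []string, ipToNamespace map[string]string) map[string][]string {
	groups := make(map[string][]string)
	for _, ip := range liveHosts {
		ns := ipToNamespace[ip] // "" when the IP is unknown
		groups[ns] = append(groups[ns], ip)
	}
	for ns := range groups {
		sort.Strings(groups[ns])
	}
	return groups
}

func main() {
	// clusterIPs taken from the Service dumps in this report.
	ipToNamespace := map[string]string{
		"10.109.55.176":  "e2e-test-scyllacluster-kx44s-b4mlz",
		"10.101.211.39":  "e2e-test-scyllacluster-kx44s-b4mlz",
		"10.104.198.68":  "e2e-test-scyllacluster-kx44s-b4mlz",
		"10.104.212.153": "e2e-test-scyllacluster-9t28c-jdmfm",
		"10.105.244.156": "e2e-test-scyllacluster-9t28c-jdmfm",
		"10.107.86.7":    "e2e-test-scyllacluster-9t28c-jdmfm",
		"10.103.182.113": "e2e-test-scyllacluster-9t28c-jdmfm",
	}
	// Live hosts as reported in the failed assertion.
	liveHosts := []string{
		"10.101.211.39", "10.103.182.113", "10.104.198.68", "10.104.212.153",
		"10.105.244.156", "10.107.86.7", "10.109.55.176",
	}

	groups := groupByNamespace(liveHosts, ipToNamespace)

	// Print namespaces in a stable order.
	namespaces := make([]string, 0, len(groups))
	for ns := range groups {
		namespaces = append(namespaces, ns)
	}
	sort.Strings(namespaces)
	for _, ns := range namespaces {
		fmt.Printf("%s: %d hosts %v\n", ns, len(groups[ns]), groups[ns])
	}
}
```

With the data above, the 7 live hosts split into 3 from one namespace and 4 from the other, which is why the length-3 expectation failed.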

tnozicka avatar Jan 17 '22 09:01 tnozicka

How come there was network connectivity between them? Aren't the clusters isolated?

slivne avatar Jan 17 '22 09:01 slivne

@slivne they aren't; it happened in our e2e. Our e2e minikube doesn't support NetworkPolicies by default; we would have to install a custom CNI. In production environments, users can configure a NetworkPolicy on their own to isolate Pods.

Having one would also hide a possible issue. The problem is that the clusters shouldn't know anything about each other (their configs should be separate), yet they managed to connect. Unless Scylla does discovery on its own, which afaik it doesn't, they shouldn't form a cluster.

zimnx avatar Jan 17 '22 09:01 zimnx

While going through the Kubernetes audit logs, I found that on upgrade one of the pods got a recycled PodIP that had previously belonged to the other cluster. Maybe gossip was using it for membership and that's how those clusters got connected. Normally Pod traffic is isolated between namespaces, but not in minikube.

tnozicka avatar Jan 17 '22 13:01 tnozicka

I am lowering the priority and scheduling it for 1.8. This has always been the case, so it's not a regression in the operator. Unfortunately, we use the ScyllaCluster name for cluster identity without the namespace. Migrating it in a backwards-compatible manner, while also addressing preexisting clusters, would take extreme effort. As we were already planning to set up mTLS for Scylla nodes, we are going to aim for that as the actual fix.

The workaround to avoid hitting this issue is to name your ScyllaClusters uniquely across namespaces, or to avoid upgrading (or otherwise replacing pods) for more than one cluster at a time.
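As a rough illustration of the first workaround, the sketch below flags ScyllaCluster names that are reused in more than one namespace. Everything here is hypothetical: the `clusterRef` type and `findNameCollisions` helper are not operator APIs, the cluster name `basic` is only inferred from the Service names in this report, and the `example`/`unique` entry is made up. In practice the input list would come from listing ScyllaClusters across all namespaces.

```go
package main

import (
	"fmt"
	"sort"
)

// clusterRef is a hypothetical type identifying a ScyllaCluster by
// namespace and name. It is not an operator API type.
type clusterRef struct {
	Namespace string
	Name      string
}

// findNameCollisions returns, for every cluster name used in more than one
// namespace, the sorted list of namespaces that use it.
func findNameCollisions(clusters []clusterRef) map[string][]string {
	seen := make(map[string]map[string]struct{})
	for _, c := range clusters {
		if seen[c.Name] == nil {
			seen[c.Name] = make(map[string]struct{})
		}
		seen[c.Name][c.Namespace] = struct{}{}
	}

	collisions := make(map[string][]string)
	for name, nsSet := range seen {
		if len(nsSet) < 2 {
			continue // name is unique across namespaces
		}
		for ns := range nsSet {
			collisions[name] = append(collisions[name], ns)
		}
		sort.Strings(collisions[name])
	}
	return collisions
}

func main() {
	clusters := []clusterRef{
		// "basic" is an assumption inferred from the Service names above.
		{Namespace: "e2e-test-scyllacluster-kx44s-b4mlz", Name: "basic"},
		{Namespace: "e2e-test-scyllacluster-9t28c-jdmfm", Name: "basic"},
		// Made-up entry showing a name that does not collide.
		{Namespace: "example", Name: "unique"},
	}

	for name, namespaces := range findNameCollisions(clusters) {
		fmt.Printf("ScyllaCluster name %q is used in multiple namespaces: %v\n", name, namespaces)
	}
}
```

A check like this only covers the naming half of the workaround; it says nothing about whether pods of two clusters are being replaced at the same time.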

tnozicka avatar Jan 19 '22 14:01 tnozicka

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out

/lifecycle stale

tracked in https://github.com/scylladb/scylla-operator/issues/928

tnozicka avatar Jun 24 '24 09:06 tnozicka