scylla-operator
ScyllaClusters joined across namespaces in e2e
In one of our e2e runs, I've seen Scylla report live hosts from 2 distinct namespaces:
Expected
<[]string | len:7, cap:8>: ["10.101.211.39", "10.103.182.113", "10.104.198.68", "10.104.212.153", "10.105.244.156", "10.107.86.7", "10.109.55.176"]
to have length 3
In [It] at: github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/verify.go:127
e2e-namespaces/e2e-test-scyllacluster-kx44s-b4mlz/core_v1/services/basic-us-east-1-rack-0-1.yaml:27: clusterIP: 10.109.55.176
e2e-namespaces/e2e-test-scyllacluster-kx44s-b4mlz/core_v1/services/basic-us-east-1-rack-0-2.yaml:27: clusterIP: 10.101.211.39
e2e-namespaces/e2e-test-scyllacluster-kx44s-b4mlz/core_v1/services/basic-us-east-1-rack-0-0.yaml:27: clusterIP: 10.104.198.68
e2e-namespaces/e2e-test-scyllacluster-9t28c-jdmfm/core_v1/services/basic-us-east-1-rack-1-0.yaml:27: clusterIP: 10.104.212.153
e2e-namespaces/e2e-test-scyllacluster-9t28c-jdmfm/core_v1/services/basic-us-east-1-rack-0-1.yaml:27: clusterIP: 10.105.244.156
e2e-namespaces/e2e-test-scyllacluster-9t28c-jdmfm/core_v1/services/basic-us-east-1-rack-0-2.yaml:27: clusterIP: 10.107.86.7
e2e-namespaces/e2e-test-scyllacluster-9t28c-jdmfm/core_v1/services/basic-us-east-1-rack-0-0.yaml:27: clusterIP: 10.103.182.113
https://github.com/scylladb/scylla-operator/runs/4819154541?check_suite_focus=true#step:12:816 https://github.com/scylladb/scylla-operator/suites/4940241257/artifacts/143046650
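To make the mismatch easier to see, here is a minimal Go sketch that groups the reported live hosts by the namespace owning a Service with that ClusterIP. The data is copied from the test output and service listing above; the function and variable names are made up for illustration and are not part of the operator's test suite.

```go
package main

import (
	"fmt"
	"sort"
)

// serviceNamespaces maps each Service ClusterIP from the listing above
// to the namespace it was collected from.
var serviceNamespaces = map[string]string{
	"10.109.55.176":  "e2e-test-scyllacluster-kx44s-b4mlz",
	"10.101.211.39":  "e2e-test-scyllacluster-kx44s-b4mlz",
	"10.104.198.68":  "e2e-test-scyllacluster-kx44s-b4mlz",
	"10.104.212.153": "e2e-test-scyllacluster-9t28c-jdmfm",
	"10.105.244.156": "e2e-test-scyllacluster-9t28c-jdmfm",
	"10.107.86.7":    "e2e-test-scyllacluster-9t28c-jdmfm",
	"10.103.182.113": "e2e-test-scyllacluster-9t28c-jdmfm",
}

// reportedHosts is the list of live hosts from the failed assertion.
var reportedHosts = []string{
	"10.101.211.39", "10.103.182.113", "10.104.198.68", "10.104.212.153",
	"10.105.244.156", "10.107.86.7", "10.109.55.176",
}

// groupByNamespace (hypothetical helper) buckets host IPs by owning namespace.
// IPs without a known Service end up under "<unknown>".
func groupByNamespace(hosts []string, ipToNamespace map[string]string) map[string][]string {
	grouped := make(map[string][]string)
	for _, host := range hosts {
		ns, ok := ipToNamespace[host]
		if !ok {
			ns = "<unknown>"
		}
		grouped[ns] = append(grouped[ns], host)
	}
	return grouped
}

func main() {
	grouped := groupByNamespace(reportedHosts, serviceNamespaces)

	// Sort namespaces so the output order is stable.
	namespaces := make([]string, 0, len(grouped))
	for ns := range grouped {
		namespaces = append(namespaces, ns)
	}
	sort.Strings(namespaces)

	for _, ns := range namespaces {
		fmt.Printf("%s: %d hosts %v\n", ns, len(grouped[ns]), grouped[ns])
	}
}
```

With the data above, 3 of the hosts belong to the first namespace and 4 to the second, which accounts for all 7 entries in the failing assertion.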
How come there was network connectivity between them? Aren't the clusters isolated?
@slivne they aren't; it happened in our e2e. Our e2e minikube doesn't support NetworkPolicies by default. We would have to install a custom CNI. On production envs, users may configure a NetworkPolicy on their own to isolate Pods.
Having one would also hide a possible issue. The problem is that they shouldn't know anything about each other - configs should be separated - yet they managed to connect. Unless Scylla does discovery on its own (afaik it doesn't), they shouldn't form a cluster.
Going through the Kubernetes audit logs, I found that on upgrade one of the pods got a recycled PodIP that had previously belonged to the other cluster. Maybe gossip was using it for membership and that's how those clusters got connected. Normally Pod traffic is isolated between namespaces, but not in minikube.
I am lowering the priority and scheduling it for 1.8. This has always been the case, so it's not a regression in the operator. Unfortunately, we use the ScyllaCluster name for cluster identity without the namespace. Migrating it in a backwards-compatible manner, while also addressing the preexisting clusters, would take extreme effort. As we were already planning to set up mTLS for Scylla nodes, we are going to aim for that as the actual fix.
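The collision can be illustrated with a small Go sketch. It is not operator code: `clusterRef`, `nameOnlyIdentity`, and `namespacedIdentity` are hypothetical names, and the cluster name `basic` is only inferred from the service names in the report. The namespace-qualified variant is shown purely for contrast; it is not the fix planned here.

```go
package main

import "fmt"

// clusterRef is a hypothetical stand-in for a ScyllaCluster's namespace and name.
type clusterRef struct {
	Namespace string
	Name      string
}

// nameOnlyIdentity models the behaviour described above:
// the namespace plays no part in the cluster identity.
func nameOnlyIdentity(c clusterRef) string {
	return c.Name
}

// namespacedIdentity is a hypothetical alternative that would keep
// same-named clusters in different namespaces apart.
func namespacedIdentity(c clusterRef) string {
	return c.Namespace + "/" + c.Name
}

func main() {
	// "basic" is an assumption inferred from service names such as
	// basic-us-east-1-rack-0-0.
	a := clusterRef{Namespace: "e2e-test-scyllacluster-kx44s-b4mlz", Name: "basic"}
	b := clusterRef{Namespace: "e2e-test-scyllacluster-9t28c-jdmfm", Name: "basic"}

	// Same identity: nodes reaching each other would see a matching cluster name.
	fmt.Println("name-only identities equal:", nameOnlyIdentity(a) == nameOnlyIdentity(b))

	// Distinct identities once the namespace is included.
	fmt.Println("namespaced identities equal:", namespacedIdentity(a) == namespacedIdentity(b))
}
```

Under these assumptions the first comparison is true and the second is false, which is why two same-named clusters are only kept apart by network isolation (or, in future, mTLS).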
The workaround to avoid hitting this issue is to name your ScyllaClusters uniquely across namespaces, or to avoid upgrading (or otherwise replacing pods of) more than one cluster at a time.
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 30d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out
/lifecycle stale
Tracked in https://github.com/scylladb/scylla-operator/issues/928