Operator specifies incorrect service DNS when running in a separate namespace

Open neurodrone opened this issue 3 years ago • 2 comments

Bug Description

During one of its state transitions, the Operator attempts to open a DB connection to the Cockroach cluster running in the same Kubernetes cluster. This succeeds in every case but one:

  • The Operator is running in a different namespace from the CrdbCluster custom resource.

We see the following error when that happens:

   message: 'failed to create database connection: opening a DB connection failed
      testing db connection failed: lookup cockroachdb-public on x.x.x.x:53:
      no such host'
    status: Failed
    type: PartitionedUpdate

The Operator is trying to resolve the bare name cockroachdb-public, and no service with that name exists in the Operator's own namespace. Had it looked up cockroachdb-public.<namespace> instead, the lookup would have succeeded.

To Reproduce

Steps to reproduce:

  1. Deploy Operator in a namespace. (e.g. cockroach-operator-system)
  2. Deploy a CrdbCluster in a separate namespace (e.g. cockroach-cluster). Make sure that cockroach-cluster namespace is passed as an input in the WATCH_NAMESPACE environment variable of the Operator's container.
  3. A new Cockroach cluster should be spun up as a StatefulSet in the cockroach-cluster namespace.
  4. Perform an upgrade on the cluster. This can be done by updating the aforementioned CrdbCluster custom-resource by either changing its spec.cockroachDBVersion field or spec.image.name.
  5. Observe the above status message show up during the transition phase after the version check completes.

As a result of the above problem, the upgrade never gets initiated on the StatefulSet.
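
For reference, steps 2 and 4 might look roughly like the fragments below. This is only a sketch: the container name, the CrdbCluster name (cockroachdb, inferred from the cockroachdb-public service name in the error) and the version value are assumptions for illustration, and the apiVersion should be checked against the CRD installed in your cluster.

```yaml
# Fragment of the Operator's Deployment (step 2).
# The container name is hypothetical; only the env entry matters here.
spec:
  template:
    spec:
      containers:
      - name: cockroach-operator
        env:
        - name: WATCH_NAMESPACE
          value: cockroach-cluster
---
# Fragment of the CrdbCluster custom resource (step 4).
# Changing spec.cockroachDBVersion (or spec.image.name) triggers the upgrade.
apiVersion: crdb.cockroachlabs.com/v1alpha1  # assumed; verify against your CRD
kind: CrdbCluster
metadata:
  name: cockroachdb            # assumed from the cockroachdb-public service name
  namespace: cockroach-cluster
spec:
  cockroachDBVersion: vX.Y.Z   # placeholder; set to the target version
```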

Expected behavior

No error occurs during the PartitionedUpdate action, and actions such as a Cockroach server version upgrade complete successfully.

neurodrone • May 26 '22 18:05

For anyone hitting this issue who wants a quick workaround, you can:

  • Create a Service in the operator's namespace with the same name as your cluster's service
  • Create an Endpoints resource with the same name as that Service and point it at the actual IP of your CockroachDB service

Examples:

Service

apiVersion: v1
kind: Service
metadata:
  name: crdb-public
  namespace: cockroach-operator-system
spec:
  ports:
  - name: grpc
    port: 26258
    protocol: TCP
    targetPort: 26258
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  - name: sql
    port: 26257
    protocol: TCP
    targetPort: 26257
  type: ClusterIP

Endpoints

apiVersion: v1
kind: Endpoints
metadata:
  name: crdb-public
  namespace: cockroach-operator-system
subsets:
- addresses:
  - ip: 172.30.152.87  # your CockroachDB service IP
  ports:
  - name: grpc
    port: 26258
    protocol: TCP
  - name: http
    port: 8080
    protocol: TCP
  - name: sql
    port: 26257
    protocol: TCP

It's pretty nasty, but it gets the job done.

ethan-gallant • Jul 29 '22 17:07

An improvement over the workaround described by @ethan-gallant is to create a Service of type ExternalName, like this:

apiVersion: v1
kind: Service
metadata:
  name: cockroachdb-public
  namespace: cockroach-operator-system
spec:
  type: ExternalName
  sessionAffinity: None
  externalName: cockroachdb-public.cockroachdb.svc.cluster.local
  internalTrafficPolicy: Cluster

With this you don't need to specify any IPs or ports.

It's still a workaround, and we are looking forward to the https://github.com/cockroachdb/cockroach-operator/pull/907 fix. 😃

Mgonand • Aug 30 '22 14:08

I have the same problem. Let me know if you find a solution.

javadnasrolahi • Dec 07 '22 15:12