cockroach-operator
Operator specifies incorrect service DNS when running in a separate namespace
Bug Description
We have observed that, during one of its state transitions, the Operator attempts to establish a DB connection to the Cockroach cluster running within the same Kubernetes cluster. It succeeds in all cases but one:
- The Operator is running in a different namespace from the CrdbCluster custom resource.
We see the following error when that happens:
message: 'failed to create database connection: opening a DB connection failed
testing db connection failed: lookup cockroachdb-public on x.x.x.x:53:
no such host'
status: Failed
type: PartitionedUpdate
It is attempting to look up the DNS record for the cockroachdb-public service, which doesn't exist in the Operator's own namespace. If it looked up cockroachdb-public.<namespace> instead, the lookup would succeed.
To Reproduce
Steps to reproduce:
- Deploy the Operator in a namespace (e.g. cockroach-operator-system).
- Deploy a CrdbCluster in a separate namespace (e.g. cockroach-cluster). Make sure that the cockroach-cluster namespace is passed as an input in the WATCH_NAMESPACE environment variable of the Operator's container.
- A new Cockroach cluster should be spun up as a StatefulSet in the cockroach-cluster namespace.
- Perform an upgrade on the cluster. This can be done by updating the aforementioned CrdbCluster custom resource by changing either its spec.cockroachDBVersion field or spec.image.name.
- Observe the above status message show up during the transition phase after the version check completes.
As a result of the above problem, the upgrade never gets initiated on the StatefulSet.
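The namespace setup from the reproduction steps can be sketched roughly as follows. This is only an illustrative fragment, not the Operator's shipped manifest: the container name, the CrdbCluster name (chosen to match the cockroachdb-public service in the error), the apiVersion, and the image tag are assumptions, and required fields not mentioned in this report are omitted.

```yaml
# Fragment of the Operator's Deployment in cockroach-operator-system.
# The container name is assumed for illustration.
spec:
  template:
    spec:
      containers:
        - name: cockroach-operator
          env:
            - name: WATCH_NAMESPACE
              value: cockroach-cluster   # namespace holding the CrdbCluster
---
# CrdbCluster in a different namespace than the Operator.
# apiVersion and metadata.name are assumptions; other required spec fields are omitted.
apiVersion: crdb.cockroachlabs.com/v1alpha1
kind: CrdbCluster
metadata:
  name: cockroachdb
  namespace: cockroach-cluster
spec:
  image:
    name: cockroachdb/cockroach:<new-version>   # placeholder; changing this triggers the upgrade
```

Changing spec.image.name (or spec.cockroachDBVersion) on the second resource is what starts the transition in which the failed DNS lookup shows up.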
Expected behavior
No error is seen when conducting the PartitionedUpdate action, and actions like a Cockroach server version upgrade complete successfully.
Anyone with this issue who wants a quick hackaround can:
- Create a Service in the operator namespace with the same name as your cluster's service
- Create an Endpoints resource that matches your service name and specifies the actual CockroachDB service IP
Examples: Service
apiVersion: v1
kind: Service
metadata:
  name: crdb-public
  namespace: cockroach-operator-system
spec:
  ports:
    - name: grpc
      port: 26258
      protocol: TCP
      targetPort: 26258
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
    - name: sql
      port: 26257
      protocol: TCP
      targetPort: 26257
  type: ClusterIP
Endpoints
apiVersion: v1
kind: Endpoints
metadata:
  name: crdb-public
  namespace: cockroach-operator-system
subsets:
  - addresses:
      - ip: 172.30.152.87  # <- your CockroachDB service IP
    ports:
      - name: grpc
        port: 26258
        protocol: TCP
      - name: http
        port: 8080
        protocol: TCP
      - name: sql
        port: 26257
        protocol: TCP
It's pretty nasty but gets the job done
An improvement over the workaround posted by @ethan-gallant is to create a Service of type ExternalName like this:
apiVersion: v1
kind: Service
metadata:
  name: cockroachdb-public
  namespace: cockroach-operator-system
spec:
  type: ExternalName
  sessionAffinity: None
  externalName: cockroachdb-public.cockroachdb.svc.cluster.local
  internalTrafficPolicy: Cluster
With this you don't need to add IPs or ports.
It is still a workaround, and we are looking forward to the fix in https://github.com/cockroachdb/cockroach-operator/pull/907. 😃
I have the same problem. Let me know if you find a solution.