strimzi-kafka-operator

[Enhancement]: Allow setting inter-broker advertised address to cluster-ip

Open ventsislav-georgiev opened this issue 6 months ago • 6 comments

Related problem

We are using Strimzi on GKE with Cloud DNS and occasionally have issues with Cloud DNS not propagating DNS records for headless services.

The issue prevents the cluster and entity operators from communicating with the brokers: we get constant java.net.UnknownHostException errors for the hostname.subdomain.namespace.svc requests.

The above issue is largely out of our hands, so what we would like to do instead is stop relying on headless services. Is it possible to set up the inter-broker (REPLICATION:9091) address with a ClusterIP service instead of relying on the Pod's FQDN?

Suggested solution

Using the GenericKafkaListenerConfigurationBroker with type: cluster-ip and broker.advertisedHost set to the service does exactly what we want for the 9092/9094 communication.

However, we cannot set it for the 9091 inter-broker communication. It would be great if we could use the same approach.

Alternatives

No response

Additional context

No response

ventsislav-georgiev avatar Jan 12 '24 14:01 ventsislav-georgiev

@scholzj Is there any "hacky" workaround to modify the end result of the broker's /opt/kafka/custom-config/server.config? For example:

##########
... 
# Common listener configuration
##########
...
advertised.listeners=...
...

We need this for a test environment where we create and destroy Kafka clusters many times as part of CI. So we are fine with using a non-production approach in order to bypass the issues with headless services in Cloud DNS.

ventsislav-georgiev avatar Jan 24 '24 08:01 ventsislav-georgiev

I think this is not just about the advertised hosts. You would also need to create the services in the right way etc. I expect that this will end up in the "needs proposal" state after it is triaged. Until then, you would need to fork the code and manage it yourself.

TBH, I'm not sure I understand the issue. Strimzi does not rely on any special DNS features, just the standard Kubernetes DNS patterns for addressing pods. Why do you expect the DNS names for a cluster IP service to work any better than the DNS resolution for the pod DNS names?

scholzj avatar Jan 24 '24 08:01 scholzj

It is just the way Cloud DNS works: it replaces kube-dns/CoreDNS, and all DNS resolution is done by GKE's metadata server. However, the issue we experience is that headless services are sometimes not registered in the DNS server (missing DNS records). For ClusterIP services it works properly and the A record is immediately created in Cloud DNS. The problem is sporadic and is probably related to how often we create and destroy such namespaces with services.

For the tests we are using a single broker that also works as the controller in KRaft mode. So we can set the config for the cluster operator to create the ClusterIP service for that broker, and we just need to redirect all network requests to use it.

For example, for a Kafka resource named strimzi and a KafkaNodePool named dual-role in a namespace named temp-ns-xxx, we currently have the following server.config:
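A minimal sketch of how the missing records can be observed from inside a pod, assuming Python is available there. The helper name resolves is hypothetical, and the two hostnames are the pod FQDN and the ClusterIP service name used as examples in this thread; this is only a diagnostic illustration, not something Strimzi provides.

```python
import socket


def resolves(hostname: str) -> bool:
    """Return True if the resolver returns at least one address for hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Raised when the name cannot be resolved (e.g. a missing DNS record).
        return False


if __name__ == "__main__":
    # Example names from this thread; substitute your own namespace and cluster.
    for host in (
        "strimzi-dual-role-0.strimzi-kafka-brokers.temp-ns-xxx.svc",  # pod FQDN via headless service
        "strimzi-kafka-broker.temp-ns-xxx.svc",  # ClusterIP service
    ):
        status = "resolves" if resolves(host) else "does NOT resolve"
        print(f"{host}: {status}")
```

When the problem occurs, the expectation from the description above is that the first name fails while the second one resolves.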

advertised.listeners=REPLICATION-9091://strimzi-dual-role-0.strimzi-kafka-brokers.temp-ns-xxx.svc:9091,PLAINSASL-9094://strimzi-dual-role-0.strimzi-kafka-brokers.temp-ns-xxx.svc:9094

Setting the listener configuration type to cluster-ip will create a service of type ClusterIP for the broker, and we need to update advertised.listeners to use the service instead:

advertised.listeners=REPLICATION-9091://strimzi-kafka-broker.temp-ns-xxx.svc:9091,PLAINSASL-9094://strimzi-kafka-broker.temp-ns-xxx.svc:9094
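To make the intended change concrete, here is a rough sketch of the string transformation between the two advertised.listeners lines above. The function rewrite_advertised_listeners is hypothetical and only illustrates the desired result; Strimzi generates this file itself and, as far as this thread establishes, offers no hook to apply such a rewrite.

```python
def rewrite_advertised_listeners(line: str, new_host: str) -> str:
    """Replace the host of every listener in an advertised.listeners line.

    Each entry has the form NAME://host:port; the listener name and port are kept.
    """
    key, _, value = line.partition("=")
    rewritten = []
    for entry in value.split(","):
        name, _, address = entry.partition("://")
        # Split on the last colon so only the port is separated from the host.
        _old_host, _, port = address.rpartition(":")
        rewritten.append(f"{name}://{new_host}:{port}")
    return f"{key}={','.join(rewritten)}"


original = (
    "advertised.listeners="
    "REPLICATION-9091://strimzi-dual-role-0.strimzi-kafka-brokers.temp-ns-xxx.svc:9091,"
    "PLAINSASL-9094://strimzi-dual-role-0.strimzi-kafka-brokers.temp-ns-xxx.svc:9094"
)

# Point both listeners at the ClusterIP service name used in this thread.
print(rewrite_advertised_listeners(original, "strimzi-kafka-broker.temp-ns-xxx.svc"))
```

Note that this only covers the advertised address; the matching ClusterIP service for port 9091 would still have to exist for the rewritten address to be reachable.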

ventsislav-georgiev avatar Jan 24 '24 08:01 ventsislav-georgiev

The issue has nothing to do with Kubernetes and Strimzi. We are just a bit out of options. Seems like it is reported here without progress: https://www.googlecloudcommunity.com/gc/Google-Kubernetes-Engine-GKE/GKE-autopilot-DNS-not-resolving/m-p/634344

ventsislav-georgiev avatar Jan 24 '24 09:01 ventsislav-georgiev

> The issue has nothing to do with Kubernetes and Strimzi. We are just a bit out of options. Seems like it is reported here without progress: https://www.googlecloudcommunity.com/gc/Google-Kubernetes-Engine-GKE/GKE-autopilot-DNS-not-resolving/m-p/634344

Well, resolving the various DNS names is core to Kubernetes from my point of view. So if that does not work properly, it is quite hard to deal with it.

To be honest, implementing something like this would be quite a major change, and I'm not sure we want to maintain and test functionality like that for years to come just to work around some Google issues.

scholzj avatar Jan 24 '24 09:01 scholzj

Triaged on the Strimzi Community call on 25.1.2024: There are some concerns about this:

  • This would be a lot of effort to implement and also maintain as an alternative path
  • The motivation seems to be questionable given it just seems to be a bug / limitation in one particular product

Should this be implemented, these things would need to be clarified in a proposal.

scholzj avatar Jan 25 '24 16:01 scholzj