strimzi-kafka-operator
[Enhancement] Advertise pod IPs in listener
I'm using Strimzi on AWS EKS with the VPC CNI which means my pod IPs are routable and accessible from outside of the Kubernetes cluster.
Ideally I'd like to be able to configure a listener to advertise the pod IPs so that clients outside of the cluster are able to connect to brokers directly rather than having to go through a nodeport service.
I'm struggling to figure out whether this is possible with the currently available overrides.
I'm afraid this is not possible right now.
It could in theory be implemented, but I guess it would be a bit non-trivial since the pod IP is not known before the pod is started. So we would need to get it on the fly. And that would also impact TLS hostname verification (which would need to be disabled if TLS is used). I'm also not sure how you would address bootstrapping - the pod IPs are volatile, so you would still need some service with a stable DNS name I guess?
At the moment my workaround is to use a nodeport service advertising the internal IP. I think this has all of the same issues as using the pod IP (it's just $nodeIP:$nodeport instead of $podip:$listenport).
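For context, this is roughly what that workaround looks like in the Kafka CR - a minimal sketch assuming the newer GenericKafkaListener schema; the listener name and port here are arbitrary:

```yaml
listeners:
  # nodeport listener advertising the nodes' internal IPs instead of external addresses
  - name: external
    port: 9094
    type: nodeport
    tls: false
    configuration:
      # prefer the node's InternalIP as the advertised address
      preferredNodePortAddressType: InternalIP
```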
Pod IP can be injected as an env variable pretty easily with the downward API for discovery.
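For reference, this is the standard Kubernetes downward API mechanism (plain Kubernetes, not something Strimzi wires up for listeners today); the env var name is just illustrative:

```yaml
# container spec snippet: expose the pod's IP to the container as an environment variable
env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
```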
I'm not using TLS so it's not an issue, but I assume TLS has the same problems with nodeport listeners?
I'm using Consul service sync to register the bootstrap service into Consul, which solves my bootstrapping problem. This actually registers the pod IPs and listener ports in Consul, so bootstrapping works ideally; clients then discover a nodePort IP:port from the bootstrapping endpoint.
I think this has all of the same issues as using the pod IP (it's just $nodeIP:$nodeport instead of $podip:$listenport).
It does. But unlike node ports, it is not yet implemented. Also, node ports have much wider use than this because they work everywhere. This network setup is a bit less common. So we need to consider the effort to develop it but also to maintain it.
If you want, we can change this to enhancement to keep it tracked. But I do not think I can make any promises to if/when we might get to this.
It does. But unlike node ports, it is not yet implemented. Also, node ports have much wider use than this because they work everywhere. This network setup is a bit less common. So we need to consider the effort to develop it but also to maintain it.
Yep, that's fair! It is the default networking setup for EKS, but I guess it's also only an issue for out-of-cluster access.
If you want, we can change this to enhancement to keep it tracked. But I do not think I can make any promises to if/when we might get to this.
That would be good, I think nodePort is an acceptable workaround for now too
This would also be convenient for Google's GKE k8s offering, as the networking works the same way.
+1 for this feature, as our pod IPs are routable and accessible from outside the Kubernetes cluster as well (we use this setup for GKE and EKS)
I'm afraid this is not possible right now.
It could in theory be implemented, but I guess it would be a bit non-trivial since the pod IP is not known before the pod is started. So we would need to get it on the fly. And that would also impact TLS hostname verification (which would need to be disabled if TLS is used). I'm also not sure how you would address bootstrapping - the pod IPs are volatile, so you would still need some service with a stable DNS name I guess?
@scholzj I don't know how complex it could be, but maybe you can get the pod IP from the init container as it is in the same network namespace.
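A rough sketch of that idea (purely illustrative, not how Strimzi is structured today): an init container shares the pod's network namespace, so it could record the pod IP to a shared volume before the broker container starts.

```yaml
# hypothetical init container writing the pod IP for the broker container to pick up
initContainers:
  - name: discover-pod-ip
    image: busybox
    # hostname -i resolves to the pod IP inside the pod's network namespace
    command: ["sh", "-c", "hostname -i > /shared/pod-ip"]
    volumeMounts:
      - name: shared
        mountPath: /shared
```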
+1 on this,
- (i) our application is on a different k8s cluster from the one Strimzi is running on;
- (ii) but they are on the same AWS VPC, with the CNI making the pod IPs directly addressable;
We had to use an external listener due to (i). There are some cons to all the options for external listeners:
- NodePort:
  - Traffic needs to route via kube-proxy; `externalTrafficPolicy` helps a bit, but there is still a slight overhead for broker traffic;
- Ingress/Svc:
  - Not quite cost effective, as a total of (N+1) LBs is required (number of brokers + bootstrap); this also easily runs into cloud provider quotas (AWS for example);
  - Introduces a consistency issue, as the cloud provider LB is unaware of the "readiness" of broker pods and can end up in a situation where broker pods are down from the Kafka client's POV because connectivity from LB -> broker pod is not up during a pod restart / rolling deployment;
  - [in AWS EKS] Can be somewhat mitigated by using NLB + IP targeting + readiness gates, at the cost of slower deployment time (the NLB has an issue where initial target registration takes around 3 min)
@scholzj
Contexts aside,
Apologies in advance if it's a noob question, but would a "middle ground" solution work here?
- Bootstrap is exposed via node port or LB, so that a stable DNS name can be bound;
- while the broker pod IP is returned in the metadata response?
As explained above, Strimzi does not currently support using pod IPs as advertised listener names. Neither does it support mixing different ways of exposing bootstrap and brokers (but if you want, you can create a bootstrap loadbalancer manually and just point it at the right port on the brokers).
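For the manual bootstrap loadbalancer mentioned above, a minimal sketch could look like the following (assuming a cluster named my-cluster and a plain listener on port 9092; the service name is arbitrary, and the selector uses the strimzi.io/name label that Strimzi puts on the broker pods):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-cluster-kafka-bootstrap-lb
spec:
  type: LoadBalancer
  # select all broker pods of the my-cluster Kafka cluster
  selector:
    strimzi.io/name: my-cluster-kafka
  ports:
    - name: kafka
      port: 9092
      targetPort: 9092
```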
PS: Please also keep in mind that there are reasons why we use pod DNS names over pod IPs => pod IPs are not stable and that tends to cause issues for Kafka clients during rolling updates etc.
PS: Please also keep in mind that there are reasons why we use pod DNS names over pod IPs => pod IPs are not stable and that tends to cause issues for Kafka clients during rolling updates etc.
Hi @scholzj
Thanks for the reply & context; it would be great if you could provide some insight into the issues with pod IPs?
My understanding was that using the pod IP should not be worse than DNS:
- when the listener is internal, the cluster-internal DNS resolves to the pod IP directly;
- when the listener is external, the client sees the external advertised hostname/IP depending on the setup (where the pod IP does not apply)
Regards,
The DNS names are stable. They do not change when you roll the pods, for example. So the broker always has the same address. The pod IPs, on the other hand, change with every restart of the broker pod. So every rolling update means a new address for the pod. Plus in some cases the pod IP addresses seem to be recycled very quickly - so the address which broker 1 had a few seconds ago is given to broker 2 a few seconds later.
Strimzi originally used pod IPs. But this kind of change was confusing the clients and causing problems with clients reconnecting after rolling updates. So we moved to the stable DNS names. (This was some time back of course, so I'm not sure how today's clients would deal with it.)
when the listener is internal, the cluster-internal DNS resolves to the pod IP directly;
It does. But while DNS has its own issues, it seems / seemed to work a lot better than the IPs being used directly.
when the listener is external, the client sees the external advertised hostname/IP depending on the setup (where the pod IP does not apply)
The external address - regardless of whether it is DNS or IP - normally does not change with every rolling update.
I can understand the advantages which the pod IPs might give you in some cases, as described above. So I do not think this is a blocker - it is just the reason why the default is what it is and why we probably don't want to change it. But I do not think we are opposed to having the pod IPs as an alternative to the DNS names, switched on by some flag. Just so far nobody has implemented it.
I see, thanks for the context, especially the IP recycling bit. I think I found the original issue regarding the IP -> DNS change (https://github.com/strimzi/strimzi-kafka-operator/issues/50), although there isn't an exact example of such a failure mode (it could relate to the client impl at the time).
Regarding the Kafka client dealing with broker pods whose IPs change / get recycled, my thoughts are:
- the IP should be fine to be re-used or recycled, as long as the broker id remains the same (which is a property of the StatefulSet);
- ZK should always see the "correct" pod IP even when the IP is recycled, so we can "trust" the response of the metadata request;
When a client makes a request, it should fall into one of four buckets:
- (i) The broker pod IP is the intended broker:
  - All is good, nothing happens and the client moves on;
- (ii) The pod IP is stale and there is no pod behind it:
  - The client will attempt to establish a TCP connection until timeout (using librdkafka as an example, it would be `socket.timeout.ms`); when this happens the client should be triggered to refresh metadata, and with ZK the metadata response should contain the new and correct pod IP for the specific broker;
  - this is the behaviour of librdkafka; can't say for all of the client impls, but I would imagine something similar;
- (iii) The pod got rolling-deployed while the client was making a request:
  - the client should see a broken TCP connection error and trigger a retry similar to (ii);
- (iv) The pod IP was recently recycled:
  - the client is making a request to the wrong broker, where the broker should reply with `RD_KAFKA_RESP_ERR_NOT_LEADER_FOR_PARTITION`/`NotLeaderForPartitionException` and the client should trigger a retry similar to (ii).
Happy to discuss the impl / how to submit a PR if ^ makes any sense.
Love to hear your thoughts,
Regards,
Yeah, what you described above is the theory of how it should work. But it wasn't as straightforward in reality. To be fair, I'm not an expert on the Kafka client, so I didn't deep dive into it to find out what exactly the issue was.
got it, thanks!
I will try to duct tape https://github.com/strimzi/strimzi-kafka-operator/blob/release-0.26.x/cluster-operator/src/main/java/io/strimzi/operator/cluster/model/KafkaBrokerConfigurationBuilder.java#L196 & https://github.com/strimzi/strimzi-kafka-operator/blob/release-0.26.x/docker-images/kafka/scripts/kafka_config_generator.sh#L44-L45 to expose the VPC-internal pod IP & dogfood this first to see how much of a difference it makes vs. node port.
Will circle back with the findings.
Triaged on 7.6.2022: There are some use cases where this makes sense, but not too many to make it our priority. If anyone wants to contribute it, feel free to work on this / get in touch. A proposal should be written first.
Strimzi 0.32.0 introduced the type: cluster-ip listener which uses a ClusterIP type service for each broker. Is that something that solves this use-case?
Strimzi 0.32.0 introduced the `type: cluster-ip` listener which uses a ClusterIP type service for each broker. Is that something that solves this use-case?
If I'm understanding the docs correctly, with ClusterIP the brokers will advertise a unique Kubernetes service DNS name?
That won't solve the problem for me as I want to access Kafka from outside of the current cluster, so those DNS entries won't resolve for my client apps.
I really just need the brokers to advertise that they are $PODIP:$LISTENPORT and everything will work.
Not a solution, but rather a workaround: it's possible to expose brokers within the VPC using a headless service + external-dns. Some example configuration (make sure port 9093 for this example is whitelisted in the worker node SecurityGroup):
Kafka CR configuration
Cluster name is test-cluster. So pod names are test-cluster-kafka-*
The advertised listener has to be redefined to match the DNS names which will be created by external-dns:
listeners:
  - name: clusterip
    port: 9093
    type: cluster-ip
    authentication:
      type: scram-sha-512
    tls: false
    configuration:
      brokers:
        - broker: 0
          advertisedHost: test-cluster-kafka-0.prelive.kafka.test.net
          advertisedPort: 9093
        - broker: 1
          advertisedHost: test-cluster-kafka-1.prelive.kafka.test.net
          advertisedPort: 9093
        - broker: 2
          advertisedHost: test-cluster-kafka-2.prelive.kafka.test.net
          advertisedPort: 9093
Headless service config
Using this service, external-dns will create DNS records in a pre-configured Route53 zone. The DNS names will have the following format: POD_NAME.prelive.kafka.test.net
---
apiVersion: v1
kind: Service
metadata:
  name: cluster-headless
  annotations:
    external-dns.alpha.kubernetes.io/hostname: prelive.kafka.test.net
    external-dns.alpha.kubernetes.io/ttl: "60"
spec:
  ports:
    - port: 9093
      name: tcp-clusterip
  clusterIP: None
  selector:
    strimzi.io/component-type: kafka