strimzi-kafka-operator

[Enhancement] Advertise pod IPs in listener

Open hamishforbes opened this issue 4 years ago • 19 comments

I'm using Strimzi on AWS EKS with the VPC CNI, which means my pod IPs are routable and accessible from outside of the Kubernetes cluster.

Ideally I'd like to be able to configure a listener to advertise the pod IPs so that clients outside of the cluster can connect to brokers directly rather than having to go through a nodeport service.

I'm struggling to figure out whether this is possible with the currently available overrides.

hamishforbes avatar Nov 19 '20 23:11 hamishforbes

I'm afraid this is not possible right now.

It could in theory be implemented, but I guess it would be a bit non-trivial since the pod IP is not known before the pod is started, so we would need to get it on the fly. That would also impact TLS hostname verification (which would need to be disabled if TLS is used). I'm also not sure how you would address bootstrapping - the pod IPs are volatile, so you would still need some service with a stable DNS name, I guess?

scholzj avatar Nov 19 '20 23:11 scholzj

At the moment my workaround is to use a nodeport service advertising the internal IP. I think this has all of the same issues as using the pod IP (it's just $nodeIP:$nodeport instead of $podip:$listenport).
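
For reference, a rough sketch of that workaround as a listener configuration, assuming the preferredNodePortAddressType configuration field (the listener name and port here are illustrative):

    listeners:
      - name: external
        port: 9094
        type: nodeport
        tls: false
        configuration:
          # ask Strimzi to advertise the nodes' internal IPs
          # instead of their external addresses
          preferredNodePortAddressType: InternalIP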

Pod IP can be injected as an env variable pretty easily with the downward API for discovery.
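
A minimal sketch of that downward API injection, in case it is useful (the env var name POD_IP is arbitrary):

    env:
      - name: POD_IP
        valueFrom:
          fieldRef:
            # the kubelet resolves this to the pod's IP at start-up
            fieldPath: status.podIP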

I'm not using TLS so it's not an issue, but I assume TLS has the same problems with nodeport listeners?

I'm using Consul service sync to register the bootstrap service into Consul, which solves my bootstrapping problem. This actually registers the pod IPs and listener ports in Consul, so bootstrapping works ideally; clients then discover a nodePort IP:port from the bootstrapping endpoint.

hamishforbes avatar Nov 19 '20 23:11 hamishforbes

I think this has all of the same issues as using the pod IP (it's just $nodeIP:$nodeport instead of $podip:$listenport).

It does. But unlike node ports, it is not yet implemented. Also, node ports have much wider use than this because they work everywhere; this network setup is a bit less common. So we need to consider the effort to develop it but also to maintain it.

If you want, we can change this to an enhancement to keep it tracked. But I do not think I can make any promises as to if/when we might get to this.

scholzj avatar Nov 20 '20 10:11 scholzj

It does. But unlike node ports, it is not yet implemented. Also, node ports have much wider use than this because they work everywhere; this network setup is a bit less common. So we need to consider the effort to develop it but also to maintain it.

Yep, that's fair! It is the default networking setup for EKS, but I guess it's also only an issue for out-of-cluster access.

If you want, we can change this to an enhancement to keep it tracked. But I do not think I can make any promises as to if/when we might get to this.

That would be good; I think nodePort is an acceptable workaround for now too.

hamishforbes avatar Nov 20 '20 20:11 hamishforbes

This would also be convenient for Google's GKE Kubernetes offering, as the networking works the same way.

craiglservin avatar May 21 '21 18:05 craiglservin

+1 for this feature, as our pod IPs are routable and accessible from outside the Kubernetes cluster as well (we use this setup for GKE and EKS)

aryklein avatar Jul 26 '21 20:07 aryklein

I'm afraid this is not possible right now.

It could in theory be implemented, but I guess it would be a bit non-trivial since the pod IP is not known before the pod is started, so we would need to get it on the fly. That would also impact TLS hostname verification (which would need to be disabled if TLS is used). I'm also not sure how you would address bootstrapping - the pod IPs are volatile, so you would still need some service with a stable DNS name, I guess?

@scholzj I don't know how complex it could be, but maybe you can get the pod IP from the init container as it is in the same network namespace.
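
To illustrate the idea (the container, image, and volume names here are made up; this is not Strimzi's actual init container): because an init container shares the pod's network namespace, it could detect the pod IP at start-up and hand it to the broker, e.g. via a shared volume.

    initContainers:
      - name: detect-pod-ip
        image: busybox:1.36
        command:
          - sh
          - -c
          # hostname -i resolves to the pod IP inside the pod's network
          # namespace; write it where the broker container can read it
          - hostname -i > /shared/pod-ip
        volumeMounts:
          - name: shared
            mountPath: /shared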

aryklein avatar Jul 27 '21 14:07 aryklein

I think this has all of the same issues as using the pod IP (it's just $nodeIP:$nodeport instead of $podip:$listenport).

It does. But unlike node ports, it is not yet implemented. Also, node ports have much wider use than this because they work everywhere; this network setup is a bit less common. So we need to consider the effort to develop it but also to maintain it.

If you want, we can change this to an enhancement to keep it tracked. But I do not think I can make any promises as to if/when we might get to this.

+1 on this,

    1. Our application is on a different k8s cluster from the one Strimzi is running on;
    2. but they are on the same AWS VPC, with the CNI making the pod IPs directly addressable.

We had to use an external listener due to (1); there are some cons to all of the options for external listeners:

  • NodePort:
    • Traffic needs to be routed via kube-proxy; externalTrafficPolicy helps a bit, but there is still a slight overhead for broker traffic.
  • Ingress/Svc:
    • Not quite cost-effective, as a total of (N+1) LBs is required (number of brokers + bootstrap); this could also easily run into cloud provider quotas (AWS, for example).
    • Introduces a consistency issue, as the cloud provider LB is unaware of the "readiness" of broker pods and can end up in a situation where broker pods are down from the Kafka client's POV because connectivity from the LB to the broker pod is not up during a pod restart / rolling deployment.
      • [In AWS EKS] Can be somewhat mitigated by using NLB + IP targeting + readiness gates, at the cost of slower deployment time (the NLB has an issue where initial registration takes around 3 minutes).

@scholzj

Contexts aside,
Apologies in advance if it's a noob question, but would a "middle ground" solution work here?

  • Bootstrap is exposed via a node port or LB, so that a stable DNS name can be bound;
  • while broker pod IPs are returned in the metadata response?

PaulLiang1 avatar Nov 08 '21 05:11 PaulLiang1

As explained above, Strimzi does not currently support using pod IPs as advertised listener names. Nor does it support mixing different ways of exposing bootstrap and brokers (but if you want, you can create a bootstrap load balancer manually and just point it to the right port on the brokers).
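
For anyone who wants to go the manual bootstrap load balancer route, a minimal sketch, assuming a cluster named my-cluster and an external listener on port 9094 (the labels and ports depend on your Strimzi version and setup):

    apiVersion: v1
    kind: Service
    metadata:
      name: my-cluster-kafka-bootstrap-lb
    spec:
      type: LoadBalancer
      # select all broker pods of the cluster; any broker can answer the
      # initial bootstrap / metadata request
      selector:
        strimzi.io/cluster: my-cluster
        strimzi.io/name: my-cluster-kafka
      ports:
        - name: tcp-external
          port: 9094
          targetPort: 9094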

scholzj avatar Nov 08 '21 08:11 scholzj

PS: Please also keep in mind that there are reasons why we use pod DNS names over pod IPs => pod IPs are not stable, and that tends to cause issues for Kafka clients during rolling updates etc.

scholzj avatar Nov 08 '21 08:11 scholzj

PS: Please also keep in mind that there are reasons why we use pod DNS names over pod IPs => pod IPs are not stable, and that tends to cause issues for Kafka clients during rolling updates etc.

Hi @scholzj

Thanks for the reply & context. It would be great if you could provide some insights regarding the issues with pod IPs.

My understanding was that using pod IPs should not be worse than DNS:

  • when the listener is internal, the cluster-internal DNS resolves to the pod IP directly;
  • when the listener is external, the client sees the external advertised hostname/IP depending on the setup (where the pod IP does not apply)

Regards,

PaulLiang1 avatar Nov 08 '21 10:11 PaulLiang1

The DNS names are stable. They do not change when you roll the pods, for example, so the broker always has the same address. The pod IPs, on the other hand, change with every restart of the broker pod, so every rolling update means a new address for the pod. Plus, in some cases the pod IP addresses seem to be recycled very quickly - so the address which broker 1 had a few seconds ago is given to broker 2 a few seconds later.

Strimzi originally used pod IPs. But this kind of change was confusing the clients and causing problems with clients reconnecting after rolling updates, so we moved to the stable DNS names. (This was some time back, of course, so I'm not sure how today's clients would deal with it.)

when the listener is internal, the cluster-internal DNS resolves to the pod IP directly;

It does. But while DNS has its own issues, it seems / seemed to work a lot better than the IPs being used directly.

when the listener is external, the client sees the external advertised hostname/IP depending on the setup (where the pod IP does not apply)

The external address - regardless of whether it is DNS or IP - normally does not change with every rolling update.


I can understand the advantages which the pod IPs might give you in some cases as described above. So I do not think this is a blocker - it is just a reason why the default is what it is and why we probably don't want to change it. But I do not think we are opposed to having the pod IPs as an alternative to the DNS names, switched on by some flag. It's just that so far nobody has implemented it.

scholzj avatar Nov 08 '21 11:11 scholzj

I see, thanks for the context, especially the IP recycling bit. I think I found the original issue regarding the IP -> DNS change (https://github.com/strimzi/strimzi-kafka-operator/issues/50), although there isn't an exact example of such a failure mode (it could relate to the client impl at the time).


Regarding the Kafka client responding to broker pods with changing / recycled IPs, my thoughts are:

  • The IP should be fine to be re-used or recycled, as long as the broker id remains the same (which is a property of the StatefulSet);
  • ZK should always see the "correct" pod IP even if the IP is recycled, so that we can "trust" the response of the metadata request.

When a client makes a request, it should fall into one of four buckets:

    1. The pod IP is the intended broker:
      • All is good; nothing happens and the client moves on.
    2. The pod IP is stale and there is no pod behind it:
      • The client will attempt to establish a TCP connection until it times out (using librdkafka as an example, that would be socket.timeout.ms);
      • When this happens the client should be triggered to refresh its metadata; with ZK, the metadata response should contain the new and correct pod IP for the specific broker.
    3. The pod got rolling-deployed while the client is making a request:
      • The client should see a broken TCP connection error and trigger a retry similar to (2).
    4. The pod IP was recently recycled:
      • The client is making a request to the wrong broker, so the broker should reply with RD_KAFKA_RESP_ERR_NOT_LEADER_FOR_PARTITION / NotLeaderForPartitionException and the client should trigger a retry similar to (2).

Happy to discuss the impl / a way to submit a PR if the above makes any sense.

Love to hear your thoughts,

Regards,

PaulLiang1 avatar Nov 08 '21 12:11 PaulLiang1

Yeah, what you described above is the theory of how it should work. But it wasn't as straightforward in reality. To be fair, I'm not an expert on the Kafka client, so I didn't deep dive into it to find out what exactly the issue was.

scholzj avatar Nov 08 '21 12:11 scholzj

got it, thanks!

I will try to duct tape https://github.com/strimzi/strimzi-kafka-operator/blob/release-0.26.x/cluster-operator/src/main/java/io/strimzi/operator/cluster/model/KafkaBrokerConfigurationBuilder.java#L196 & https://github.com/strimzi/strimzi-kafka-operator/blob/release-0.26.x/docker-images/kafka/scripts/kafka_config_generator.sh#L44-L45 to expose the VPC-internal pod IP & dogfood this first to see how much of a difference it makes compared with node port.

Will circle back with the findings.

PaulLiang1 avatar Nov 08 '21 13:11 PaulLiang1

Triaged on 7.6.2022: There are some use cases where this makes sense, but not enough to make it our priority. If anyone wants to contribute it, feel free to work on this / get in touch. A proposal should be written first.

scholzj avatar Jun 07 '22 14:06 scholzj

Strimzi 0.32.0 introduced the type: cluster-ip listener, which uses a ClusterIP-type service for each broker. Is that something that solves this use case?

scholzj avatar Jul 15 '23 17:07 scholzj

Strimzi 0.32.0 introduced the type: cluster-ip listener, which uses a ClusterIP-type service for each broker. Is that something that solves this use case?

If I'm understanding the docs correctly, with ClusterIP the brokers will advertise a unique Kubernetes service DNS name?

That won't solve the problem for me as I want to access Kafka from outside of the current cluster, so those DNS entries won't resolve for my client apps.

I really just need the brokers to advertise that they are $PODIP:$LISTENPORT and everything will work.

hamishforbes avatar Jul 17 '23 21:07 hamishforbes

Not a solution, but rather a workaround: it's possible to expose brokers within the VPC using a headless service + external-dns. Some example configuration (make sure port 9093 for this example is whitelisted in the worker node SecurityGroup):

Kafka CR configuration

The cluster name is test-cluster, so the pod names are test-cluster-kafka-*. Advertised listeners have to be redefined to match the DNS names which will be created by external-dns.

    listeners:
      - name: clusterip
        port: 9093
        type: cluster-ip
        authentication:
          type: scram-sha-512
        tls: false
        configuration:
          brokers:
            - broker: 0
              advertisedHost: test-cluster-kafka-0.prelive.kafka.test.net
              advertisedPort: 9093
            - broker: 1
              advertisedHost: test-cluster-kafka-1.prelive.kafka.test.net
              advertisedPort: 9093
            - broker: 2
              advertisedHost: test-cluster-kafka-2.prelive.kafka.test.net
              advertisedPort: 9093
Headless service config

Using this service, external-dns will create DNS records in a pre-configured Route53 zone. DNS names will have the following format: POD_NAME.prelive.kafka.test.net

    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: cluster-headless
      annotations:
        external-dns.alpha.kubernetes.io/hostname: prelive.kafka.test.net
        external-dns.alpha.kubernetes.io/ttl: "60"
    spec:
      ports:
        - port: 9093
          name: tcp-clusterip
      clusterIP: None
      selector:
        strimzi.io/component-type: kafka
This also allows making an "external" (rather, internal to the VPC 😄) listener without TLS.

timchenko-a avatar Aug 24 '23 12:08 timchenko-a