
Telepresence Fails to Proxy Local Outbound Traffic to In-Cluster Kafka Broker behind Headless Service

Open revero-doug opened this issue 2 years ago • 2 comments

Describe the bug

DNS successfully resolves the pod IP of the single broker running in my Kafka cluster, but attempts to connect to that broker time out. The cluster uses the Strimzi kafka-ephemeral-single example with useServiceDnsDomain: true set as described in #1722. I verified the DNS result with the dns module in a simple test process based on the example in KafkaJS's README.

When I create a headful service (one with a cluster IP) to proxy requests to that pod and set advertisedHost to that service's FQDN, it works as expected. I'm running the default minikube config on a Darwin MacBook Pro. AFAICT the only input that changes between the failing and passing setups is headless versus headful service. IP-wise, services and pods are on different subnets, which is another differentiating factor between headless and headful services in this setup.

tl;dr: Telepresence + Strimzi Kafka Operator + Basic/Default Kafka manifest + darwin MBP + default minikube cluster = does not work without workarounds in addition to useServiceDnsDomain: true

To Reproduce

Steps to reproduce the behavior:

Prepare environment

  1. use an MBP with an M1 Pro chip (darwin)
  2. install Docker Desktop
  3. create a minikube cluster with default config
  4. brew install helm
  5. brew install datawire/blackbird/telepresence
  6. helm repo add strimzi https://strimzi.io/charts/
  7. helm install --create-namespace --namespace strimzi --set "watchNamespaces={default}" strimzi strimzi/strimzi-kafka-operator
  8. wget https://raw.githubusercontent.com/strimzi/strimzi-kafka-operator/0.31.1/examples/kafka/kafka-ephemeral-single.yaml and modify as follows
    • set first listener configuration to {"useServiceDnsDomain": true} (this works around original issue described in #1722)
  9. plug in the bootstrap service fqdn as the sole item in the brokers array in the example here: https://github.com/tulios/kafkajs
  10. run locally with telepresence intercepting an unrelated, isolated service in the same namespace
  11. optional: import and use the dns module to observe that DNS is resolving correctly but the IP is unreachable

Expected behavior

Running the trivial test program from KafkaJS's README locally, with the bootstrap FQDNs configured correctly, resolves DNS and is able to route to the broker, logging some messages until the process is killed.

Versions (please complete the following information):

  • telepresence v2.7.6
  • macOS 12.6 on MBP (14-inch 2021) with M1 Pro chip
  • minikube v1.27.1; k8s client version: v1.25.3; kustomize version: v4.5.7; k8s server version: v1.24.3

VPN-related bugs: n/a

Additional context

  • creating a headful service in front of the single broker pod (and adjusting the Kafka resource definition to advertise that host) enables correct functioning
  • pod IPs and headful service cluster IPs are on different subnets
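To make the subnet observation concrete, here is a small sketch that checks whether an IPv4 address falls inside a CIDR range. All addresses below are hypothetical placeholders, and `10.96.0.0/12` is assumed to be the service range of a default minikube cluster; substitute the values from your own cluster (for example from `kubectl get pods -o wide` and `kubectl get svc`).

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
  const octets = ip.split(".").map(Number);
  const valid =
    octets.length === 4 &&
    octets.every((o) => Number.isInteger(o) && o >= 0 && o <= 255);
  if (!valid) {
    throw new Error(`not an IPv4 address: ${ip}`);
  }
  return octets.reduce((acc, o) => acc * 256 + o, 0);
}

// True when `ip` falls inside `cidr` (for example "10.96.0.0/12").
function inCidr(ip, cidr) {
  const [base, bitsText] = cidr.split("/");
  const size = 2 ** (32 - Number(bitsText)); // number of addresses in the block
  const start = Math.floor(ipv4ToInt(base) / size) * size;
  const value = ipv4ToInt(ip);
  return value >= start && value < start + size;
}

// Hypothetical values: replace with the ranges and addresses from your cluster.
const serviceCidr = "10.96.0.0/12"; // assumed service range
const podIp = "172.17.0.5"; // placeholder pod IP
const clusterIp = "10.101.12.34"; // placeholder headful service cluster IP

console.log("pod IP in service range:", inCidr(podIp, serviceCidr));
console.log("cluster IP in service range:", inCidr(clusterIp, serviceCidr));
```

If the pod IP lands outside the service range while the cluster IP lands inside it, that matches the observation above that only addresses in the service subnet are reachable from the local machine.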

Workaround

  1. merge { "brokers": [{"broker": 0, "advertisedHost": "my-cluster-kafka-broker-proxy.default.svc.cluster.local"}]} into the first listener's configuration in the Kafka resource
  2. apply the proxy Service my-cluster-kafka-broker-proxy shown after this list
  3. plug in the bootstrap service fqdn as the sole item in the brokers array in the example here: https://github.com/tulios/kafkajs
  4. run locally with telepresence intercepting an unrelated, isolated service in the same namespace

apiVersion: v1
kind: Service
metadata:
  name: my-cluster-kafka-broker-proxy
spec:
  ports:
    - port: 9092
      protocol: TCP
      targetPort: 9092
  selector:
    app.kubernetes.io/instance: my-cluster
    app.kubernetes.io/part-of: strimzi-my-cluster
    strimzi.io/name: my-cluster-kafka
    strimzi.io/pod-name: my-cluster-kafka-0
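
The merge in workaround step 1 can be illustrated with a short sketch. The `withAdvertisedHost` helper is hypothetical, and the listener fields (`name`, `port`, `type`, `tls`) are assumed from the Strimzi kafka-ephemeral-single example rather than taken from this report; only `useServiceDnsDomain`, `brokers`, `broker`, and `advertisedHost` appear above.

```javascript
// Return a copy of `listener` whose configuration advertises `host` for one broker.
function withAdvertisedHost(listener, brokerId, host) {
  const configuration = listener.configuration || {};
  // Drop any existing override for this broker so it is not duplicated.
  const others = (configuration.brokers || []).filter(
    (b) => b.broker !== brokerId
  );
  return {
    ...listener,
    configuration: {
      ...configuration,
      brokers: [...others, { broker: brokerId, advertisedHost: host }],
    },
  };
}

// Assumed shape of the first listener after enabling useServiceDnsDomain.
const listener = {
  name: "plain",
  port: 9092,
  type: "internal",
  tls: false,
  configuration: { useServiceDnsDomain: true },
};

const patched = withAdvertisedHost(
  listener,
  0,
  "my-cluster-kafka-broker-proxy.default.svc.cluster.local"
);

// Print the merged listener as it would appear in the Kafka resource.
console.log(JSON.stringify(patched, null, 2));
```

The printed object keeps `useServiceDnsDomain: true` and adds the `brokers` override, which is the listener configuration the workaround relies on.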

revero-doug, Oct 14 '22 17:10

@cindymullins-dw can you explain why this is categorized as a feature request rather than a bug report, as it was intended? Comments from maintainers as well as the documentation suggest this should already work; if it's a known limitation, I'd suggest treating this gap as a documentation bug.

revero-doug, Nov 23 '22 20:11

maintainers - please address this as a bug, not a feature. all documentation points to this use case being supported.

revero-doug, Dec 14 '22 22:12