pulsar-helm-chart

How to connect directly to the broker without a proxy

Open youzipi opened this issue 1 year ago • 10 comments

I would prefer not to use a proxy, but I found that the chart does not have an ingress template for the broker.


For now, I deploy an ingress for the broker separately.

youzipi avatar Jan 05 '24 03:01 youzipi

An Ingress probably wouldn't make sense for Pulsar brokers, at least for the binary protocol. For the Pulsar Admin API it would be a feasible approach. The http/https protocol can also be used for topic lookups, so an HTTP(S) endpoint exposed through an Ingress would be sufficient as the client's "serviceUrl". However, the Pulsar binary protocol would require a different approach.

You could use k8s node ports together with Pulsar's "advertisedListeners" feature: https://pulsar.apache.org/docs/3.1.x/concepts-multiple-advertised-listeners/#advertised-listeners. However, configuring that would require some custom integration work to make it function in a Pulsar k8s deployment.
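As a rough illustration of the advertisedListeners idea, the sketch below builds the two broker settings described in the linked documentation (`advertisedListeners` and `internalListenerName`). The function name, the listener names `internal` and `external`, the host name, the IP address and the node port are all assumptions made up for this example, not values from the chart.

```python
def advertised_listener_settings(internal_host, node_ip, node_port,
                                 internal_port=6650,
                                 internal_name="internal",
                                 external_name="external"):
    """Return broker settings that advertise an in-cluster address and a
    node IP + node port address under two separate listener names."""
    listeners = [
        f"{internal_name}:pulsar://{internal_host}:{internal_port}",
        f"{external_name}:pulsar://{node_ip}:{node_port}",
    ]
    return {
        "advertisedListeners": ",".join(listeners),
        # Brokers talk to each other through the internal listener.
        "internalListenerName": internal_name,
    }


# All values below are hypothetical placeholders.
settings = advertised_listener_settings(
    internal_host="pulsar-broker-0.pulsar-broker.pulsar.svc.cluster.local",
    node_ip="203.0.113.10",  # documentation-range IP standing in for a node address
    node_port=30650,
)
for key, value in settings.items():
    print(f"{key}={value}")
```

A client outside the cluster would then select the `external` listener by name in its client configuration, so that lookups return the node address rather than the in-cluster one.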

Another possibility is the SNI routing feature, used together with a proxy that supports SNI proxying (for example, Apache Traffic Server or Nginx): https://pulsar.apache.org/docs/3.1.x/concepts-proxy-sni-routing/

lhotari avatar Jan 17 '24 11:01 lhotari

It would make sense to have a load balancer for the broker service that is used for lookups, since the binary protocol is more efficient than the REST API for lookups. The individual brokers also need to be directly addressable, and that part needs its own solution. I'd like to see an experiment with the nodeport + advertisedListeners approach. I guess that would be feasible in cloud-managed k8s environments where it is possible to expose a k8s node with a routable address that the client can reach.

lhotari avatar Jan 19 '24 10:01 lhotari

One problem with the Pulsar Proxy is that it adds multiple cross-AZ hops, which incur network transfer costs in cloud k8s environments.

lhotari avatar Jan 19 '24 10:01 lhotari

Adding some more context here about the Pulsar Proxy.

https://pulsar.apache.org/docs/3.1.x/administration-proxy/ "Pulsar proxy is used when direct connections between clients and Pulsar brokers are either infeasible or undesirable"

For the "undesirable" part: At least in the past, some companies have had network security policies that emphasize network perimeter security. Their reference architectures require a minimal proxy component for inbound network traffic, placed in a DMZ between two firewalls, with minimal access to any other components. Many companies still have such security policies in place.

When the Apache Pulsar PMC was handling the Pulsar Proxy security vulnerability https://pulsar.apache.org/security/CVE-2022-24280/, it was decided to add a notice to https://pulsar.apache.org/docs/3.1.x/administration-proxy/ that the Pulsar Proxy isn't designed to be exposed directly on the public internet: "The Pulsar proxy is not intended to be exposed on the public internet. The security considerations in the current design expect network perimeter security. The requirement of network perimeter security can be achieved with private networks."

For the "infeasible" part: This is probably about laziness. When something works, many don't care to optimize or improve the solution. The Pulsar Proxy is very easy to deploy in k8s, as we can see in the Apache Pulsar Helm Chart.

Direct connections to brokers could be achieved with advertisedListeners and node ports. It would be great to have a solution where this could be automated. The node port solution requires that each node has an address that is routable from the clients. Since individual brokers don't require stable names, it would be sufficient to advertise the node IP and node port.
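One way the Kubernetes side of such automation might look is one NodePort Service per broker pod. The sketch below only generates the manifests as plain Python dictionaries and prints them as JSON. The pod names, namespace, port numbers and the `-external` suffix are hypothetical; the `statefulset.kubernetes.io/pod-name` label is the one Kubernetes sets on StatefulSet pods. Using `externalTrafficPolicy: Local` is a design assumption of this sketch, not something the chart does.

```python
import json


def broker_nodeport_service(pod_name, node_port, namespace="pulsar", broker_port=6650):
    """Build a NodePort Service manifest that targets exactly one broker pod."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"{pod_name}-external", "namespace": namespace},
        "spec": {
            "type": "NodePort",
            # Only nodes that run the selected pod accept the traffic,
            # so the node IP of the broker's own node must be advertised.
            "externalTrafficPolicy": "Local",
            "selector": {"statefulset.kubernetes.io/pod-name": pod_name},
            "ports": [{
                "name": "pulsar",
                "port": broker_port,
                "targetPort": broker_port,
                "nodePort": node_port,
            }],
        },
    }


# Hypothetical: three brokers named pulsar-broker-0..2, node ports 30650..30652.
services = [
    broker_nodeport_service(f"pulsar-broker-{i}", 30650 + i)
    for i in range(3)
]
print(json.dumps(services, indent=2))
```

Each broker would still need to advertise its own node IP and node port, which is the part that has no ready-made automation in the chart today.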

Lookups could use the REST API exposed through an ingress. Alternatively, a load balancer in front of the brokers could serve lookups over the binary protocol, which would be more efficient.

Another reason for a proxy-like component is handling lookups and federating multiple broker clusters into a single large cluster from the client's perspective. Pulsar used to have a component called "pulsar-discovery". It was removed by https://github.com/apache/pulsar/pull/12119, and there is a discussion in https://github.com/apache/pulsar/issues/15225 about restoring it.

lhotari avatar Jan 19 '24 11:01 lhotari

Slightly related: Issue #437 describes a current problem with the headless broker service. It should be addressed by adding a second ClusterIP service for lookups and making the headless broker service use publishNotReadyAddresses: true.
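A minimal sketch of how that pair of services could be shaped, based only on the description above and not on the details of #437. The service names, namespace, selector label and the `-lookup` suffix are assumptions for illustration; `clusterIP: None` and `publishNotReadyAddresses` are standard Kubernetes Service fields.

```python
import json


def broker_services(name="pulsar-broker", namespace="pulsar", selector=None):
    """Return (headless, lookup) Service manifests for the brokers."""
    selector = selector or {"component": "broker"}  # hypothetical label
    ports = [
        {"name": "pulsar", "port": 6650},
        {"name": "http", "port": 8080},
    ]
    headless = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "clusterIP": "None",  # headless: per-pod DNS records
            # Publish pod addresses even before the pods report ready.
            "publishNotReadyAddresses": True,
            "selector": selector,
            "ports": ports,
        },
    }
    lookup = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"{name}-lookup", "namespace": namespace},
        "spec": {
            "type": "ClusterIP",  # only ready brokers receive lookup traffic
            "selector": selector,
            "ports": ports,
        },
    }
    return headless, lookup


headless, lookup = broker_services()
print(json.dumps([headless, lookup], indent=2))
```

The split keeps per-broker DNS names resolvable at all times while lookups go only to brokers that pass their readiness checks.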

lhotari avatar Jan 19 '24 11:01 lhotari