pulsar-helm-chart

DNS resolution errors with broker host names returned by Pulsar lookups

Open lhotari opened this issue 2 years ago • 2 comments

There's currently a conflict between the Pulsar k8s deployment and how Pulsar load balancing works.

When a Pulsar broker starts, it registers itself as a broker in the internal Pulsar load balancer. The load balancer might immediately assign namespace bundles to the new broker, and the topics in those bundles might immediately receive requests.

The conflict is that, with the current settings, DNS resolution for the broker's host name fails until the broker's readiness probe succeeds.

Pulsar might already return the host name of a specific broker to a client, but the client cannot resolve the DNS name because the broker's readiness probe hasn't passed yet. This causes extra delays and can also trigger bugs when connecting to topics after a load balancing event. Pulsar clients usually back off and retry. For Admin API HTTP requests, however, clients might not handle these errors properly; for example, Pulsar Proxy fails the request when there's a DNS lookup issue.

Solution: the broker StatefulSet's service should use publishNotReadyAddresses: true
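As a minimal sketch of that setting, a headless service for the broker StatefulSet could look like the manifest below. The service name and the selector label are placeholders and are not taken from the chart; 6650 and 8080 are Pulsar's default non-TLS binary and HTTP ports.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: pulsar-broker            # hypothetical name, not the chart's
spec:
  clusterIP: None                # headless: gives each broker pod its own DNS record
  # Publish endpoint addresses (and therefore per-pod DNS records)
  # even while the pod's readiness probe has not passed yet.
  publishNotReadyAddresses: true
  selector:
    component: broker            # hypothetical label
  ports:
    - name: pulsar
      port: 6650
    - name: http
      port: 8080
```

With this flag set, a broker's host name should become resolvable once the pod has an IP address, instead of only after the readiness probe succeeds.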

There's useful information about StatefulSets and the publishNotReadyAddresses setting here: https://github.com/k8ssandra/cass-operator/pull/18

There's an alternative solution in #198, which is fine for cases where TLS is disabled for brokers. When TLS is used, stable host names are required so that host name verification can be done against the certificates.

lhotari • Apr 27 '22 09:04

I ran an experiment that adds a new service and makes the broker StatefulSet use it: https://github.com/datastax/pulsar-helm-chart/commit/259341cedea8c905544b44019d7d13f30508f365

The problem is that it's not possible to change the serviceName of an existing StatefulSet:

Error: UPGRADE FAILED: cannot patch "pulsar-testenv-pulsar-broker" with kind StatefulSet: StatefulSet.apps "pulsar-testenv-pulsar-broker" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden
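For context, the field in question is spec.serviceName, which names the governing service that gives each pod its stable DNS name. The manifest below is an illustrative sketch only: the serviceName value, the label, the replica count, and the image are assumptions, not values copied from the chart, and the container spec is trimmed to the minimum.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pulsar-testenv-pulsar-broker
spec:
  # Not in the list of updatable fields from the error message above,
  # so changing this value in a Helm upgrade is rejected by the API server.
  serviceName: pulsar-testenv-pulsar-broker   # assumed current value
  replicas: 3                                  # assumed
  selector:
    matchLabels:
      component: broker                        # hypothetical label
  template:
    metadata:
      labels:
        component: broker
    spec:
      containers:
        - name: broker
          image: apachepulsar/pulsar:2.10.0    # placeholder; the chart's image may differ
```

Because serviceName cannot be patched in place, pointing the StatefulSet at a differently named service would require deleting and recreating the StatefulSet.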

We would like to have two services for the broker StatefulSet:

  • one service that uses publishNotReadyAddresses: true
  • another service that doesn't use publishNotReadyAddresses: true. This one would route traffic hitting the service only to brokers that pass the readiness probe.

It doesn't seem possible to meet the above requirements while keeping backwards compatibility for existing deployments.

lhotari • Apr 27 '22 09:04

@lhotari To support the upgrade path, can you switch the purposes of the services? That way you don't have to modify the StatefulSet: use the existing name for the service that does use the publishNotReadyAddresses: true setting, and add a new service that doesn't. The proxy should point to the service that only routes traffic when the broker is ready, so that the proxy doesn't send traffic to a broker that can't handle it.
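A rough sketch of that layout follows, under stated assumptions: the existing service is taken to be headless already and to share its name with the StatefulSet, and the second service name and the selector label are hypothetical.

```yaml
# Existing service name, still referenced by the StatefulSet's serviceName,
# so the StatefulSet itself does not need to change.
apiVersion: v1
kind: Service
metadata:
  name: pulsar-testenv-pulsar-broker         # assumed existing name
spec:
  clusterIP: None
  publishNotReadyAddresses: true             # broker host names resolve before readiness
  selector:
    component: broker                        # hypothetical label
  ports:
    - name: pulsar
      port: 6650
    - name: http
      port: 8080
---
# New service for the proxy. publishNotReadyAddresses is left at its
# default (false), so only brokers passing the readiness probe get traffic.
apiVersion: v1
kind: Service
metadata:
  name: pulsar-testenv-pulsar-broker-ready   # hypothetical new name
spec:
  selector:
    component: broker
  ports:
    - name: pulsar
      port: 6650
    - name: http
      port: 8080
```

In this arrangement the proxy's broker URLs would point at the second service, while the per-broker host names returned by lookups keep resolving through the first.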

cdbartholomew • May 12 '22 15:05