pulsar-helm-chart
DNS resolution errors with Broker host names returned by Pulsar lookups
There's currently a conflict between the Pulsar k8s deployment and how Pulsar load balancing works.
When a Pulsar broker starts, it registers itself as a broker in the internal Pulsar load balancer. The load balancer might immediately assign new namespace bundles to the broker, and the topics in those bundles might immediately get requests.
The conflict is that, with the current settings, DNS resolution for the broker's host name fails until the broker's readiness probe succeeds.
Pulsar might already return the hostname of a specific broker to a client, but the client cannot resolve the DNS name because the broker's readiness probe hasn't passed yet. This causes extra delays and also errors when connecting to topics after a load balancing event. Pulsar clients usually back off and retry. For Admin API HTTP requests, clients might not handle errors properly; for example, Pulsar Proxy will fail the request when there's a DNS lookup issue.
solution:
Broker statefulset's service should use publishNotReadyAddresses: true
There's useful information about StatefulSets and the publishNotReadyAddresses setting in https://github.com/k8ssandra/cass-operator/pull/18
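As a rough sketch of what this looks like on the broker's headless service: the service name, the selector label and the port names below are placeholders for illustration, not the chart's actual values. The ports are the default Pulsar broker ports.

```yaml
# Hypothetical headless service backing the broker StatefulSet.
apiVersion: v1
kind: Service
metadata:
  name: pulsar-broker          # placeholder; must match the StatefulSet's serviceName
spec:
  clusterIP: None              # headless: each pod gets its own DNS record
  # Publish DNS records for broker pods even before their readiness probe passes,
  # so hostnames returned by Pulsar lookups can be resolved right away.
  publishNotReadyAddresses: true
  selector:
    component: broker          # placeholder label
  ports:
    - name: http
      port: 8080
    - name: pulsar
      port: 6650
```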
There's an alternative solution in #198 which is fine for cases where TLS is disabled for brokers. Stable hostnames are required when using TLS to be able to do hostname verification for the certificates.
I experimented with adding a new service and making the broker sts use that service: https://github.com/datastax/pulsar-helm-chart/commit/259341cedea8c905544b44019d7d13f30508f365
The problem is that it's not possible to change the serviceName for a STS:
Error: UPGRADE FAILED: cannot patch "pulsar-testenv-pulsar-broker" with kind StatefulSet: StatefulSet.apps "pulsar-testenv-pulsar-broker" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden
We would like to have 2 services for the broker STS:
- 1 service that uses publishNotReadyAddresses: true
- another service that doesn't use publishNotReadyAddresses: true. This would be used to route traffic hitting the service only to brokers that pass the readiness probe.
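The second, ready-only service from the list above could be sketched as follows. The name pulsar-broker-ready and the selector label are made up for illustration. Since publishNotReadyAddresses defaults to false, leaving it out means only pods that pass the readiness probe are added to the service's endpoints.

```yaml
# Hypothetical ready-only service for traffic that should reach healthy brokers only.
apiVersion: v1
kind: Service
metadata:
  name: pulsar-broker-ready    # placeholder name
spec:
  type: ClusterIP
  # publishNotReadyAddresses is omitted (default: false), so brokers that have
  # not passed their readiness probe receive no traffic through this service.
  selector:
    component: broker          # placeholder label; same pods as the headless service
  ports:
    - name: http
      port: 8080
    - name: pulsar
      port: 6650
```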
It doesn't seem to be possible to keep backwards compatibility for existing deployments with the above requirements.
@lhotari To support the upgrade path, can you switch the purpose of the services? So that you don't have to modify the StatefulSet, use the existing name for the service that does use the publishNotReadyAddresses: true setting, and add a new service that doesn't. The proxy should point to the service that only routes traffic if the broker is ready, so that the proxy doesn't send traffic to a broker that can't handle it.
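For illustration only, pointing the proxy at a ready-only service might look like the fragment below. brokerServiceURL and brokerWebServiceURL are the Pulsar Proxy settings for the broker endpoints; the ConfigMap name and the service name pulsar-broker-ready are assumptions, and how these values end up in proxy.conf depends on the chart.

```yaml
# Hypothetical proxy configuration; names are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: pulsar-proxy
data:
  # Lookups and admin requests go through the service that only
  # includes brokers that have passed their readiness probe.
  brokerServiceURL: pulsar://pulsar-broker-ready:6650
  brokerWebServiceURL: http://pulsar-broker-ready:8080
```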