
[Bug] Brokers spuriously error in response to the PRODUCER command for partitions newly added to a topic

Open dwang-qm opened this issue 1 year ago • 9 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Read release policy

  • [X] I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

Broker: Pulsar 3.1.2 Client: Pulsar v3.4.2

Using Zookeeper as the metadata store.

Minimal reproduce step

  1. Connect to a partitioned topic with the C++ client
  2. Update the topic to add a new partition.
  3. The C++ client will attempt to create producers for the new partitions. These will often fail.

What did you expect to see?

Producers successfully created.

What did you see instead?

In the broker logs, the "Illegal topic partition name" error message.

Anything else?

I believe the issue is that when the broker responds to a PRODUCER command, it calls ServerCnx::handleProducer, which calls BrokerService::getOrCreateTopic, which calls BrokerService::getTopic, which calls BrokerService::fetchPartitionedTopicMetadataAsync(TopicName topicName), which calls BrokerService::fetchPartitionedTopicMetadataAsync(TopicName topicName, boolean refreshCacheAndGet) with refreshCacheAndGet set to false. As a result, NamespaceResources::getPartitionedTopicMetadataAsync is always called with refresh set to false, so getAsync is called on NamespaceResources rather than refreshAndGetAsync, which means no ZooKeeper sync is issued before the read.
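
To make the tail of that chain concrete, here is a condensed paraphrase of the two overloads as I read them (names and signatures are simplified and may not match the real Pulsar source exactly; this is a sketch, not the actual implementation):

```java
import java.util.concurrent.CompletableFuture;

// Paraphrase of the tail of the call chain described above
// (ServerCnx::handleProducer -> getOrCreateTopic -> getTopic -> the overloads below);
// names and signatures are simplified and may not match the real Pulsar source exactly.
class ProducerMetadataPathSketch {

    // One-argument overload used on the PRODUCER path: hard-codes refreshCacheAndGet = false.
    CompletableFuture<PartitionedTopicMetadata> fetchPartitionedTopicMetadataAsync(TopicName topicName) {
        return fetchPartitionedTopicMetadataAsync(topicName, /* refreshCacheAndGet */ false);
    }

    CompletableFuture<PartitionedTopicMetadata> fetchPartitionedTopicMetadataAsync(
            TopicName topicName, boolean refreshCacheAndGet) {
        // With refreshCacheAndGet = false, NamespaceResources does a plain cached
        // getAsync() rather than refreshAndGetAsync(), i.e. no ZooKeeper sync()
        // is issued before the read.
        return namespaceResources.getPartitionedTopicMetadataAsync(topicName, refreshCacheAndGet);
    }

    // Placeholders so the sketch stands alone.
    interface TopicName {}
    interface PartitionedTopicMetadata {}
    interface NamespaceResources {
        CompletableFuture<PartitionedTopicMetadata> getPartitionedTopicMetadataAsync(TopicName tn, boolean refresh);
    }
    private NamespaceResources namespaceResources;
}
```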

According to the ZooKeeper Programmer's Guide (https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html), "ZooKeeper does not guarantee that at every instance in time, two different clients will have identical views of ZooKeeper data. Due to factors like network delays, one client may perform an update before another client gets notified of the change. Consider the scenario of two clients, A and B. If client A sets the value of a znode /a from 0 to 1, then tells client B to read /a, client B may read the old value of 0, depending on which server it is connected to. If it is important that Client A and Client B read the same value, Client B should call the sync() method from the ZooKeeper API method before it performs its read."

This seems to indicate that without doing the sync, the broker could get an out-of-date picture of the number of partitions the topic has, causing it to spuriously error. I believe that reading the partition metadata while handling the PRODUCER command should use authoritative reads (calling sync before performing the ZooKeeper reads).
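
For illustration, this is the kind of "authoritative read" I mean, using the plain ZooKeeper client API (a minimal sketch of the sync-before-read pattern from the Programmer's Guide; Pulsar would go through its metadata-store abstraction instead, so names and structure here are not Pulsar code):

```java
import java.util.concurrent.CompletableFuture;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch: force the ZooKeeper server we are connected to to catch up
// with the leader before reading, so the read reflects the latest updates.
class AuthoritativeReadSketch {
    static CompletableFuture<byte[]> syncThenRead(ZooKeeper zk, String path) {
        CompletableFuture<byte[]> result = new CompletableFuture<>();
        zk.sync(path, (syncRc, syncPath, syncCtx) -> {
            if (syncRc != KeeperException.Code.OK.intValue()) {
                result.completeExceptionally(
                        KeeperException.create(KeeperException.Code.get(syncRc), syncPath));
                return;
            }
            // The read issued after sync() completes sees everything the leader had
            // applied when the sync was processed.
            zk.getData(syncPath, false, (rc, p, ctx, data, stat) -> {
                if (rc == KeeperException.Code.OK.intValue()) {
                    result.complete(data);
                } else {
                    result.completeExceptionally(KeeperException.create(KeeperException.Code.get(rc), p));
                }
            }, null);
        }, null);
        return result;
    }
}
```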

This bug filed against pulsar-cpp-client may be related: https://github.com/apache/pulsar-client-cpp/issues/319. The pulsar-cpp-client developers seem to have mitigated it by adding retries, but the broker-side failure should still not happen.

Are you willing to submit a PR?

  • [X] I'm willing to submit a PR!

dwang-qm avatar Oct 14 '24 06:10 dwang-qm

Broker: Pulsar 3.1.2

@dwang-qm Pulsar 3.1.x is not supported. Please re-test with a supported version. That is currently Pulsar broker 3.0.7 or 3.3.2. Thanks!

lhotari avatar Oct 14 '24 06:10 lhotari

That's the version on our production environment. It would take a significant investment of hours to set up an environment to reproduce this in, since this appears to be a race condition. My observations on the cause of the bug are valid for current master, since I reviewed the Pulsar source code. Could you at least tell me if my reasoning is sound (under "Anything else?") before I invest a large amount of time reproducing it? Thanks!

dwang-qm avatar Oct 14 '24 16:10 dwang-qm

That's the version on our production environment. It would take a significant investment of hours to set up an environment to reproduce this in, since this appears to be a race condition. My observations on the cause of the bug are valid for current master, since I reviewed the Pulsar source code. Could you at least tell me if my reasoning is sound (under "Anything else?") before I invest a large amount of time reproducing it? Thanks!

Thanks, @dwang-qm. That's understandable. In any case, thanks for reporting the issue, even if it's for a version that the OSS project doesn't support.

In the OSS project, we ask reporters to at least try to test against the most recent released versions to see if the problem is already fixed; in many cases it has been fixed in a supported version. Isolating the issue is a useful contribution to the project, and once an issue is isolated, the fix is usually very easy to implement.

lhotari avatar Oct 14 '24 16:10 lhotari

I believe the issue is that when the broker responds to a PRODUCER command, it calls ServerCnx::handleProducer, which calls BrokerService::getOrCreateTopic, which calls BrokerService::getTopic, which calls BrokerService::fetchPartitionedTopicMetadataAsync(TopicName topicName), which calls BrokerService::fetchPartitionedTopicMetadataAsync(TopicName topicName, boolean refreshCacheAndGet) with refreshCacheAndGet set to false. As a result, NamespaceResources::getPartitionedTopicMetadataAsync is always called with refresh set to false, so getAsync is called on NamespaceResources rather than refreshAndGetAsync, which means no ZooKeeper sync is issued before the read.

Good observations @dwang-qm! There have been multiple issues in the past with topic metadata handling due to the undefined consistency model. Two years ago I wrote a blog post about this, and the "Metadata consistency issues from user’s point of view" part describes the state of the problems at that time. For topic creation, deletion, and partition modification, it might still be the case that operations aren't strongly consistent, as you have observed. The current workarounds are based on retries. Some improvements have been made over time in individual use cases, such as using the ZK sync support introduced in #18518. I guess solving the metadata inconsistencies should start by defining the consistency model. It's currently eventually consistent in many cases, which surprises many users since it's not well defined.

lhotari avatar Oct 14 '24 16:10 lhotari

Thanks for the response! When Pulsar 4.0 is released, we will upgrade to it and see if we can observe the issue. In the meantime, we're going to update the client, since this looks similar to the client-side issue I referenced. I did see many retries in the logs, albeit against the same broker within a short period of time, and they all failed. Unfortunately, the application code is not in control of any retries made necessary by "eventual consistency," since the client tries to create the new producer automatically after it periodically polls and detects that new partitions have been added.

If you don't mind answering: I noticed that the reason the client was trying to talk to that broker to create the producer in the first place is that it got the broker (in my case pulsar-broker-35) from a LOOKUP it did earlier. That's confusing, because it means the client thinks that broker owns the topic (in my case named xxxx-partition-1). But wouldn't that broker have to actively claim the topic somehow, in which case it should have had the correct metadata and seen that there was now more than one partition? To clarify, xxxx was originally a partitioned topic with only one partition (so xxxx-partition-0 was the only valid topic name), but after the topic metadata update, the C++ client looked up the new partition and was told to connect to xxxx-partition-1 on pulsar-broker-35. Yet pulsar-broker-35 thinks there's only one partition in the topic. How could that be, when it seems to have claimed xxxx-partition-1?

dwang-qm avatar Oct 14 '24 22:10 dwang-qm

the first place is because it got the broker (in my case pulsar-broker-35) from a LOOKUP it did earlier. If so, that's confusing because it means it thinks that broker owns the topic (in my case named xxxx-partition-1), but wouldn't that broker have to actively claim the topic somehow, in which case it should've had the correct metadata to see that there was now more than one partition? See, xxxx is originally a partitioned topic with only one partition (so with only xxxx-partition-0 as a valid topic name), but after the topic metadata update, the C++ client looked up the new partitions and was told to connect to xxxx-partition-1 on pulsar-broker-35. But pulsar-broker-35 thinks there's only one partition in the topic. How could it when it seems to have claimed xxxx-partition-1?

The default lookup logic is in the NamespaceService.findBrokerServiceUrl and searchForCandidateBroker methods:
https://github.com/apache/pulsar/blob/9012422bcbaac7b38820ce545cd5a3b4f8b586d0/pulsar-broker/src/main/java/org/apache/pulsar/broker/namespace/NamespaceService.java#L473-L539
https://github.com/apache/pulsar/blob/9012422bcbaac7b38820ce545cd5a3b4f8b586d0/pulsar-broker/src/main/java/org/apache/pulsar/broker/namespace/NamespaceService.java#L573-L712
When the client makes a lookup request (CommandLookupTopic in the proto) with the authoritative flag set, the target broker will attempt to claim ownership of the topic without doing further lookups. That's why the solution tolerates eventual consistency. Each lookup can get redirected multiple times until it hits the target broker. The target broker also processes the lookup request, and when the request is authoritative, it will attempt to claim ownership. The partitioned topic metadata is a completely separate concern in Pulsar. The client needs that information for subscribing to multi-partitioned topics. The broker itself doesn't need that information to serve topics, since an individual partition of a multi-partitioned topic behaves completely independently. I hope this explanation makes sense.
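
To put the redirect flow in rough pseudocode, from the client's perspective it looks roughly like this (a hand-waving sketch, not the actual client implementation; all names here are illustrative):

```java
// Hand-waving sketch of the client-side lookup loop, not actual client code.
// The response types mirror CommandLookupTopicResponse in the binary protocol.
enum LookupType { Connect, Redirect, Failed }

record LookupResult(LookupType type, String brokerServiceUrl, boolean authoritative) {}

class LookupSketch {
    // Sends CommandLookupTopic(topic, authoritative) to brokerUrl and parses the response.
    LookupResult lookup(String brokerUrl, String topic, boolean authoritative) {
        throw new UnsupportedOperationException("illustrative only");
    }

    String resolveOwner(String seedBroker, String topic) {
        String broker = seedBroker;
        boolean authoritative = false;
        while (true) {
            LookupResult r = lookup(broker, topic, authoritative);
            switch (r.type()) {
                case Connect:
                    // This broker owns (or has just claimed) the topic; connect here.
                    return r.brokerServiceUrl();
                case Redirect:
                    // Repeat the lookup against the indicated broker; when the redirect is
                    // authoritative, that broker will try to claim ownership itself.
                    broker = r.brokerServiceUrl();
                    authoritative = r.authoritative();
                    break;
                default:
                    throw new IllegalStateException("Lookup failed for " + topic);
            }
        }
    }
}
```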

lhotari avatar Oct 15 '24 14:10 lhotari

Thank you for the explanation and source code snippets! After reviewing them, I think I understand.

Suppose there's a multi-partition topic called xxxx, originally with only one partition (so only xxxx-partition-0 exists). When an admin adds another partition, the topic metadata gets updated. The Pulsar client rechecks the topic metadata on an interval, and when it discovers the new partition, it tries to create a producer to connect to it (xxxx-partition-1). It first needs to perform a lookup of the broker service URL, so it contacts a broker it is already connected to, say pulsar-broker-0, which eventually calls findBrokerServiceUrl. Via searchForCandidateBroker, it finds that pulsar-broker-35 is actually the best-suited broker. That's not pulsar-broker-0, so it returns a redirect response to the client. The Pulsar client then contacts pulsar-broker-35 and repeats the lookup. pulsar-broker-35 eventually calls findBrokerServiceUrl, finds itself the best candidate with searchForCandidateBroker, and takes ownership of the topic. All good so far.

Then the client sends the PRODUCER command to pulsar-broker-35, which does have ownership of the topic (something it checks near the top of BrokerService::loadOrCreatePersistentTopic). However, before it gets there, it performs that fetchPartitionedTopicMetadataAsync call and then checks whether topicName.getPartitionIndex() < metadata.partitions. As you say, they're separate systems, and pulsar-broker-35 may in fact have acquired ownership of the topic without having an updated view of the partitioned topic metadata from ZooKeeper. Indeed, findBrokerServiceUrl would not even be aware that xxxx-partition-1 belongs to a partitioned topic and would never have had to check the partitioned topic metadata of xxxx to acquire ownership of xxxx-partition-1.
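
In other words, the failing check is roughly the following, with the partition count coming from a possibly stale cached read (my paraphrase of the shape of the check, not the actual BrokerService code):

```java
import java.util.concurrent.CompletableFuture;

// Simplified paraphrase of the validation I believe rejects the PRODUCER command;
// not the actual BrokerService code, just the shape of the check.
class PartitionCheckSketch {
    CompletableFuture<Void> validatePartitionedMetadata(TopicName topicName) {
        return fetchPartitionedTopicMetadataAsync(topicName.getPartitionedTopicName())
                .thenAccept(metadata -> {
                    // metadata.partitions comes from the cached (non-synced) read, so it can
                    // still say 1 while the client is asking for xxxx-partition-1.
                    if (topicName.getPartitionIndex() >= metadata.partitions) {
                        throw new IllegalArgumentException("Illegal topic partition name " + topicName);
                    }
                });
    }

    // Placeholders so the sketch stands alone.
    interface TopicName {
        int getPartitionIndex();
        String getPartitionedTopicName();
    }
    static class PartitionedTopicMetadata { int partitions; }
    CompletableFuture<PartitionedTopicMetadata> fetchPartitionedTopicMetadataAsync(String partitionedTopic) {
        return CompletableFuture.completedFuture(new PartitionedTopicMetadata());
    }
}
```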

Do you think this could happen? I think just changing

https://github.com/apache/pulsar/blob/9d2606d73b94b4a5e1b2ffcb1e9a3c25cf71edd4/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java#L1091

to return fetchPartitionedTopicMetadataAsync(topicNameEntity, true) would solve the issue. Do you agree?
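
To spell out the change I mean (paraphrased context; the real method body in BrokerService may differ):

```java
import java.util.concurrent.CompletableFuture;

// Sketch of the proposed change, paraphrased; the real overload lives in BrokerService.
class ProposedFixSketch {
    CompletableFuture<PartitionedTopicMetadata> fetchPartitionedTopicMetadataAsync(TopicName topicNameEntity) {
        // before: return fetchPartitionedTopicMetadataAsync(topicNameEntity, false);
        return fetchPartitionedTopicMetadataAsync(topicNameEntity, true); // refresh (sync) before reading
    }

    CompletableFuture<PartitionedTopicMetadata> fetchPartitionedTopicMetadataAsync(TopicName tn, boolean refreshCacheAndGet) {
        return CompletableFuture.completedFuture(null); // placeholder for the two-argument overload
    }

    interface TopicName {}
    interface PartitionedTopicMetadata {}
}
```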

The issue with just retrying is that I have observed retries against the same broker (necessarily, since it owns the topic) over the course of seconds, and they still failed because the broker remained unsynced with ZooKeeper. Of course, it eventually succeeds, but seconds of delay cause an extreme build-up in our ingestion pipeline and things start failing.

Thank you again for your continued attention and responses!

dwang-qm avatar Oct 16 '24 19:10 dwang-qm

to return fetchPartitionedTopicMetadataAsync(topicNameEntity, true) would solve the issue. Do you agree?

@dwang-qm exactly. Well done! The main concern with this change is the chance of performance regressions; that's why it needs more thought. One possibility would be a policy-level setting that could be configured at the broker, namespace, or topic level. That would avoid performance regressions, since users who face these issues or have such use cases could switch to using sync where it's necessary.
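
For example, something like this at the broker level (purely illustrative; the setting name and wiring are hypothetical and do not exist in Pulsar today):

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical broker-level setting; the name and wiring are made up for
// illustration and are not an existing Pulsar configuration option.
class MetadataReadPolicySketch {
    // e.g. a ServiceConfiguration-style flag, default false to preserve current behaviour
    boolean refreshPartitionedTopicMetadataOnProducerCreate = false;

    CompletableFuture<PartitionedTopicMetadata> readPartitionedMetadata(TopicName topicName) {
        // Pay the sync() cost only when the operator has opted in, so deployments
        // that don't need strong reads on this path see no performance regression.
        return fetchPartitionedTopicMetadataAsync(topicName, refreshPartitionedTopicMetadataOnProducerCreate);
    }

    // Placeholders so the sketch stands alone.
    interface TopicName {}
    interface PartitionedTopicMetadata {}
    CompletableFuture<PartitionedTopicMetadata> fetchPartitionedTopicMetadataAsync(TopicName tn, boolean refresh) {
        return CompletableFuture.completedFuture(null);
    }
}
```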

lhotari avatar Oct 16 '24 19:10 lhotari

Minimal reproduce step

  1. Connect to a partitioned topic with the C++ client
  2. Update the topic to add a new partition.
  3. The C++ client will attempt to create producers for the new partitions. These will often fail.

What did you expect to see?

Producers successfully created.

What did you see instead?

In the broker logs, the "Illegal topic partition name" error message.

@dwang-qm one more clarification: is the main negative impact the broker logs? Does the creation of the producer from the client perspective succeed?

lhotari avatar Oct 17 '24 03:10 lhotari

We have also encountered this issue on pulsar-3.0.6. Using the C++ producer, after updating the topic partitions, some producers keep throwing AlreadyClosed errors. It only becomes OK once we restart the C++ client.

I tried to reproduce this in a test cluster, but it seems hard to reproduce.

Besides, before upgrading to 3.0.6 we used 2.9.4, and this issue never occurred there.

Supplement: I guess the inconsistent partition metadata issue has always existed, but in pulsar-2.9.4 it would not surface as an error. Since pulsar-3.0, the code has been modified and the issue is exposed. We can see the relevant code block in getTopic() in the screenshots below.

pulsar-2.9.4: (screenshot of the getTopic() code block)

master branch: (screenshot of the getTopic() code block)

TakaHiR07 avatar Dec 31 '24 03:12 TakaHiR07

@dwang-qm one more clarification: is the main negative impact the broker logs? Does the creation of the producer from the client perspective succeed?

Apologies, I didn't see this question until today. Alas, the creation of the producer from the client perspective did not succeed. Our logs indicated that our code kept retrying continually but was not able to reconnect for more than a minute. I believe the C++ client keeps choosing the same broker to try to create the producer on, and it's unclear when that broker will update its cached idea of how many partitions the topic has (if ever).

dwang-qm avatar Dec 31 '24 05:12 dwang-qm

@dwang-qm Have you tested whether this issue is fixed after upgrading to a client that includes the fix discussed in https://github.com/apache/pulsar-client-cpp/issues/319?

TakaHiR07 avatar Jan 06 '25 07:01 TakaHiR07