[fix] retry producer creation upon error after successful topic lookup
Fixes #1138
Motivation
The newPartitionProducer() function should retry grabCnx(), similar to the grabCnx() retry logic in reconnectToBroker().
The Java producer already has this retry logic.
During producer creation, after a successful topic lookup in grabCnx() in producer_partition.go, if a network issue occurs before the command to create the producer is sent, grabCnx() exits without retrying.
This implementation follows the same retry logic as reconnectToBroker().
We had frequent failures during initial producer creation under unstable network conditions.
The problem is tricky to reproduce, but we observe it more frequently during the initialization stage of pods on Azure. After implementing the grabCnx() retry in newPartitionProducer(), the problem went away. The error usually shows a connection closed (EOF) by the other side. Based on the logs on the Pulsar side, however, the connection was not closed by the broker (or Pulsar). The cause may be network issues between the producer pod and the Pulsar cluster, which is why a grabCnx() retry is needed.
System configuration
Pulsar version: 2.10
Modifications
In the newPartitionProducer() function, add a retry of grabCnx() that uses the same retry logic as the grabCnx() retry in reconnectToBroker().
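The retry pattern described above can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: `retryWithBackoff`, `errTransient`, and the stand-in `grabCnx` closure are hypothetical names, and the backoff parameters are assumptions rather than the values the client uses.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a hypothetical stand-in for a network error such as
// a connection closed (EOF) by the other side.
var errTransient = errors.New("connection closed (EOF)")

// retryWithBackoff calls op until it succeeds or until maxRetries retries
// have been used. The delay doubles after each failure and is capped at max.
// It returns the number of attempts made and the last error, if any.
func retryWithBackoff(op func() error, maxRetries int, initial, max time.Duration) (int, error) {
	delay := initial
	for attempt := 1; ; attempt++ {
		err := op()
		if err == nil {
			return attempt, nil
		}
		if attempt > maxRetries {
			return attempt, fmt.Errorf("giving up after %d attempts: %w", attempt, err)
		}
		time.Sleep(delay)
		delay *= 2
		if delay > max {
			delay = max
		}
	}
}

func main() {
	calls := 0
	// Hypothetical stand-in for grabCnx(): fails twice, then succeeds.
	grabCnx := func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}

	attempts, err := retryWithBackoff(grabCnx, 5, time.Millisecond, 10*time.Millisecond)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Bounding the number of retries keeps producer creation from blocking forever when the broker is truly unreachable, while the growing delay avoids hammering the network during a short outage.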
Verifying this change
- [x] Make sure that the change passes the CI checks.
This change is already covered by existing tests, such as (please describe tests).
Does this pull request potentially affect one of the following parts:
If yes was chosen, please highlight the changes
- Dependencies (does it add or upgrade a dependency): (no)
- The public API: (no)
- The schema: (no)
- The default values of configurations: (no)
- The wire protocol: (no)
Documentation
- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented? (not applicable)
Great work @zzzming! I'll review again after you reply to the question.
@nodece I fixed it based on your review comments. CI does not seem to run. Does it require any approval to run CI?
CI triggered
Ping @zzzming