pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[BUG] consumer or producer will create failed frequently when build PulsarClient with many unavailable broker nodes

Open AuroraTwinkle opened this issue 1 year ago • 5 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Motivation

Reference Links: https://github.com/apache/pulsar/discussions/22933

Solution

In the implementation of the ServiceNameResolver interface, we can perform periodic node health check and corresponding removal or recovery operations to avoid selecting unhealthy nodes when resolvingHost!

Alternatives

No response

Anything else?

No response

Are you willing to submit a PR?

  • [X] I'm willing to submit a PR!

AuroraTwinkle avatar Jun 18 '24 06:06 AuroraTwinkle

@AuroraTwinkle This issue doesn't describe any problem, only a solution. Please add more context about what solution you are trying to solve.

lhotari avatar Jun 05 '25 10:06 lhotari

In the implementation of the ServiceNameResolver interface, we can perform periodic node health check and corresponding removal or recovery operations to avoid selecting unhealthy nodes when resolvingHost!

@AuroraTwinkle What problem occurs if an unhealthy node is selected? I would assume that in the current design, the next node would be selected after a connection attempt. Why would there be a need to change this?

lhotari avatar Jun 05 '25 10:06 lhotari

@AuroraTwinkle I see the referenced discussion https://github.com/apache/pulsar/discussions/22933. That's a good problem description. I think it's a bug when the next available serviceUrl isn't attempted. That's the whole point of having multiple endpoints. However the solution in #22935 has it's issues as I've commented on the PR. The general approach in the client should be that the connection is attempted and if it fails, it will pick the next endpoint.

I took a look at the code and it seems that client.getCnxPool().getConnection(socketAddress) calls should be using client.getCnxPool().getConnection(serviceNameResolver).

org.apache.pulsar.client.impl.ConnectionPool would need a new getConnection method that accepts the org.apache.pulsar.client.impl.ServiceNameResolver instance. The ServiceNameResolver interface will need some changes. When there's a failure for one address, there should be a method that can be called by ConnectionPool to provide feedback that the connection failed for that address. This way the ServiceNameResolver can remove it from the main pool of addresses for a configurable period of time. If all main pool addresses are offline, the resolver implementation could start getting random addresses from the offline pool. There would also be a need to have a success feedback method so that ConnectionPool could call it each time a connection was successful. Would you like to implement such solution instead of #22935?

lhotari avatar Jun 05 '25 10:06 lhotari

@AuroraTwinkle I see the referenced discussion #22933. That's a good problem description. I think it's a bug when the next available serviceUrl isn't attempted. That's the whole point of having multiple endpoints. However the solution in #22935 has it's issues as I've commented on the PR. The general approach in the client should be that the connection is attempted and if it fails, it will pick the next endpoint.

I took a look at the code and it seems that client.getCnxPool().getConnection(socketAddress) calls should be using client.getCnxPool().getConnection(serviceNameResolver).

org.apache.pulsar.client.impl.ConnectionPool would need a new getConnection method that accepts the org.apache.pulsar.client.impl.ServiceNameResolver instance. The ServiceNameResolver interface will need some changes. When there's a failure for one address, there should be a method that can be called by ConnectionPool to provide feedback that the connection failed for that address. This way the ServiceNameResolver can remove it from the main pool of addresses for a configurable period of time. If all main pool addresses are offline, the resolver implementation could start getting random addresses from the offline pool. There would also be a need to have a success feedback method so that ConnectionPool could call it each time a connection was successful. Would you like to implement such solution instead of #22935?

@lhotari Good suggestion! Removing abnormal nodes through the feedback mechanism can avoid frequent connection establishment and increase the system load. I will submit a new PR to implement it.

AuroraTwinkle avatar Jun 05 '25 14:06 AuroraTwinkle

@AuroraTwinkle I see the referenced discussion #22933. That's a good problem description. I think it's a bug when the next available serviceUrl isn't attempted. That's the whole point of having multiple endpoints. However the solution in #22935 has it's issues as I've commented on the PR. The general approach in the client should be that the connection is attempted and if it fails, it will pick the next endpoint.

I took a look at the code and it seems that client.getCnxPool().getConnection(socketAddress) calls should be using client.getCnxPool().getConnection(serviceNameResolver).

org.apache.pulsar.client.impl.ConnectionPool would need a new getConnection method that accepts the org.apache.pulsar.client.impl.ServiceNameResolver instance. The ServiceNameResolver interface will need some changes. When there's a failure for one address, there should be a method that can be called by ConnectionPool to provide feedback that the connection failed for that address. This way the ServiceNameResolver can remove it from the main pool of addresses for a configurable period of time. If all main pool addresses are offline, the resolver implementation could start getting random addresses from the offline pool. There would also be a need to have a success feedback method so that ConnectionPool could call it each time a connection was successful. Would you like to implement such solution instead of #22935?

@lhotari Hi, Lari : I have submitted a new PR to implement this solution. Would you please review it if you have time. Any suggestions are welcome. PIP-425: https://github.com/apache/pulsar/pull/24394 Implementation: https://github.com/apache/pulsar/pull/24387

AuroraTwinkle avatar Jun 07 '25 10:06 AuroraTwinkle