pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[improve][client] Add maxConnectionsPerHost and connectionMaxIdleSeconds to PulsarAdminBuilder

Open lhotari opened this issue 1 year ago • 7 comments

Fixes #22041

Motivation

See #22041 . Currently, when using the asynchronous interfaces of the Pulsar Admin client, there's no backpressure by the client itself and the client will keep on opening new connections to the broker to fulfill the in-progress requests. Eventually, the broker will hit the maxHttpServerConnections limit, which is 2048.

It's better to limit the number of connections from a single client. This PR sets the limit to 16 connections per host. The limit isn't called connectionsPerBroker since admin operations usually target a cluster address.

Modification

  • add maxConnectionsPerHost and connectionMaxIdleSeconds to PulsarAdminBuilder
  • also modify the default connectionMaxIdleSeconds from 60 seconds to 25 seconds
    • some firewalls/NATs have a timeout of 30 seconds and that's why 25 seconds is a better default - common firewall/NAT idle timeout is 60 seconds and since the check isn't absolute, a better default is 25 seconds to ensure that connections don't die because of firewall/NAT timeouts

Documentation

  • [x] doc
  • [ ] doc-required
  • [ ] doc-not-needed
  • [ ] doc-complete

lhotari avatar Apr 19 '24 13:04 lhotari

the setMaxConnectionsPerHost in async http client doesn't seem to behave as expected. Will check the errors

  Caused by: java.util.concurrent.CompletionException: org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector$RetryException: Could not complete the operation. Number of retries has been exhausted. Failed reason: Too many connections: 16
  	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:332)
  	at java.base/java.util.concurrent.CompletableFuture.uniApplyNow(CompletableFuture.java:674)
  	at java.base/java.util.concurrent.CompletableFuture.orApplyStage(CompletableFuture.java:1601)
  	at java.base/java.util.concurrent.CompletableFuture.applyToEither(CompletableFuture.java:2261)
  	at org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector.retryOrTimeOut(AsyncHttpConnector.java:275)
  	at org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector.apply(AsyncHttpConnector.java:234)

UPDATE: it's necessary to set acquireFreeChannelTimeout setting in AHC. Will find a way to set a proper default and have it configurable.

lhotari avatar Apr 22 '24 17:04 lhotari

There's a problem with backpressure handling with async requests in the Pulsar code base. Since this PR limits the Pulsar Admin client to 16 connections per host, it now shows up problems.

The namespace unloading is a good example: https://github.com/apache/pulsar/blob/d7d54522933b63f6a74ec7139c6dedebe8ad9149/pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/impl/NamespacesBase.java#L829-L841

All bundles in the namespace are unloaded at once without limiting concurrency. There was a dev mailing list discussion about backpressure and Pulsar Admin API implementation in https://lists.apache.org/thread/03w6x9zsgx11mqcp5m4k4n27cyqmp271 . However we didn't come across resolving the problem.

a snippet from my email in that thread:

Touching upon Pulsar Admin REST API. The context for back pressure is a lot different in the Pulsar Admin REST API. Before the PIP-149 async changes, there was explicit backpressure in the REST API implementation. The system could handle a limited amount of work and it would process downstream work items one-by-one.

With "PIP-149 Making the REST Admin API fully async" (https://github.com/apache/pulsar/issues/14365), there are different challenges related to backpressure. It is usually about how to limit the in-progress work in the system. An async system will accept a lot of work compared to the previous solution and this accepted work will get processed in the async REST API backend eventually even when the clients have already closed the connection and sent a new retry. One possible solution to this issue is to limit incoming requests at the HTTP server level with features that Jetty provides for limiting concurrency. PRs https://github.com/apache/pulsar/pull/14353 and https://github.com/apache/pulsar/pull/15637 added this support to Pulsar. The values might have to be tuned to a lot lower values to prevent issues in practice. This is not a complete solution for REST API backend. It would be useful to also have a solution that would cancel down stream requests that are for incoming HTTP requests that no longer exist since the client stopped waiting for the response. The main down stream requests are towards the metadata store. It might also be necessary to limit the number of outstanding downstream requests. With batching in metadata store, that might not be an issue.

The solution for the namespace unloading issue is to have a way to limit the outstanding CompletableFutures that are in progress and use that as a way to "backpressure" the sending of new requests. The current solution of sending out all requests and then waiting for the results is a problematic solution since it doesn't use any sort of feedback from the system to adjust the speed. In other words, there's currently no proper backpressure solution for async Pulsar Admin calls within Pulsar broker.

I'll experiment with some ways to add backpressure to cases where a large amount of async calls are triggered and then results are waited.

lhotari avatar Apr 23 '24 06:04 lhotari

Another location without proper backpressure is namespace deletion: https://github.com/apache/pulsar/blob/d7d54522933b63f6a74ec7139c6dedebe8ad9149/pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/impl/NamespacesBase.java#L282-L293

lhotari avatar Apr 23 '24 07:04 lhotari

example of creating partitions: https://github.com/apache/pulsar/blob/50121e7f7be541f45bb6dc976f51e30658b1cb8d/pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/AdminResource.java#L162-L171

This would need backpressure too. Let's say if you create a 100 partition topic, the broker might open 100 HTTP connections to create the topic partitions concurrently. This is problematic when the brokers are under heavy load.

lhotari avatar Apr 23 '24 07:04 lhotari

Looks like https://github.com/spotify/completable-futures/blob/master/src/main/java/com/spotify/futures/ConcurrencyReducer.java could be a useful solution to leverage.

lhotari avatar Apr 23 '24 08:04 lhotari

Noticed that there's a solution to run 1-by-1 using https://github.com/apache/pulsar/blob/ed599673c7e60ab5bb02e1fb0615a7ff8e5d6430/pulsar-common/src/main/java/org/apache/pulsar/common/util/FutureUtil.java#L179-L210 . However, I think that ConcurrencyReducer would be a better solution for most use cases.

lhotari avatar Apr 23 '24 09:04 lhotari

Another challenge is to cancel work that is queued in the system, but not waited by any clients. Newer Jersey clients have support for this. I noticed commit https://github.com/eclipse-ee4j/jersey/commit/96028068b6379ad923cf26ab018f372f3ea040f6 in Jersey. When the system is overloaded, request processing might be very slow so that clients get timeouts and retry requests. This will add more work to the system unless there's a solution that cancels the timed out tasks. That's why addressing this is also important part of the solution.

lhotari avatar Apr 23 '24 09:04 lhotari

Codecov Report

Attention: Patch coverage is 65.67164% with 46 lines in your changes missing coverage. Please review.

Project coverage is 74.56%. Comparing base (bbc6224) to head (4a2c01f). Report is 501 commits behind head on master.

Files Patch % Lines
...client/admin/internal/http/AsyncHttpConnector.java 64.70% 17 Missing and 13 partials :warning:
...he/pulsar/client/admin/internal/FunctionsImpl.java 26.66% 11 Missing :warning:
.../client/admin/internal/PulsarAdminBuilderImpl.java 60.00% 3 Missing and 1 partial :warning:
...che/pulsar/client/admin/internal/PackagesImpl.java 92.30% 0 Missing and 1 partial :warning:
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #22541      +/-   ##
============================================
+ Coverage     73.57%   74.56%   +0.98%     
- Complexity    32624    33586     +962     
============================================
  Files          1877     1919      +42     
  Lines        139502   144229    +4727     
  Branches      15299    15774     +475     
============================================
+ Hits         102638   107542    +4904     
+ Misses        28908    28448     -460     
- Partials       7956     8239     +283     
Flag Coverage Δ
inttests 27.66% <41.04%> (+3.07%) :arrow_up:
systests 24.78% <42.53%> (+0.45%) :arrow_up:
unittests 73.90% <65.67%> (+1.05%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
.../pulsar/client/admin/internal/PulsarAdminImpl.java 78.09% <100.00%> (+0.42%) :arrow_up:
...apache/pulsar/client/admin/internal/SinksImpl.java 61.00% <100.00%> (ø)
...ache/pulsar/client/admin/internal/SourcesImpl.java 54.10% <100.00%> (ø)
.../org/apache/pulsar/client/impl/ConnectionPool.java 76.44% <ø> (+1.92%) :arrow_up:
...lsar/client/impl/conf/ClientConfigurationData.java 96.72% <100.00%> (+0.02%) :arrow_up:
...che/pulsar/client/admin/internal/PackagesImpl.java 79.06% <92.30%> (-0.72%) :arrow_down:
.../client/admin/internal/PulsarAdminBuilderImpl.java 75.78% <60.00%> (-9.35%) :arrow_down:
...he/pulsar/client/admin/internal/FunctionsImpl.java 44.92% <26.66%> (+0.74%) :arrow_up:
...client/admin/internal/http/AsyncHttpConnector.java 78.45% <64.70%> (-7.97%) :arrow_down:

... and 484 files with indirect coverage changes

codecov-commenter avatar Aug 07 '24 16:08 codecov-commenter