armeria [ISSUE-5470] Add feature to gracefully shutdown connections in HttpClientFactory

Motivation:

Add graceful shutdown feature in HttpClientFactory. The implementation has been referenced from the server side graceful shutdown.

Modifications:

Add clientConnectionDrainDurationMillis which is used to set client-side drain settings
When drainDurationMillis > 0 in HttpClientFactory, graceful shutdown is applied.

Result:

Closes #<5470>. (If this resolves the issue.)
By allowing a grace period for ongoing requests to finish before closing connections, users can ensure that their requests complete

Mar 06 '24 13:03 seonWKim

All committers have signed the CLA.

Mar 06 '24 13:03 CLAassistant

I wonder how to close the doors for new request(that client should send), when graceful shutdown is in progress.

Mar 06 '24 13:03 seonWKim

🔍 Build Scan® (commit: 6bd9f78e8eca370a7a8d04850b5f17d168e8c4f9)

Job name	Status	Build Scan®
build-windows-latest-jdk-19	✅	https://ge.armeria.dev/s/3ajyfh2oetchm
build-self-hosted-unsafe-jdk-8	✅	https://ge.armeria.dev/s/5vsozbnl7ffey
build-self-hosted-unsafe-jdk-19-snapshot-blockhound	✅	https://ge.armeria.dev/s/mvwdi4prsrhsk
build-self-hosted-unsafe-jdk-17-min-java-17-coverage	✅	https://ge.armeria.dev/s/eqqxnlrxxwsnq
build-self-hosted-unsafe-jdk-17-min-java-11	✅	https://ge.armeria.dev/s/lpztp6j6tqp6q
build-self-hosted-unsafe-jdk-17-leak	✅	https://ge.armeria.dev/s/ii5qoqcoeegsy
build-self-hosted-unsafe-jdk-11	✅	https://ge.armeria.dev/s/pdlnnkgtiahp4
build-macos-12-jdk-19	✅	https://ge.armeria.dev/s/n2aeobtodnylu

Mar 06 '24 13:03 github-actions[bot]

Codecov Report

Attention: Patch coverage is 96.22642% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 74.10%. Comparing base (b8eb810) to head (f7b91d7). Report is 208 commits behind head on main.

:exclamation: Current head f7b91d7 differs from pull request most recent head 6bd9f78

Please upload reports for the commit 6bd9f78 to get more accurate results.

Files	Patch %	Lines
.../linecorp/armeria/client/ClientFactoryBuilder.java	87.50%	0 Missing and 1 partial :warning:
...c/main/java/com/linecorp/armeria/common/Flags.java	75.00%	0 Missing and 1 partial :warning:

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #5489      +/-   ##
============================================
+ Coverage     73.95%   74.10%   +0.15%     
- Complexity    20115    21023     +908     
============================================
  Files          1730     1819      +89     
  Lines         74161    77442    +3281     
  Branches       9465     9889     +424     
============================================
+ Hits          54847    57390    +2543     
- Misses        14837    15392     +555     
- Partials       4477     4660     +183

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Mar 06 '24 14:03 codecov[bot]

I wonder how to close the doors for new request(that client should send), when graceful shutdown is in progress.

I believe we can add a boolean flag closing to the factory and raise an exception if the factory is used.

Mar 12 '24 06:03 minwoox

Hi @minwoox, please check if I've understood correctly.

I believe we can add a boolean flag closing to the factory and raise an exception if the factory is used.

Does it mean that we should add a closing flag to the ClientFactory and set it to true when draining starts? 🤔

raise an exception if the factory is used.

I'm not sure what method is being called when the ClientFactory starts to be used(Is it WebClient's execute(...) method?). Need some help 😄

Mar 22 '24 12:03 seonWKim

Does it mean that we should add a closing flag to the ClientFactory and set it to true when draining starts? 🤔

Oops, I realized that we don't have to add a flag. We can just use isClosing() method. https://github.com/line/armeria/blob/c2519e47fd691be2f8f2cbf38f31c1ce49363eb6/core/src/main/java/com/linecorp/armeria/client/DefaultClientFactory.java#L223

I'm not sure what method is being called when the ClientFactory starts to be used

I believe that we can add the logic to DefaultClientFactory.acquireEventLoop,DefaultClientFactory.newClient and HttpClientFactory.pool() methods.

HttpChannelPool pool(EventLoop eventLoop) {
    if (isClosing()) {
        throw new IllegalStateException("...");
    }
    ...

Please ensure that the exception is propagated to the callers when thread rescheduling is involved.

Mar 22 '24 14:03 minwoox

@minwoox , thank you for the detailed explanation. 👍

Oops, I realized that we don't have to add a flag. We can just use isClosing() method.

I've added cheking logic which uses isClosing() on above mentioned methods.

Please ensure that the exception is propagated to the callers when thread rescheduling is involved.

I think I've not understood above statement well. Please correct me if I'm wrong 😄

Thread rescheduling can occur when DefaultClientFactory.acquireEventLoop,DefaultClientFactory.newClient and HttpClientFactory.pool() are called.
Exceptions are not propagated when rescheduling occurs(because thread changes)

So, should we wrap every method that calls above method(with CompletableFutures or whatever that can propagate exceptions even though thread changes) like below and handle exceptions?

CompletableFuture.supplyAsync(() -> methodsToConsiderThreadRescheduling(...))
        .completeExceptionally(ex- > handleException(ex));

Mar 24 '24 08:03 seonWKim

So, should we wrap every method that calls above method

Apologize for the unclear comment. 😉 I've rather imagined something like this:

private void acquireConnectionAndExecute0(ClientRequestContext ctx, Endpoint endpoint,
                                          HttpRequest req, DecodedHttpResponse res,
                                          ClientConnectionTimingsBuilder timingsBuilder,
                                          ProxyConfig proxyConfig) {
    final PoolKey key = new PoolKey(endpoint, proxyConfig);
    final HttpChannelPool pool;
    try {
        pool = factory.pool(ctx.eventLoop().withoutContext());
    } catch (Throwable t) {
        final UnprocessedRequestException wrapped = UnprocessedRequestException.of(t);
        handleEarlyRequestException(ctx, req, wrapped);
        res.close(wrapped);
        return;
    }
    ...

Mar 25 '24 08:03 minwoox

@minwoox, I always appreciate the clear and concise explanation from you 🙏. I've applied the exception cheking on above mentioned methods.

Mar 30 '24 09:03 seonWKim

@minwoox I've added tests to check whether connection draining is working as expected.

Apr 05 '24 08:04 seonWKim

I'm curious, what kind of test failures did you get?

Below three tests fails when DEFAULT_CLIENT_CONNECTION_DRAIN_DURATION_MICROS is set other than 0. 😓

Apr 08 '24 10:04 seonWKim

As users can enable the feature anytime, tests should be performed when the value is enabled and disabled. Let me check why it fails. 🤔

Apr 08 '24 10:04 ikhoon

@ikhoon It seems like client side Http2GracefulConnectionShutdownHandler is sending GOAWAY frames twice

first GO_AWAY frame to signal clients that shutdown is imminent
second GO_AWAY frame to close the channel

The tests in Http2GoAwayTest seems to pass when I add logic to read the GO_AWAY frame twice like below

// Read a GOAWAY frame.
assertThat(readFrame(in).getByte(3)).isEqualTo(Http2FrameTypes.GO_AWAY);
assertThat(readFrame(in).getByte(3)).isEqualTo(Http2FrameTypes.GO_AWAY);

I think we should consider the client is sending GO_AWAY frame twice for graceful shutdown. Maybe we can check the values associated with the GO_AWAY frame. If the value of GO_AWAY frame is equal to Integer.MAX_VALUE, then it means indicates imminent shutdown and there will be a following GO_AWAY frame. And when the value is set tot the last stream ID, we can refer that as it's the last GO_AWAY frame.

Apr 08 '24 14:04 seonWKim

first GO_AWAY frame to signal clients that shutdown is imminent

I don't see we need to send GO_AWAY with Integer.MAX_VALUE for clients until we support server push. Integer.MAX_VALUE indicates that the remote peer should not open new streams. If server push is not supported, it would be a waste of traffic.

Let me discuss double GO_AWAY frames on the client side with other maintainers.

Apr 09 '24 05:04 ikhoon

@seonwoo960000 Had a chat with the team and it was not a good idea for sending double GOAWAY. 🙏 Instead, we can just use Netty's graceful shutdown timeout. I pushed a commit that:

Rename connectionDrainDurationMircors to http2GracefulShutdownTimeoutMillis
Set the value to clientConnectionHandler.gracefulShutdownTimeoutMillis() PTAL. 🙏

Apr 17 '24 03:04 minwoox

@seonwoo960000 Could you also review the changes as well? 😉

Apr 18 '24 09:04 minwoox

@seonWKim 👍 👍 👍

May 20 '24 08:05 minwoox