armeria icon indicating copy to clipboard operation
armeria copied to clipboard

[ISSUE-5470] Add feature to gracefully shutdown connections in HttpClientFactory

Open seonWKim opened this issue 1 year ago โ€ข 17 comments

Motivation:

Add graceful shutdown feature in HttpClientFactory. The implementation has been referenced from the server side graceful shutdown.

Modifications:

  • Add clientConnectionDrainDurationMillis which is used to set client-side drain settings
  • When drainDurationMillis > 0 in HttpClientFactory, graceful shutdown is applied.

Result:

  • Closes #<5470>. (If this resolves the issue.)
  • By allowing a grace period for ongoing requests to finish before closing connections, users can ensure that their requests complete

seonWKim avatar Mar 06 '24 13:03 seonWKim

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Mar 06 '24 13:03 CLAassistant

I wonder how to close the doors for new request(that client should send), when graceful shutdown is in progress.

seonWKim avatar Mar 06 '24 13:03 seonWKim

๐Ÿ” Build Scanยฎ (commit: 6bd9f78e8eca370a7a8d04850b5f17d168e8c4f9)

Job name Status Build Scanยฎ
build-windows-latest-jdk-19 โœ… https://ge.armeria.dev/s/3ajyfh2oetchm
build-self-hosted-unsafe-jdk-8 โœ… https://ge.armeria.dev/s/5vsozbnl7ffey
build-self-hosted-unsafe-jdk-19-snapshot-blockhound โœ… https://ge.armeria.dev/s/mvwdi4prsrhsk
build-self-hosted-unsafe-jdk-17-min-java-17-coverage โœ… https://ge.armeria.dev/s/eqqxnlrxxwsnq
build-self-hosted-unsafe-jdk-17-min-java-11 โœ… https://ge.armeria.dev/s/lpztp6j6tqp6q
build-self-hosted-unsafe-jdk-17-leak โœ… https://ge.armeria.dev/s/ii5qoqcoeegsy
build-self-hosted-unsafe-jdk-11 โœ… https://ge.armeria.dev/s/pdlnnkgtiahp4
build-macos-12-jdk-19 โœ… https://ge.armeria.dev/s/n2aeobtodnylu

github-actions[bot] avatar Mar 06 '24 13:03 github-actions[bot]

Codecov Report

Attention: Patch coverage is 96.22642% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 74.10%. Comparing base (b8eb810) to head (f7b91d7). Report is 208 commits behind head on main.

:exclamation: Current head f7b91d7 differs from pull request most recent head 6bd9f78

Please upload reports for the commit 6bd9f78 to get more accurate results.

Files Patch % Lines
.../linecorp/armeria/client/ClientFactoryBuilder.java 87.50% 0 Missing and 1 partial :warning:
...c/main/java/com/linecorp/armeria/common/Flags.java 75.00% 0 Missing and 1 partial :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5489      +/-   ##
============================================
+ Coverage     73.95%   74.10%   +0.15%     
- Complexity    20115    21023     +908     
============================================
  Files          1730     1819      +89     
  Lines         74161    77442    +3281     
  Branches       9465     9889     +424     
============================================
+ Hits          54847    57390    +2543     
- Misses        14837    15392     +555     
- Partials       4477     4660     +183     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Mar 06 '24 14:03 codecov[bot]

I wonder how to close the doors for new request(that client should send), when graceful shutdown is in progress.

I believe we can add a boolean flag closing to the factory and raise an exception if the factory is used.

minwoox avatar Mar 12 '24 06:03 minwoox

Hi @minwoox, please check if I've understood correctly.

I believe we can add a boolean flag closing to the factory and raise an exception if the factory is used.

Does it mean that we should add a closing flag to the ClientFactory and set it to true when draining starts? ๐Ÿค”

raise an exception if the factory is used.

I'm not sure what method is being called when the ClientFactory starts to be used(Is it WebClient's execute(...) method?). Need some help ๐Ÿ˜„

seonWKim avatar Mar 22 '24 12:03 seonWKim

Does it mean that we should add a closing flag to the ClientFactory and set it to true when draining starts? ๐Ÿค”

Oops, I realized that we don't have to add a flag. We can just use isClosing() method. https://github.com/line/armeria/blob/c2519e47fd691be2f8f2cbf38f31c1ce49363eb6/core/src/main/java/com/linecorp/armeria/client/DefaultClientFactory.java#L223

I'm not sure what method is being called when the ClientFactory starts to be used

I believe that we can add the logic to DefaultClientFactory.acquireEventLoop,DefaultClientFactory.newClient and HttpClientFactory.pool() methods.

HttpChannelPool pool(EventLoop eventLoop) {
    if (isClosing()) {
        throw new IllegalStateException("...");
    }
    ...

Please ensure that the exception is propagated to the callers when thread rescheduling is involved.

minwoox avatar Mar 22 '24 14:03 minwoox

@minwoox , thank you for the detailed explanation. ๐Ÿ‘

Oops, I realized that we don't have to add a flag. We can just use isClosing() method.

I've added cheking logic which uses isClosing() on above mentioned methods.

Please ensure that the exception is propagated to the callers when thread rescheduling is involved.

I think I've not understood above statement well. Please correct me if I'm wrong ๐Ÿ˜„

  • Thread rescheduling can occur when DefaultClientFactory.acquireEventLoop,DefaultClientFactory.newClient and HttpClientFactory.pool() are called.
  • Exceptions are not propagated when rescheduling occurs(because thread changes)

So, should we wrap every method that calls above method(with CompletableFutures or whatever that can propagate exceptions even though thread changes) like below and handle exceptions?

CompletableFuture.supplyAsync(() -> methodsToConsiderThreadRescheduling(...))
        .completeExceptionally(ex- > handleException(ex)); 

seonWKim avatar Mar 24 '24 08:03 seonWKim

So, should we wrap every method that calls above method

Apologize for the unclear comment. ๐Ÿ˜‰ I've rather imagined something like this:

private void acquireConnectionAndExecute0(ClientRequestContext ctx, Endpoint endpoint,
                                          HttpRequest req, DecodedHttpResponse res,
                                          ClientConnectionTimingsBuilder timingsBuilder,
                                          ProxyConfig proxyConfig) {
    final PoolKey key = new PoolKey(endpoint, proxyConfig);
    final HttpChannelPool pool;
    try {
        pool = factory.pool(ctx.eventLoop().withoutContext());
    } catch (Throwable t) {
        final UnprocessedRequestException wrapped = UnprocessedRequestException.of(t);
        handleEarlyRequestException(ctx, req, wrapped);
        res.close(wrapped);
        return;
    }
    ...

minwoox avatar Mar 25 '24 08:03 minwoox

@minwoox, I always appreciate the clear and concise explanation from you ๐Ÿ™. I've applied the exception cheking on above mentioned methods.

seonWKim avatar Mar 30 '24 09:03 seonWKim

@minwoox I've added tests to check whether connection draining is working as expected.

seonWKim avatar Apr 05 '24 08:04 seonWKim

I'm curious, what kind of test failures did you get?

Below three tests fails when DEFAULT_CLIENT_CONNECTION_DRAIN_DURATION_MICROS is set other than 0. ๐Ÿ˜“ image

seonWKim avatar Apr 08 '24 10:04 seonWKim

As users can enable the feature anytime, tests should be performed when the value is enabled and disabled. Let me check why it fails. ๐Ÿค”

ikhoon avatar Apr 08 '24 10:04 ikhoon

@ikhoon It seems like client side Http2GracefulConnectionShutdownHandler is sending GOAWAY frames twice

  • first GO_AWAY frame to signal clients that shutdown is imminent
  • second GO_AWAY frame to close the channel

The tests in Http2GoAwayTest seems to pass when I add logic to read the GO_AWAY frame twice like below

// Read a GOAWAY frame.
assertThat(readFrame(in).getByte(3)).isEqualTo(Http2FrameTypes.GO_AWAY);
assertThat(readFrame(in).getByte(3)).isEqualTo(Http2FrameTypes.GO_AWAY);

I think we should consider the client is sending GO_AWAY frame twice for graceful shutdown. Maybe we can check the values associated with the GO_AWAY frame. If the value of GO_AWAY frame is equal to Integer.MAX_VALUE, then it means indicates imminent shutdown and there will be a following GO_AWAY frame. And when the value is set tot the last stream ID, we can refer that as it's the last GO_AWAY frame.

seonWKim avatar Apr 08 '24 14:04 seonWKim

first GO_AWAY frame to signal clients that shutdown is imminent

I don't see we need to send GO_AWAY with Integer.MAX_VALUE for clients until we support server push. Integer.MAX_VALUE indicates that the remote peer should not open new streams. If server push is not supported, it would be a waste of traffic.

Let me discuss double GO_AWAY frames on the client side with other maintainers.

ikhoon avatar Apr 09 '24 05:04 ikhoon

@seonwoo960000 Had a chat with the team and it was not a good idea for sending double GOAWAY. ๐Ÿ™ Instead, we can just use Netty's graceful shutdown timeout. I pushed a commit that:

  • Rename connectionDrainDurationMircors to http2GracefulShutdownTimeoutMillis
  • Set the value to clientConnectionHandler.gracefulShutdownTimeoutMillis() PTAL. ๐Ÿ™

minwoox avatar Apr 17 '24 03:04 minwoox

@seonwoo960000 Could you also review the changes as well? ๐Ÿ˜‰

minwoox avatar Apr 18 '24 09:04 minwoox

@seonWKim ๐Ÿ‘ ๐Ÿ‘ ๐Ÿ‘

minwoox avatar May 20 '24 08:05 minwoox