aws-sdk-java-v2

DynamoDB connection pool tied up when interrupting

Open andrewyoo opened this issue 4 years ago • 1 comments

Describe the bug

In my service, I was time-limiting a block of code that runs DynamoDB queries. After enough timeouts, I started seeing the following error: SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool. It appears that if I interrupt the DDB client, the connection stays tied up forever, and eventually no connection is available for further calls.

After several runs, I noticed that if I call future.cancel(false) (no interrupt) instead of future.cancel(true), the service remains stable, although the worker threads are then allowed to run to completion. I verified with the Java SDK client metrics that LeasedConcurrency goes up and never comes back down.
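To make the difference between the two cancel modes concrete, here is a minimal sketch that uses only the JDK. Thread.sleep stands in for a blocking DynamoDB call, and the class and method names (CancelDemo, workerInterrupted) are made up for illustration. It shows only that cancel(true) delivers an interrupt to the running worker while cancel(false) lets it finish; it does not touch the SDK or its connection pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class CancelDemo {
    /** Runs a blocking task, cancels it, and reports whether the worker thread saw an interrupt. */
    static boolean workerInterrupted(boolean mayInterruptIfRunning) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        AtomicBoolean interrupted = new AtomicBoolean(false);
        try {
            Future<?> future = executor.submit(() -> {
                started.countDown();
                try {
                    Thread.sleep(300); // stand-in for a blocking DynamoDB call (assumption)
                } catch (InterruptedException e) {
                    interrupted.set(true); // only reached when cancel(true) was used
                } finally {
                    finished.countDown();
                }
            });
            started.await();                      // make sure the task is actually running
            future.cancel(mayInterruptIfRunning); // true: interrupt the worker; false: leave it alone
            finished.await(5, TimeUnit.SECONDS);  // wait until the worker has left the task body
            return interrupted.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("cancel(true):  " + workerInterrupted(true));  // prints true
        System.out.println("cancel(false): " + workerInterrupted(false)); // prints false
    }
}
```

In both cases the Future reports itself as cancelled; the only difference is whether the thread that is still inside the call gets interrupted.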

Expected behavior

When the DDB client aborts due to an interrupt (AbortedException), the HTTP connection is released.

Current behavior

When the DDB client aborts due to an interrupt, the HTTP connection is NOT released, which eventually depletes the HTTP connection pool.
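As a rough illustration of the two behaviors, the following toy model uses a JDK Semaphore as the "connection pool", with one permit per connection. It is not the SDK's or the Apache client's code; the names (LeaseLeakDemo, call, availableAfterInterrupt) are hypothetical. The sketch only shows how a lease is lost when the release sits on the success path alone, and how releasing in a finally block survives an interrupt.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;

class LeaseLeakDemo {
    /** Leases one "connection", blocks, then releases it (either always or only on success). */
    static void call(Semaphore pool, boolean releaseInFinally, CountDownLatch leased)
            throws InterruptedException {
        pool.acquire();     // lease a connection
        leased.countDown(); // tell the caller that the lease is now held
        if (releaseInFinally) {
            try {
                Thread.sleep(5_000); // stand-in for a blocking request (assumption)
            } finally {
                pool.release();      // runs even when the sleep is interrupted
            }
        } else {
            Thread.sleep(5_000);     // an interrupt here skips the release below
            pool.release();
        }
    }

    /** Interrupts a worker while it holds the only lease; returns the permits left afterwards. */
    static int availableAfterInterrupt(boolean releaseInFinally) {
        Semaphore pool = new Semaphore(1);
        CountDownLatch leased = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                call(pool, releaseInFinally, leased);
            } catch (InterruptedException e) {
                // request aborted; roughly what surfaces as AbortedException in the SDK
            }
        });
        worker.start();
        try {
            leased.await();     // interrupt only after the lease has been taken
            worker.interrupt();
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return pool.availablePermits();
    }

    public static void main(String[] args) {
        System.out.println("release on success path only: " + availableAfterInterrupt(false)); // prints 0
        System.out.println("release in finally:           " + availableAfterInterrupt(true));  // prints 1
    }
}
```

With the leaky variant, every interrupted call permanently removes one permit, which matches the reported symptom of a leased count that only grows.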

Steps to Reproduce

Something like this:

final Future<?> future = executorService.submit(() -> {
  // some longer running ddb calls
});

// Initialized (and not final) because it is read in finally even when get() throws.
Object response = null;
try {
    response = future.get(MS_LIMIT_TO_RESPOND, TimeUnit.MILLISECONDS);
} catch (TimeoutException exception) {
    LOG.warn("Took longer than allotted {} ms to generate response.", MS_LIMIT_TO_RESPOND, exception);
} catch (InterruptedException | ExecutionException exception) {
    LOG.error("Failed to generate response", exception);
} finally {
    // cancel(true) interrupts the worker thread, which triggers the issue.
    future.cancel(true);
    LOG.info("Returning response {}", response);
}

Possible Solution

Other tickets I saw about connection pool timeouts concerned S3 and closing the object to ensure the connection is released. I think that on an SDK AbortedException, or whatever exception is raised for an interrupt, the DDB connection should be closed.

Context

I was trying to limit the execution time on my service. If it didn't complete within a time limit, it would return an empty response.

AWS Java SDK version used

2

JDK version used

1.8

Operating System and version

Amazon Linux

andrewyoo avatar Dec 22 '21 20:12 andrewyoo

My team is also encountering this problem. The investigation we have done seems to indicate that connections leased from the pool are mishandled in org.apache.http.impl.execchain.MainClientExec#execute

If the process is interrupted at the wrong time the connections will be lost. What can we do to fix this? Is it a known Apache client issue?

jocull avatar Mar 07 '22 15:03 jocull

@andrewyoo @jocull I'm sorry for losing track of this. Are you still experiencing the issue?

Are you closing the data stream after it's consumed from the query response? Issues with connections that are not being released are usually associated with the resources not being properly closed.

debora-ito avatar Mar 29 '23 18:03 debora-ito

@debora-ito I don't understand your question with regard to this ticket. In my use case, I had a DDB client (DynamoDbClient.create()) and I was interrupting a query. Because I was interrupting early, there was no response or data stream to close.

As for whether I am still experiencing it: I have been avoiding interrupting the DDB requests so that I wouldn't hit this issue, so I can't confirm whether it is still a problem.

andrewyoo avatar Mar 30 '23 00:03 andrewyoo

Our situation was the same as Andrew's - setting an interrupt on the thread running the request. We resorted to making requests with the async SDK and blocking on the results, but I would honestly prefer not to. It has been a year and we have not tried this again.
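For readers wondering what "async and blocking on the results" looks like in shape, here is a minimal JDK-only sketch. queryAsync is a hypothetical stand-in for an async SDK call (the SDK's async clients, such as DynamoDbAsyncClient, return a CompletableFuture); the class name, the delays and the returned strings are invented for illustration. The point is that the caller waits with a timeout on the future instead of interrupting a thread that is in the middle of a request.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class AsyncTimeoutSketch {
    /** Hypothetical stand-in for an async query; completes with "items" after delayMs. */
    static CompletableFuture<String> queryAsync(long delayMs, ScheduledExecutorService timer) {
        CompletableFuture<String> result = new CompletableFuture<>();
        timer.schedule(() -> { result.complete("items"); }, delayMs, TimeUnit.MILLISECONDS);
        return result;
    }

    /** Blocks for at most limitMs; returns an empty response if the call takes longer. */
    static String queryWithLimit(long delayMs, long limitMs) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            CompletableFuture<String> pending = queryAsync(delayMs, timer);
            try {
                return pending.get(limitMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // CompletableFuture.cancel does not interrupt any thread; the flag has no effect.
                pending.cancel(false);
                return "";
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        } finally {
            timer.shutdownNow(); // the timer exists only to drive the stand-in above
        }
    }

    public static void main(String[] args) {
        System.out.println("[" + queryWithLimit(10, 2_000) + "]"); // prints [items]
        System.out.println("[" + queryWithLimit(2_000, 50) + "]"); // prints []
    }
}
```

Whether cancelling the real SDK future frees the underlying request promptly is not something this sketch demonstrates; it only shows the calling pattern.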

I did mention above the suspect code in the Apache library. It's possible that has been patched now but I have not revisited Apache change logs.

jocull avatar Mar 30 '23 02:03 jocull

We released a fix via https://github.com/aws/aws-sdk-java-v2/pull/4066.

The fix is available on Java SDK version 2.20.83.

@andrewyoo @jocull I know the fix is not easy to test due to the nature of the issue and because you changed to async, but let us know of any other issues you find after upgrading to a newer version.

debora-ito avatar Jun 21 '23 18:06 debora-ito

@debora-ito We switched back to the synchronous SDK with the new version and tested both locally and in a load-tested environment. We could not reproduce the issue this time, so I believe it is fixed. Thank you! 😄

jocull avatar Jun 29 '23 16:06 jocull