
Memory Leak with DynamoDB because of KtorEngine

Open vgiguere opened this issue 3 years ago • 3 comments

Describe the bug

There is a slow memory leak when using DynamoDB

The origin of the leak seems to reside with the KtorEngine, which is used by smithy's SdkHttpClient. I have opened a ticket with them: https://youtrack.jetbrains.com/issue/KTOR-4823/Memory-Leak-on-KtorEngine-

This is what we have been experiencing (and have reproduced several times):

We have an application on which we have ramped up shadow traffic to DynamoDB to a few hundred TPS. Within about a day, the leak grows to ~850 MB of heap memory that is never collected. In a heap dump, the DynamoDbClient is holding on to 90% of the heap memory: hundreds of thousands of objects from the kotlinx.coroutines package that are not released.

Screen Shot 2022-08-26 at 12 41 00 PM

Expected behavior

We expect the Kotlin SDK not to leak memory.

Current behavior

If you run a few hundred TPS for a day and take a heap dump, you will see that DynamoDbClient is leaking memory through the underlying KtorEngine.


Steps to Reproduce

Run a Kotlin application that uses coroutines/suspend functions and reads from and writes to DynamoDB through the DynamoDbClient at a rate of a few hundred TPS. After a few hours, take a heap dump and look at which objects are holding on to heap memory. The DynamoDbClient should be near the top, retaining thousands of coroutine references.
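For anyone following these steps, here is a minimal sketch of taking the heap dump from inside the running application instead of attaching an external tool. It assumes a HotSpot-based JVM (OpenJDK, Corretto) and uses only the JDK's `HotSpotDiagnosticMXBean`. The helper name `dumpHeap` and the output file name are hypothetical and not part of the SDK.

```kotlin
import com.sun.management.HotSpotDiagnosticMXBean
import java.io.File
import java.lang.management.ManagementFactory

// Hypothetical helper: writes an .hprof heap dump of the current JVM.
// Assumes a HotSpot-based JVM; other JVMs may not expose this MXBean.
fun dumpHeap(path: String, liveOnly: Boolean = true): File {
    val file = File(path)
    // The JDK refuses to overwrite an existing file, so remove a stale dump first.
    if (file.exists()) file.delete()
    val bean = ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean::class.java)
    // liveOnly = true dumps only objects that are still reachable.
    bean.dumpHeap(file.absolutePath, liveOnly)
    return file
}

fun main() {
    // Hypothetical file name; the timestamp keeps successive dumps apart.
    val dump = dumpHeap("dynamodb-client-${System.currentTimeMillis()}.hprof")
    println("Wrote ${dump.length()} bytes to ${dump.absolutePath}")
}
```

The resulting file can be opened in a heap analyzer such as Eclipse MAT or VisualVM to see which objects retain the most memory.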

Possible Solution

Work with the Ktor folks to fix it?

I have seen a similar but not identical bug on their tracking system: https://youtrack.jetbrains.com/issue/KTOR-4288

I have myself opened up this one: https://youtrack.jetbrains.com/issue/KTOR-4823/Memory-Leak-on-KtorEngine-

If you could make sure the issue has traction, it would be great!

Context

We are looking to use the Kotlin SDK in production as soon as we find it to be stable enough. We understand that right now, it is still in early release, but if it were not for the leak, the client for DynamoDB is behaving very well and we'd be willing to try it in a production environment.

AWS Kotlin SDK version used

0.15.0, 0.16.0, 0.17.0-beta

Platform (JVM/JS/Native)

Java 11 (Corretto) - Kubernetes

Operating System and version

Docker, Java 11.0.14.1

vgiguere avatar Aug 29 '22 18:08 vgiguere

Hi @vgiguere, thanks for the bug report. To confirm a few things:

The SDK's default HTTP engine was changed to be OkHttp (not Ktor) in 0.16.4-beta. We still provide a Ktor engine in the latest versions but it must be manually selected during client configuration. You mentioned that the issue occurs on 0.17.0-beta. When using that version, are you specifically configuring Ktor as an engine for DynamoDbClient? Does using the default OkHttp engine cause the same issue?

ianbotsf avatar Aug 29 '22 21:08 ianbotsf

Apologies - I guess my browser filled in that version field from a previous issue I had submitted and I did not notice. The version we tested and profiled was 0.16.0 - I will test 0.16.4-beta and hopefully the problem is gone.

Thank you ;)

vgiguere avatar Aug 30 '22 17:08 vgiguere

I cannot reproduce a strictly increasing memory leak, even after running for several hours at high TPS. When taking periodic heap dumps, I do occasionally see 10K+ CombinedContext objects, but subsequent dumps show far fewer objects and much lower memory usage. The memory usage/retention seems to fluctuate to a large degree, as I'd expect with highly concurrent, high-bandwidth code.

@vgiguere Have you taken multiple heap dumps over the lifecycle of your application? Do they always show increasing (vs decreasing) utilization of CombinedContext?
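One lightweight way to answer the "increasing vs decreasing" question between full heap dumps is to sample used heap after requesting a GC and look at the trend. The sketch below uses only the JDK's `MemoryMXBean`. The function names, the sample count, and the 10-second interval are assumptions for illustration, and `System.gc()` is only a hint that the JVM may ignore, so treat the numbers as approximate.

```kotlin
import java.lang.management.ManagementFactory

// Used heap in bytes after *requesting* a GC. The JVM is free to ignore
// System.gc(), so this only approximates the retained (live) heap.
fun usedHeapAfterGc(): Long {
    System.gc()
    return ManagementFactory.getMemoryMXBean().heapMemoryUsage.used
}

// True only if there are at least two samples and each one is larger
// than the sample before it.
fun isStrictlyIncreasing(samples: List<Long>): Boolean =
    samples.size >= 2 && samples.zipWithNext().all { (prev, next) -> next > prev }

fun main() {
    val samples = mutableListOf<Long>()
    repeat(6) {
        samples += usedHeapAfterGc()
        println("sample ${samples.size}: ${samples.last() / (1024 * 1024)} MiB")
        // Hypothetical interval; a slow leak likely needs minutes or hours between samples.
        Thread.sleep(10_000)
    }
    println("strictly increasing: ${isStrictlyIncreasing(samples)}")
}
```

A post-GC baseline that keeps climbing over many hours points toward a leak, while one that rises and falls is more consistent with normal allocation churn.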

It's also possible the concurrency method my test code uses differs significantly from your own. Can you provide minimal sample code which reproduces the problem?

Lastly, my test code ran with default JVM settings under OpenJDK 11. Can you confirm any non-default JVM settings you're using, particularly that might affect memory, threading, or garbage collection?
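If it helps, the settings in question can be read from inside the running process. This is a rough sketch using the JDK's management beans; the helper name `describeJvm` is hypothetical, and the output only covers flags passed on the command line plus the collectors the JVM selected, not every tunable.

```kotlin
import java.lang.management.ManagementFactory

// Hypothetical helper: collects the JVM details most relevant to memory,
// threading, and garbage collection as printable lines.
fun describeJvm(): List<String> {
    val runtime = ManagementFactory.getRuntimeMXBean()
    val lines = mutableListOf<String>()
    lines += "JVM: ${runtime.vmName} ${runtime.vmVersion}"
    lines += "Max heap: ${Runtime.getRuntime().maxMemory() / (1024 * 1024)} MiB"
    lines += "Processors: ${Runtime.getRuntime().availableProcessors()}"
    // inputArguments holds the flags passed to the JVM (for example -Xmx or -XX options).
    lines += "Flags: ${runtime.inputArguments.joinToString(" ").ifEmpty { "(none)" }}"
    // One line per garbage collector the JVM is actually using.
    for (gc in ManagementFactory.getGarbageCollectorMXBeans()) {
        lines += "Collector: ${gc.name}"
    }
    return lines
}

fun main() {
    describeJvm().forEach { println(it) }
}
```

In a container, the reported max heap and processor count reflect what the JVM detected from the container limits, which is worth including when comparing environments.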

Thank you.

ianbotsf avatar Sep 01 '22 20:09 ianbotsf

It looks like this issue has not been active for more than 5 days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.

github-actions[bot] avatar Sep 06 '22 21:09 github-actions[bot]