
Memory leak in `akka.actor.LocalActorRef`

Open YevSent opened this issue 2 years ago • 14 comments

Summary

I'm working on upgrading OpenWhisk to Akka 2.6.20 and Scala 2.13, and I've run into an issue where OpenWhisk invokers consume all available G1 Old Gen heap after running for a couple of days under active traffic.

While doing heap profiling, I got the following suggestion from Heap Hero:

One instance of akka.actor.LocalActorRef loaded by jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x3c05c5018
occupies 20,136,784 (18.14%) bytes.
The memory is accumulated in one instance of scala.collection.immutable.RedBlackTree$Tree,
loaded by jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x3c05c5018, which occupies 20,132,728 (18.14%) bytes.

Further analysis with Eclipse Memory Analyzer shows the following (screenshots: 2023-08-03 at 11:26:51 AM and 11:29:54 AM).

Environment details:

  • Scala 2.13
  • Akka 2.6.20
  • Akka HTTP 10.2.10
  • Akka Management 1.1.4

Any suggestions on where I should look to find the root cause of this memory leak?

YevSent avatar Aug 03 '23 16:08 YevSent

I believe akka 2.6.20 is the first release under the non-open source BSL license, not Apache v2. Therefore changes to update OpenWhisk to akka 2.6.20 cannot be accepted by the Apache OpenWhisk project.

dgrove-oss avatar Aug 03 '23 16:08 dgrove-oss

@dgrove-oss 2.6.20 is still Apache; 2.7.x and later are BSL. They actually released another patch, 2.6.21, a couple of months ago to fix a TLS bug.

Apache Pekko has started doing official releases over the last month. Once we get onto 2.6.20 we can start discussing migrating the project to Pekko. So far the core modules, HTTP, and Kafka have been released. They're about to do Management and then the rest of the connectors. At the pace they're going, I think there should be releases for everything by September.

For the topic of this memory leak, more information is needed. Is the memory leak only with 2.6.20? Can you reproduce it off master? Are you using the new scheduler, which uses the v2 FPCInvoker, or the original invokers?

bdoyle0182 avatar Aug 03 '23 16:08 bdoyle0182

Cool, @bdoyle0182, thanks for clarifying. I had found an old post saying 2.6.19 was the last Apache version and that 2.6.20 and beyond were going to be BSL.

A strategy of getting to the most recent Apache-licensed version from Lightbend and then switching to Pekko sounds right to me.

dgrove-oss avatar Aug 03 '23 16:08 dgrove-oss

@bdoyle0182 we are migrating our project from Akka 2.5.26; on that version there is no memory leak. As our project has some slight modifications to OpenWhisk, I'm not able to use the OpenWhisk master branch to run the same load and collect heap dumps. We use the original invokers.

YevSent avatar Aug 03 '23 17:08 YevSent

Apache Pekko, a fork of Akka 2.6, has been released. v1.0.1 is out and is very similar to Akka 2.6.21.

https://pekko.apache.org/docs/pekko/current/project/migration-guides.html

pjfanning avatar Aug 05 '23 18:08 pjfanning

@joni-jones Is there any chance you could provide a self-contained reproducer?

He-Pin avatar Aug 05 '23 18:08 He-Pin

If you want to raise a Pekko issue about this, someone may be able to help.

https://github.com/apache/incubator-pekko

pjfanning avatar Aug 05 '23 18:08 pjfanning

Since the strings are all IP addresses and the data sits below the stream materializer, this could be incoming connections that are hanging / not being cleaned up (without knowing anything about OpenWhisk). Hard to say without knowing more about the setup.

jrudolph avatar Aug 07 '23 18:08 jrudolph

@jrudolph I'm looking at these graphs, and the strings with IPs show 0% compared to the RedBlackTree allocation. But I'm still investigating whether they could be an issue.

I see that these RedBlackTree instances have flow-*-0-ignoreSink as a value.

YevSent avatar Aug 07 '23 20:08 YevSent

What you are probably looking at are the child actors of the materializer actor, where one actor is spawned for every stream you run. So it might be a bit hard to see what the actual issue is, because the memory may be spread over all these actors. One way to go about it would be to look at a class histogram over just the elements referenced by that children tree and see what kind of data is in there.

jrudolph avatar Aug 11 '23 09:08 jrudolph

Thanks @jrudolph. Yes, I tried to walk down through these trees, and the leaves point to child actors and the ignore-sink.

(Screenshot 2023-08-14 at 3:51:48 PM)

I don't know if it's related, but some time ago, when OpenWhisk was upgraded from Akka 2.5.x to 2.6.12 and the actor materializer was removed, a materializer.shutdown() call was dropped: https://github.com/apache/openwhisk/pull/5065/files#diff-e0bd51cbcd58c3894e1ffa4894de22ddfd47ae87352912de0e30cd60db315758L131-R130. I don't know all the internals of Materializer, but if that method was used to destroy all related actors, is it possible that, with it removed, some actors hang around after connection shutdown?
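For context, this is only a minimal sketch of the pattern I mean, assuming a client that owns its own materializer; the class and method names are illustrative, not the PR's actual code:

```scala
import akka.actor.ActorSystem
import akka.stream.Materializer

// Sketch: a client that owns a dedicated materializer and tears it down explicitly.
class ClientWithOwnMaterializer(implicit system: ActorSystem) {

  // Every stream run with this materializer spawns child actors under it.
  private implicit val materializer: Materializer = Materializer(system)

  // ... run request/response streams here ...

  def shutdown(): Unit = {
    // Shutting down the materializer stops all of its child stream actors,
    // which is the kind of cleanup that disappeared when the explicit call was removed.
    materializer.shutdown()
  }
}
```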

The version that we are upgrading from still uses 2.5.x Akka and we don't have issues with memory there.

YevSent avatar Aug 14 '23 21:08 YevSent

It seems the issue is in https://github.com/apache/openwhisk/blob/master/common/scala/src/main/scala/org/apache/openwhisk/http/PoolingRestClient.scala#L76: with materializer.shutdown() removed by the Akka upgrade to 2.6.12, it leaks memory. Also, OverflowStrategy.dropNew was deprecated in 2.6.11, and underneath, the queue for the same behavior was changed from SourceQueueWithComplete to BoundedSourceQueueStage, which, without proper cleanup of the materialized resources, doesn't seem to free the memory.
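To illustrate the two queue flavors mentioned above, here is a small sketch against the Akka 2.6 scaladsl (the Int element type and object name are just for illustration):

```scala
import akka.stream.{BoundedSourceQueue, OverflowStrategy}
import akka.stream.scaladsl.{Source, SourceQueueWithComplete}

object QueueFlavors {
  // Old overload: still typed as SourceQueueWithComplete; dropNew is deprecated since 2.6.11.
  val legacy: Source[Int, SourceQueueWithComplete[Int]] =
    Source.queue[Int](16, OverflowStrategy.dropNew)

  // New size-only overload: materializes a BoundedSourceQueue (drop-new semantics),
  // backed by BoundedSourceQueueStage.
  val bounded: Source[Int, BoundedSourceQueue[Int]] =
    Source.queue[Int](16)
}
```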

In our implementation, we use a wrapper on top of PoolingRestClient for HTTP communication between invokers and action pods instead of OpenWhisk's ApacheBlockingContainerClient.

I did a couple of different implementations, including:

  1. Using OverflowStrategy.dropHead to keep SourceQueueWithComplete instead of the new BoundedSourceQueueStage, with extra logic on shutdown: no memory leaks were observed.
  2. Continuing to use OverflowStrategy.dropNew with no changes to shutdown: this still seems to leak memory.
  3. Using the queue backed by BoundedSourceQueueStage but with proper cleanup on shutdown via a KillSwitch and queue.complete: this seems to work fine as well, with no memory issues (see the sketch after this list).
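A minimal sketch of option 3, assuming a host-connection-pool client; the class name, constructor parameters, and error handling are illustrative and not the actual OpenWhisk PoolingRestClient code:

```scala
import scala.concurrent.{Future, Promise}

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.{HttpRequest, HttpResponse}
import akka.stream.{KillSwitches, QueueOfferResult}
import akka.stream.scaladsl.{Keep, Sink, Source}

class QueueingClient(host: String, port: Int, queueSize: Int)(implicit system: ActorSystem) {
  import system.dispatcher

  // Host connection pool flow: (HttpRequest, Promise) in, (Try[HttpResponse], Promise) out.
  private val pool = Http().cachedHostConnectionPool[Promise[HttpResponse]](host, port)

  // Source.queue(n) without an OverflowStrategy materializes a BoundedSourceQueue.
  private val ((queue, killSwitch), done) =
    Source
      .queue[(HttpRequest, Promise[HttpResponse])](queueSize)
      .viaMat(KillSwitches.single)(Keep.both)
      .via(pool)
      .toMat(Sink.foreach { case (result, promise) => promise.complete(result) })(Keep.both)
      .run()

  def request(req: HttpRequest): Future[HttpResponse] = {
    val promise = Promise[HttpResponse]()
    queue.offer(req -> promise) match {
      case QueueOfferResult.Enqueued => promise.future
      case other => Future.failed(new RuntimeException(s"Request not enqueued: $other"))
    }
  }

  // The cleanup that matters for the leak: complete the queue and trip the kill switch
  // so the materialized stream (and its child actors) actually terminates.
  def shutdown(): Future[Unit] = {
    queue.complete()
    killSwitch.shutdown()
    done.flatMap(_ => Http().shutdownAllConnectionPools())
  }
}
```

The key point is the shutdown() method: completing the queue and shutting down the kill switch lets the materialized stream, and the child actors it holds under the materializer, terminate.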

YevSent avatar Aug 21 '23 21:08 YevSent

@joni-jones Thanks for sharing the update.

He-Pin avatar Aug 22 '23 03:08 He-Pin

It looks like I was able to fix the memory leak, and it has been stable in our production so far. I will be working on a PR shortly; I believe the leak happens due to improper resource cleanup in PoolingRestClient.

YevSent avatar Sep 06 '23 17:09 YevSent