identus-cloud-agent

Memory leak

Open bsandmann opened this issue 1 year ago • 14 comments

Is this a regression?

Yes

Description

It seems like the prism-agents (tested on 1.19.1) have a memory leak. I'm running a single agent on Ubuntu 22.04 using the "docker-compose.yml" setup as described in the readme of the GitHub page. The memory usage of that single agent was slowly increasing by around 1 GB a day, and that's only a single agent at idle with no user interaction at all. I haven't investigated whether it is the agent, the node, or some other component, but it's something someone should look into. See screenshot. I'm currently also testing 1.24.0, but it looks like it's leaking at the same rate.

[screenshot: agent memory usage increasing over several days]

Please provide the exception or error you saw

No response

Please provide the environment you discovered this bug in

No response

Anything else?

No response

bsandmann avatar Jan 23 '24 11:01 bsandmann

Thanks @bsandmann - our agents running in Kubernetes haven't seen similar behaviour. This isn't to say there isn't a leak; it might just be that the constraints and garbage collection are hiding it. Checking the docker-compose.yml, we don't set any restrictions on memory usage. I'll run this test on an Ubuntu host now and report back in a few days to confirm the issue.

ghost avatar Jan 23 '24 11:01 ghost

Update: I've run agents using the docker compose file included within the repo for several days at a time. I can observe a build-up of memory usage, but garbage collection kicks in and reduces it; the GC saw-tooth pattern is present, so all looks good. I don't want to close this just yet until I've done a specific test to concretely prove there is no leak - as such, I'll be soak testing some agents towards the end of this week and I'll post the results on this ticket.

I'll leave the agents idle for 24 hours, soak test for 24 hours (if possible, may need to be shorter) and then leave idle again for 24 hours

ghost avatar Feb 06 '24 11:02 ghost

Update: I've had issues running these tests (due to local hardware, nothing to do with Agent) and need to repeat them

ghost avatar Feb 20 '24 14:02 ghost

Here is the development over a month (continuation of the earlier image/test): [screenshot: memory usage over one month]

bsandmann avatar Feb 20 '24 14:02 bsandmann

tl;dr

To investigate a memory leak, I care about the part where this graph is stable: the part where the memory used by the JVM doesn't increase.

After that:

I want to know how frequently the garbage collection runs. Does the frequency increase? Does the memory usage always come back to the same baseline, or is the baseline increasing when the garbage collection does a full collection (not a quick one)?

@davidpoltorak-io So my suggestion would be to limit the memory of the JVM to 4, 5, or 6 GB (the minimum for the system to be stable and where the garbage collection starts working), and let's print GC stats/details.

I'm not sure what the right flag is to start the JVM with to print this information. Maybe -Xlog:gc, if it's not too verbose. (Also, let's configure this GC log to write to a file instead of the console/standard output.)
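For illustration, one way this could be wired up (a sketch, not verified against the agent's image; it assumes the container runs a HotSpot JVM, JDK 9+, which honors the standard JAVA_TOOL_OPTIONS variable, and that the service name matches the repo's docker-compose.yml):

```yaml
# Illustrative docker-compose override -- service name and heap size are assumptions.
services:
  cloud-agent:
    environment:
      # Cap the heap and write unified GC logging to a file inside the container,
      # decorated with wall-clock time, uptime, level, and tags.
      JAVA_TOOL_OPTIONS: "-Xmx4g -Xlog:gc*:file=/tmp/gc.log:time,uptime,level,tags"
```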

FabioPinheiro avatar Feb 20 '24 16:02 FabioPinheiro

@davidpoltorak-io , do you have a contributor account to reassign this issue to? cc @yshyn-iohk @mineme0110

mkbreuningIOHK avatar Mar 01 '24 09:03 mkbreuningIOHK

@bsandmann, we have a screenshot from the Grafana dashboard. One of the agents in our SIT environment (last three days): [Grafana screenshot: last three days]

And a similar screenshot for the same agent over the last seven days: [Grafana screenshot: last seven days]

These pictures don't look like a memory leak issue.

Could you share how you run the agent and any other essential details for reproducing it?

yshyn-iohk avatar Mar 26 '24 13:03 yshyn-iohk

@yshyn-iohk Thanks for taking a look at it.

I'm following the Quick-Start instructions without any modifications on an Ubuntu 22.04 installation. I haven't looked deeper into the issue yet, but I've noticed this behavior on similar setups with one or multiple agents. Some ideas:

  • It might not be the agent itself causing the problem but rather some other component that is started alongside it.
  • The issue could also be related to the virtualization setup (Proxmox) being somewhat incompatible with something inside the Docker containers. However, I'm running similar VMs with other images just fine.
  • I'm also running out of disk space on those VMs with the agents. The agent has used up a few GB over the last weeks, just idling, and now it's starving the VM of all its disk space. I just noticed this now. There's nothing else but the agent running on that VM, and it must have filled up 32GB in a few weeks. I haven't had the time to look into that yet. Any ideas?

bsandmann avatar Mar 26 '24 15:03 bsandmann

Please run docker container stats to check how much RAM the agent containers are actually using. I don't think they should be anywhere near that.

What I think may be happening is Docker itself basically eating all the available RAM on the VM. You can limit how much RAM Docker can use, of course, but by default it will be happy to use all that's available and reserve it for the containers.

https://docs.docker.com/config/containers/resource_constraints/

I realized I have the same problem on my test deploy :)
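For reference, a minimal sketch of that kind of limit (the service name and values are illustrative and would need to match the actual docker-compose.yml):

```yaml
# Illustrative only -- adjust the service name and limits to your compose file.
services:
  cloud-agent:
    mem_limit: 2g        # hard cap on container memory
    memswap_limit: 2g    # no extra swap beyond the memory limit
```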

robertocarvajal avatar Apr 05 '24 18:04 robertocarvajal

I have a VM now running for 12 days. On three occasions I used the commands htop, free, docker container stats, and df to get a slightly better picture of what is going on. This is obviously far from an in-depth analysis but gives a rough picture: [attachment: Identus_ressource_usage] I haven't looked at it in detail, but there are some first conclusions:

  1. The problem isn't as big as initially thought, due to the fact that Proxmox can't look into the Linux VM to see the actual memory allocation. This can cause it to display a higher memory footprint than is actually used. Nonetheless, this is not generally happening for all Linux VM workloads and still might point to some unusual behavior. Which brings us to:
  2. There is still a small memory increase noticeable over the 12 days as well as a disk space increase. I'll further let it run to see if the memory increase persists or if it is just a zigzag pattern which isn't detectable with just three snapshots.

If you have any comments on which other data to collect from those VMs, let me know. I'm planning to let the VM run for another 1-2 weeks and maybe we get something out of it.
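For what it's worth, a sketch of how the same snapshots could be captured on a schedule instead of by hand (the output path, interval, and columns are arbitrary choices):

```bash
#!/usr/bin/env bash
# Sketch: append a timestamped resource snapshot to a log file (e.g. run hourly via cron).
OUT=/var/log/identus-usage.log
{
  echo "=== $(date -u +%FT%TZ) ==="
  free -m                                   # host memory usage
  df -h /                                   # root filesystem usage
  docker container stats --no-stream \
    --format "{{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}"   # per-container snapshot
} >> "$OUT"
```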

bsandmann avatar May 14 '24 11:05 bsandmann

Thanks, @bsandmann, for the additional information! I will also check the memory footprint of the Cloud Agent v1.33.0 in our SIT environment and update this ticket later this week.

yshyn-iohk avatar May 21 '24 14:05 yshyn-iohk

@yshyn-iohk Here is an additional data capture from the VM: [attachment: Identus_ressource_usage2] A few things are worth noticing:

  1. The GC seems to keep the increasing memory footprint in check, so technically this can't really be classified as a memory leak in my opinion. Nonetheless, the memory allocation pattern is a bit unusual and likely more than necessary for an application at idle. Without having looked at the code, I would say it is due to sub-optimal handling of logs or strings.
  2. Between the 20th and 28th, the memory footprint of all applications stayed somewhat identical. Did we reach a peak, or is this just a lucky snapshot?
  3. The disk usage further increased: if my calculation is correct, each instance (I'm running 2 agents) is writing about 45 MB of data to the disk daily. That's nearly 1.5 GB a month and around 17 GB a year ... at idle. I guess the memory allocation issue is closely related to this creation of data, probably logs.

bsandmann avatar May 28 '24 07:05 bsandmann

Thanks for additional details, @bsandmann.

The first point is general JVM behavior: it consumes as much memory as possible up to the given Xmx configuration until the GC is triggered. Connecting to the application with a JVM profiler and manually triggering a garbage collection to see memory being freed can help you confirm it works as expected.
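As an illustration (assuming the JDK's jcmd tool is available inside the agent container, which may not be the case for a slimmed-down image), the heap can be inspected and a full collection requested like this:

```bash
# <agent-container> is a placeholder; PID 1 is typical for a single-process container.
docker exec <agent-container> jcmd 1 GC.heap_info   # current heap occupancy and capacity
docker exec <agent-container> jcmd 1 GC.run         # request a full garbage collection
```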

The second point I can explain by looking at the chart of the memory footprint. If you measure the memory allocation from time to time using htop, there might be snapshots where the memory looks the same, but usually it moves from the minimum to the maximum value and back as the GC collects the orphaned objects.

The disk usage is out of the scope of this issue, but I can also elaborate on it here. The Agent has a couple of background jobs and keeps pools of connections to the DB, so it's expected that the logs will constantly grow. Log quality and content are a matter of optimization and tuning: improving the logic, writing only meaningful logs, and filtering out the noisy entries from the code and 3rd-party libraries.

We're collecting the logs using Loki, and as far as I know, the log files don't occupy disk space on the host machine. I will clarify this question with the Platform Engineering team, and if needed, a corresponding ticket about log rotation/retention configuration will be created.
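For a local docker-compose deployment like the one reported here (as opposed to our Loki-based environments), a sketch of capping on-disk log growth with Docker's json-file driver; the service name and sizes are illustrative:

```yaml
# Illustrative per-service logging cap for a local docker-compose deployment.
services:
  cloud-agent:
    logging:
      driver: json-file
      options:
        max-size: "10m"   # rotate after 10 MB
        max-file: "3"     # keep at most 3 rotated files
```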

I'm inclined to close this issue, as there is no evidence of a memory leak. One of the best indicators of a memory leak in a JVM application is the application failing with an OutOfMemoryError. Fortunately, we don't see this error.

yshyn-iohk avatar Jun 17 '24 08:06 yshyn-iohk

Thanks @yshyn-iohk for looking into this. From my perspective we can close this. Should we open a new ticket for providing options to limit the output from the logs or is this already covered somewhere else?

@davepoltorak Last time we spoke you had some ideas regarding the source of this issue. Maybe you'll find the time to quickly elaborate on that so it doesn't get lost; maybe someone will pick up this thread at some other point in time.

bsandmann avatar Jul 09 '24 14:07 bsandmann

The issue is closed as there is no activity on it, and @bsandmann confirmed that it can be closed.

yshyn-iohk avatar Oct 17 '24 10:10 yshyn-iohk