High memory usage/possible memory leak
We use 1Password Connect and the operator, deployed with the Connect Helm chart, across several Kubernetes clusters and have noticed very high memory usage (up to 32GB on a large cluster). In our monitoring we can see that memory consumption goes up steadily even when no new 1Password items are created. Could this be a memory leak?
- 1Password Connect version: 1.2.0
- 1Password operator version: 1.0.1
- 1Password Connect Helm chart version: 1.4.0
- Kubernetes versions: between 1.18.5 and 1.19.10
Hi @simoncolincap,
Thank you for reaching out 🙌 What you describe definitely sounds like it's worth investigating.
To reproduce this locally, some extra information would be really helpful:
- Can you see if the memory consumption of the api or sync container is increasing? Or both?
- What is the rate of the memory increase? (e.g. 10MB/hour)
- Do you have a rough idea of how many requests the Connect instance is receiving from clients per hour?
- Is there a specific kind of request that is used a lot? (an SDK function that is used a lot also works)
If you could try to answer these questions as well as possible, that would help us a great deal in locating the issue 😃
Joris
Hi, the memory consumption of both sync and api is increasing, with the api container using around twice the memory of the sync container. On most of our clusters we have fewer than 10 secrets that come from 1Password, and the memory increases steadily at around 10MB/day, though it will sometimes jump up by tens of MBs at once. If we restart the Connect server, memory usage is low again but still increases over time. The only requests to Connect come from the 1Password operator; in the logs there are usually between one and two requests per second. The 32GB memory usage was on a larger cluster where there are many 1Password secrets (more than 1000, 1300 at the moment) that are created and destroyed automatically (around 100 creations and destructions a day), but even then this seems like a lot of memory.
Thank you for the extra details. That will definitely help us reproduce the issue.
I just noticed that you're not on the latest version of Connect and the operator. Could you try updating to the latest version of Connect (v1.5.0) so we can confirm whether the issue is still present?
We upgraded the Connect Helm chart to 1.5.0, which upgraded the 1Password Operator to 1.1.0 and 1Password Connect to 1.5.0. The issue is still there, but it looks like memory consumption has slowed down slightly; it's closer to 7MB per day now on the normal clusters. On our largest cluster the memory usage is still very high: it grows by around 10GB a day.
Hey! I wanted to let you know that we released v1.5.2 of Connect a while back, which addresses a memory issue we found.
Let me know if that fixes the problem you're seeing.
Hi, we tried the new version of Connect. Before the update, memory usage would end up between 6 and 7GB in a day; after the update, it goes up to either around 7GB or around 3.8GB. We haven't found a pattern or a reason why it's higher or lower on some days.
I also raised this in #46
Thanks for the updates. Quick update from our side: we recently found another potential cause.
While we're working on a fix, could you give an indication of how many requests per day these clusters are currently handling and by how much the memory consumption is increasing during that period? That could tell us whether our finding could explain the memory consumption you're both seeing.
We have around 14 secrets, polling at the default (every 15s)
Between 08/09/22 and 12/09/22:
- connect-api went from 10MB to 1GB
- connect-sync went from 41MB to 450MB

Thank you for providing those extra details! 🙌
Although at this moment it is difficult to say whether it is the only cause, I am convinced that at least part of it is caused by the issue we're investigating. We'll keep you posted on the progress.
Hey @jpcoenen, we had a 1Password pod restart due to 5GB memory usage. Just wondering if you had an update?
I'm sorry for the delay, @tomjohnburton. We are still trying to get the problem we identified fixed. Unfortunately, the root cause lies in a library Connect uses and getting the fix in there is taking longer than expected.
We are going to look into forking that library to bypass the external approval required.
Hi @jpcoenen, any update? Which library is holding us up?
Thanks
Hey @tomjohnburton! Apologies for the late response; I've been out for a while.
Thanks for your patience so far. I am working on this right now to get the changes into a private fork of the package in question. I'll report back in this thread when that is ready to test.
Hey all! We have just released v1.6.1 of Connect, which includes fixes for two separate memory leaks:
- Addressed a problem that caused Connect's memory consumption to increase by about 100B for every request handled.
- Addressed a problem that could cause Connect's memory consumption to increase by more than 1KB per request served after one of the containers (connect-api or connect-sync) disconnected.
Feel free to give it a try and let me know if it makes any difference!
Since no one has commented on the new release yet:
I applied the 1.6.1 connect-api and connect-sync yesterday by overriding the Helm chart. So far, it looks like connect-sync performs better memory-wise than before.
However, connect-api's memory usage still keeps rising. Please note this is only data from a single day of running, so I may be jumping to conclusions, but if someone could confirm these findings that would be great.
I can add another graph within a week of running these versions.
Both Connect pods are running with a 1G memory resource limit.
Sorry for my late reply; I was out of office for a while.
Happy to see that we made some progress, but that line for connect-api is still way too steep. Apparently, trying to reproduce the issue locally is not the perfect approach here.
What would be really valuable is some profiling data, to see what is going on inside your Connect instance. If we added an option to expose Go's profiling data on a specific port (not the same one as the Connect API), would you be willing to download it and share it with us? That file would not contain any memory contents itself, just a list of memory allocations.
It'd require updating Connect (the current version does not yet support this), setting a specific environment variable to enable profiling, and then downloading that file after a day or so. @TimCostrop, or anyone else in this thread, would you be up for that?
If those can easily be applied via my configuration file injected into the Helm chart, or easily enabled via a deployment change, I'm game.
But if my understanding is correct, this will only be possible with a future version of the connect-api?
That's great, thanks!
We've been discussing internally (it was a lengthy discussion) the best method to expose this data without accidentally making it publicly accessible. We've concluded that writing it to the container's filesystem would be an acceptable method. However, that would require some means of extracting the data, and we realize that is not always trivial. So would running something like kubectl cp <pod>:/path/to/profiling/data ./ be an option in your environment?
I've copied items from a pod before, that's not really an issue for our environment.
The only thing that seems different from other pods is the fact that the connect-api has a connect-sync sidecar, so I'm not 100% sure how the kubectl cp command will handle that.
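For what it's worth, kubectl cp accepts a -c/--container flag to select a container within the pod, so a command along these lines should work (the pod name, container name, and path below are placeholders):
kubectl cp -c connect-api <pod>:/path/to/profiling/data ./profiling-data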
As of v1.7.0 this issue still exists. More than happy to provide profiling data.
In a mostly idle cluster (4 secrets):
@antoineco Thanks for offering to help! We have recently introduced a profiler in Connect. However, the changes to the Helm chart were missing. I've just opened a PR to get that in.
@jpcoenen works for me. Please reach out and let me know how you'd like that profiling data to be sent to you.
We've just released the new version of the chart. To get the profiler set up, update the chart and then enable the profiler. Looking at the chart you shared, I think we can increase the interval to 12h.
helm repo update
helm upgrade connect 1password/connect --set connect.profiler.enabled=true --set connect.profiler.interval=12h
That will make the internal profiler write a snapshot once every 12h. It will keep only the 12 most recent snapshots.
Then after having Connect run for a few days, you can run the following to extract the profiles:
kubectl cp -c profiler-data <pod>:/data profiler
tar czf profiler.tar.gz profiler
Could you share that with me via email (joris.coenen[email symbol]1password.com)? Thanks!
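Once extracted, the snapshots can presumably also be inspected locally with Go's standard pprof tooling, assuming they are regular pprof heap profiles (the snapshot file name below is a placeholder):
tar xzf profiler.tar.gz
go tool pprof -top profiler/<snapshot>
go tool pprof -http=:8080 profiler/<snapshot>   # interactive view in the browser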
Hi @jpcoenen I tried to apply the latest helm chart with the requested profiler settings, but chart version 1.12.0 with connect version 1.7.1 does not run on our cluster:
Warning Failed 5s (x3 over 7s) kubelet Error: container has runAsNonRoot and image will run as root (pod: "onepassword-connect-7f8d97b57b-b5wdv_general(ed611dcd-60a2-4416-bc96-04413a838ef9)", container: profiler-data)
The profiler sidecar wants to run as root, which is not allowed in our environment. I'm also not seeing a lot of configuration options in the profiler sidecar image definition to change this.
@TimCostrop Thanks for looking into getting the profiler snapshots! I really appreciate how willing you all are to help us figure out the root cause of this issue. In the meantime, I have received some profiler snapshots from @antoineco, which do seem to confirm a suspicion I already had: Go does not report any memory leaks, even though Kubernetes reports steadily increased memory consumption. Since that aligns with what I am seeing locally, this one set of profiler data is enough for now. I'll let you know if that changes.
My current hypothesis is that we're not actually dealing with a leak in the traditional meaning, but that there is either an issue with how Kubernetes is reporting memory usage or how it is taking back memory from the Go runtime.
There are quite a few similar reports, especially when it comes to Kubernetes and Golang: https://github.com/golang/go/issues/44554 https://stackoverflow.com/questions/68289644/relationship-between-container-memory-working-set-bytes-and-process-resident-mem
I am currently trying to figure out what we can best do there. I am wondering whether it would work to apply some "memory pressure" to the container by limiting its maximum memory consumption. That might trigger the container runtime to take back the memory the Go runtime has released.
I am really starting to become convinced of the above hypothesis that this is caused by k8s not reclaiming the memory that Go/Connect has released. Adding the following simple resource limits completely removes the upward trend of the memory consumption for me:
resources:
  limits:
    memory: 20Mi
  requests:
    memory: 15Mi
Without any memory limits:
With some memory limits in place:
That specific limit might be a bit too harsh, but it does seem to suggest that it is a part of the problem.
You should be able to specify resource limits for Connect through our Helm chart via the connect.api.resources and connect.sync.resources values. Feel free to give that a shot, and let me know if it changes anything for you. If you have already set limits, can you let me know whether Connect goes over those (i.e. the pod gets killed) or stays around/below them?
In the meantime, I'll see if we can set some sane defaults for these values.
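As a rough sketch, a values override could look like this (the memory figures are placeholders to tune for your cluster, not recommended defaults):
connect:
  api:
    resources:
      requests:
        memory: 128Mi
      limits:
        memory: 256Mi
  sync:
    resources:
      requests:
        memory: 64Mi
      limits:
        memory: 128Mi
Applied with something like helm upgrade connect 1password/connect -f values.yaml.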
I've done some further experimentation today. Using Chaos Mesh, I've simulated a pod that consumes a lot of memory. The experiment suggests that Kubernetes will reclaim the memory that is apparently in use by Connect the moment it needs it for another container.
Here is a graph of the memory consumption of my Connect pod:
Around 14:10 I started the memory stress test which claimed a lot of memory in a different pod:
This immediately led to a drop in apparent memory consumption for the Connect pod.
This suggests that the high numbers reported for container_memory_working_set_bytes (blue) and especially container_memory_usage_bytes (yellow) are not really an issue. They will drop when needed.
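For anyone who wants to reproduce the experiment: the memory stress in the other pod can be simulated with a Chaos Mesh StressChaos resource along these lines (a sketch; the name, namespace, label selector, size, and duration are placeholders):
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
  namespace: default
spec:
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: memory-hog   # placeholder label of the pod to stress
  stressors:
    memory:
      workers: 1
      size: 1GB
  duration: 30m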
Having the reported memory consumption grow over time is still not ideal, so we need to put some good default memory limits in place. However, I'd also like to hear if anyone in this thread has ever encountered actual problems because of the high reported memory usage. For example, has the Connect pod ever been recycled for anyone because of high memory usage?
AFAIK the Connect pod has never been recycled for us, but it has triggered various monitoring alerts due to memory pressure on our cluster when no resource limits were defined.
Our current setup limits connect-api and connect-sync to 1G of memory each, which seems like a lot going through this thread again and looking at the memory usage examples, but it could also mean Connect never really runs into a memory issue because the limit is far higher than it actually needs.
Thanks for the extra info, @TimCostrop.
> AFAIK the Connect pod has never been recycled for us, but it has triggered various monitoring alerts due to memory pressure on our cluster when no resource limits were defined.
Okay, that's good to know. Is there maybe some flexibility in what kind of memory consumption you are monitoring? My experiments indicate that the container_memory_rss metric reported by k8s is actually a more realistic representation of Connect's memory consumption. It does not grow over time. Is there any chance that you can monitor that instead?
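As a rough sketch of what that could look like as a Prometheus alerting rule on cAdvisor metrics (the label matcher and threshold here are assumptions to adapt to your setup):
groups:
  - name: onepassword-connect
    rules:
      - alert: ConnectHighMemoryRSS
        # Alert on RSS, which (per the discussion above) tracks actual consumption
        # more closely than the working-set/usage metrics.
        expr: container_memory_rss{container=~"connect-api|connect-sync"} > 256 * 1024 * 1024
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 1Password Connect container RSS above 256MiB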
> Our current setup limits connect-api and connect-sync to 1G of memory each, which seems like a lot going through this thread again and looking at the memory usage examples, but it could also mean Connect never really runs into a memory issue because the limit is far higher than it actually needs.
In my experiments, Connect's RSS memory consumption never went beyond 50MiB, even when continuously bombarding it with read requests. So I think 128MiB or 256MiB should be plenty. You should be able to verify this by checking container_memory_rss for your Connect pod.
The only exception I can think of is if you are using Connect to read very large files. Currently, Connect decrypts files in memory, so if you want it to serve a 500MiB file, it will need at least 500MiB of memory.