
High memory usage

paulfantom opened this issue 3 years ago · 12 comments

I'm seeing quite high memory usage from parca (~45GB) and parca-agent (~1GB). This is on a single-node setup without any additional scrape_configs; the node has just 44 pods running.

parca pprof from heap - https://share.polarsignals.com/f8b0beb/
parca-agent pprof from heap - https://share.polarsignals.com/89d4e84/

Might be related to https://github.com/parca-dev/parca/issues/283 but I did not have time to investigate.

paulfantom avatar Oct 13 '21 09:10 paulfantom

Prometheus output: [screenshot]

paulfantom avatar Oct 13 '21 10:10 paulfantom

As for Parca itself, this looks "good", as the majority of the heap is used by the chunkenc.Pool. Just today I started working on removing the need to store cumulative values and only storing the flat (leaf) values, from which we can calculate the cumulative values by simply adding the flat values together. Then, in the future, once we get to persistent storage, we'll be able to mmap older chunks so they don't have to be held in memory constantly (just like Prometheus).

Bear with us until then. Thanks for reporting!

metalmatze avatar Oct 13 '21 10:10 metalmatze

For now, I'd recommend running Parca with a retention of a few hours (3h, 6h, or 12h), depending on how much memory you're willing to pay for. Chunks older than the retention window are emptied and reused for newer chunks. This should make things more predictable for now.
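
For example, a minimal invocation could look like this (a sketch; the config path is illustrative, the retention flag is the relevant part):

    # Keep ~6 hours of profiling data in memory; chunks older than that are
    # emptied and reused. Use 3h or 12h depending on your memory budget.
    parca --config-path=/var/parca/parca.yaml --storage-tsdb-retention-time=6h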

metalmatze avatar Oct 13 '21 10:10 metalmatze

Update after redeploying with 3h storage retention: memory is still increasing, but at a slower rate. After 17h, parca is consuming ~30GB of memory.

[screenshot]

paulfantom avatar Oct 14 '21 07:10 paulfantom

This may still be an issue. I've set --storage-tsdb-retention-time=30m (the orange line) and memory consumption appears unaffected. The blue line is with 1h retention, so I'd have expected memory usage to be lower for the server running with 30m. I'm running v0.7.1.

[screenshot]

avestuk avatar Feb 08 '22 17:02 avestuk

The one thing that definitely still keeps increasing, with no vacuuming happening, is metadata. So if you have a lot of churn (for example, many deployments happening), this currently keeps growing ever so slightly. Other than that, it should be running mostly stable. At least that's the case for https://demo.parca.dev

Could you check the /metrics endpoint of Parca and see if you can observe an increase in truncated chunks over time? The metric to look for is parca_tsdb_head_truncated_chunks_total. Another one would be parca_tsdb_head_min_time; check whether it also increases ever so slightly as the oldest chunks are truncated. Let us know!
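
A quick way to check (a sketch; this assumes Parca's default HTTP port 7070 and that /metrics is reachable directly):

    # Run this twice, a few minutes apart: the truncated-chunks counter should
    # keep increasing and head_min_time should slowly move forward.
    curl -s http://localhost:7070/metrics | grep -E 'parca_tsdb_head_(truncated_chunks_total|min_time)'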

metalmatze avatar Feb 09 '22 17:02 metalmatze

This is a graph from this morning:

[screenshot]

We've also observed some very large spikes in memory usage from the Parca agents themselves.

[screenshot]

avestuk avatar Mar 24 '22 09:03 avestuk

Hey all. We've landed the new and improved storage in main. You can try a recent image (like ghcr.io/parca-dev/parca:main-ddad21cc) and enable the new storage with --storage=columnstore. We'd love to see how it performs for you! Note that currently the column-store will accumulate 512MB of memory and then throw away all data and start over. The amount can be controlled with --storage-active-memory=536870912. We're still working on persisting the data.
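
Putting those flags together, an invocation could look roughly like this (a sketch; the config path is illustrative):

    # Enable the new column-store backend and cap active memory at ~512MB.
    parca --config-path=/var/parca/parca.yaml \
      --storage=columnstore \
      --storage-active-memory=536870912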

brancz avatar Apr 14 '22 08:04 brancz

@brancz just weighing in on this with what I have observed running v0.12.0 as it may be helpful.

Here's the relevant config:

  containers:
    - resources:
        limits:
          cpu: '8'
          memory: 32Gi
        requests:
          cpu: '2'
          memory: 8Gi
      image: 'quay.io/observatorium/parca:v0.12.0'
      args:
        - /parca
        - '--config-path=/var/parca/parca.yaml'
        - '--log-level=info'
        - '--storage-active-memory=20000000000'       

And a graph of the memory usage:

[screenshot]

It didn't OOM (although it got eerily close to the limit 🤔), so I expect there was some rotation when storage-active-memory was reached. I just wanted to check whether the behaviour looks OK up to that point, though. Memory usage seems to increase steadily, and we are scraping ~20 static (no-churn) Pods as targets.

philipgough avatar Jul 22 '22 14:07 philipgough

Churn doesn't make a difference for Parca, but that's beside the point. --storage-active-memory accounts for the buffers held by FrostDB, plus some metadata around that, which is negligible. That means you always have to account for the badger-based metastore on top of it as well. And all of that is garbage collected by the Go runtime, meaning a fair amount of garbage needs to sit on the heap before GC runs and frees it; that memory will, however, be released by the Go runtime if another process wants to allocate it. I realize that doesn't necessarily make it a lot easier to reason about, but it puts things into perspective, and given all of that, the pattern doesn't seem entirely unreasonable.
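
One way to see that split in practice (a sketch, assuming Parca exposes the standard Go runtime metrics on its /metrics endpoint at the default port) is to compare live heap usage with memory the runtime is merely holding on to:

    # heap_inuse is memory held by live objects; heap_idle minus heap_released
    # is memory the Go runtime keeps for future allocations and has not yet
    # returned to the OS, which inflates RSS beyond what is actually live.
    curl -s http://localhost:7070/metrics | grep -E 'go_memstats_heap_(inuse|idle|released)_bytes'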

All of that said, we do know about a bunch of low-hanging fruit that can easily be optimized in how we physically store data (in memory and on disk): https://github.com/parca-dev/parca/issues/1309.

Does that somewhat answer the question?

brancz avatar Jul 22 '22 17:07 brancz

Actually, one more thing, could you share a memory profile? Then we could see if my suspicion is true in terms of where the memory is spent.

brancz avatar Jul 22 '22 17:07 brancz

That makes sense, and answers the question, thanks for providing the details.

There was nothing that looked particularly off to me either but since I had the data from the latest release and had been following this issue I felt it would be useful to provide it.

I'll follow up with the profile.

philipgough avatar Jul 25 '22 09:07 philipgough

Hi @brancz. I just downloaded Parca for the first time from the release page and ran it with the default configuration file from the documentation, without any additional switches (which implies the default of 512MB for FrostDB), and I'm seeing far more memory usage than what Parca itself reports while monitoring itself.

EDIT: I'm using only one parca-agent (with a systemd target), running on localhost.

What do you guys think? I've attached the screenshot.

If I let it run for another few minutes, it just continues to grow unbounded until it's OOM-killed.

[screenshot]

mraygalaxy avatar Aug 19 '22 21:08 mraygalaxy

https://pprof.me/96f41cb

You can see that most of the memory is being used by splitRowsByGranule, not by the actual storage blocks.
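
For anyone following along, a quick way to get to that kind of breakdown from a downloaded heap profile (the filename here is illustrative):

    # Print the functions holding the most in-use memory.
    go tool pprof -top -inuse_space heap.pb.gz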

thorfour avatar Aug 22 '22 21:08 thorfour

[screenshot]

We get a little over 6GiB before the block is rotated and memory usage drops.

thorfour avatar Aug 22 '22 21:08 thorfour

https://github.com/polarsignals/frostdb/pull/189 should fix the high memory usage

thorfour avatar Sep 07 '22 16:09 thorfour

Closing this issue as https://github.com/parca-dev/parca/pull/1682 has merged.

thorfour avatar Sep 08 '22 17:09 thorfour

@thorfour I don't believe this is fully resolved. I rebuilt parca from master, and while it grows a little more slowly, it still gets OOM-killed eventually.

I let it grow for a while, and it looks like the biggest consumer of memory is the scrapeLoop. Further down below that, most of the memory still appears to be leaking in FrostDB.

Here is a screenshot:

NOTE: This is on a vanilla Ubuntu 22.04 system, running a single parca-agent on localhost. Pretty basic setup.

[screenshot]

mraygalaxy avatar Sep 08 '22 21:09 mraygalaxy

Ping?

mraygalaxy avatar Oct 17 '22 21:10 mraygalaxy

Thanks! Sorry I missed the previous message; I'll dig into this more.

Do you have a profile with this high memory usage that you could share? You can upload it to pprof.me.
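
In case it helps, one way to capture the heap profile (a sketch, assuming Parca's Go pprof endpoints are enabled and reachable on the default port):

    # Saves the heap profile to a local file that can then be uploaded to pprof.me.
    curl -s -o heap.pb.gz http://localhost:7070/debug/pprof/heap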

thorfour avatar Oct 17 '22 21:10 thorfour

Also, could you send the commit sha you're running on, and the flags you're using to run it?

thorfour avatar Oct 17 '22 21:10 thorfour

Closing this issue, as I believe much of the memory pressure has been resolved by various improvements since it was opened. Please re-open with additional information if you're able to reproduce this on the latest versions of Parca.

thorfour avatar Mar 07 '23 20:03 thorfour