Memory consumption spikes after upgrading to v1.16.x

Open PhantomPhreak opened this issue 3 years ago • 5 comments

We're running chproxy in k8s. Before, we were on v1.13.2 and memory consumption was consistently low; a 512Mi limit was enough.

After upgrading to v1.16.x our chproxy pods get OOMed constantly, and even with an 8Gi limit the picture looks like the following: (screenshot attached)

There is very clear seasonality: at the start of every 10-minute interval we get a memory consumption spike. But not always: (screenshot attached)

Here are some Go memstats metrics exported by this chproxy instance: (screenshot attached)

For this instance we do not have any file_system-type caches, but we have 3 caches in Redis. We have 4 different clusters, 10 cluster users, and about 150 users configured.

I tested the same configuration on another chproxy instance that stays constantly idle (no user requests served), and memory consumption there remains flat (no spikes).

For now, even with an 8Gi limit, our chproxy pods still get OOMed sometimes :( Our current version is 1.16.3.

Any ideas on how to find the cause of this issue? Thanks!

PhantomPhreak avatar Aug 12 '22 07:08 PhantomPhreak

Hi,

Are you running queries that retrieve a lot of data? If so, then it's a "known" issue.

It was introduced during the implementation of the Redis cache feature and is linked to the flow of a query through chproxy. Here is how a non-cached query is handled: [send request to ClickHouse] => [temporarily store the result] => [check if the query is worth caching] => [cache the query if needed] => [return the result to the sender]. Before the Redis implementation, the temporary storage was disk based; now it is memory based.
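To make the flow above concrete, here is a minimal Go sketch of a non-cached query path with memory-based temporary storage; the function and parameter names are illustrative, not chproxy's actual internals. The point is that the whole result is materialized in a buffer before anything is returned, so peak memory scales with the response size:

```go
// Illustrative sketch only; not chproxy's real code paths.
package proxy

import (
	"bytes"
	"io"
	"net/http"
)

// handleNonCached forwards a query to ClickHouse, buffers the full response
// in memory, optionally caches it, then returns it to the client.
// Buffering the entire result is what makes large responses inflate RSS.
func handleNonCached(w http.ResponseWriter, upstream *http.Request, worthCaching func(size int) bool, store func([]byte)) error {
	resp, err := http.DefaultClient.Do(upstream) // [send request to ClickHouse]
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var buf bytes.Buffer // [temporarily store the result] -- memory based
	if _, err := io.Copy(&buf, resp.Body); err != nil {
		return err
	}

	if worthCaching(buf.Len()) { // [check if the query is worth caching]
		store(buf.Bytes()) // [cache query if needed]
	}

	_, err = w.Write(buf.Bytes()) // [return result to the sender]
	return err
}
```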

I think we will do a quick fix and go back to disk-based temporary storage. The ideal solution would be to avoid the temporary storage entirely and stream everything, but that might require a huge refactoring of the code. What do you think @Garnek20?

mga-chka avatar Aug 12 '22 08:08 mga-chka

Are you running queries that retrieve a lot of data?

For some queries the response body size can be a few MBs; I'm not sure whether that counts as a lot: (screenshot attached)

Before the Redis implementation, the temporary storage was disk based, now it's memory based.

Ah, okay, that explains everything. The previous solution worked pretty much fine, but with the current implementation, if some clients fetch more data than the current memory limit, chproxy will simply crash with an OOM. If the memory-backed temporary storage was introduced to reduce latency for "short" queries, then small results could stay in memory while big ones (above some threshold?) are spilled to disk-backed storage. Just an idea; a rough sketch of it is below.
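A rough Go sketch of that spill-over idea, assuming a hypothetical threshold value (this is not a chproxy setting, just an illustration of the approach):

```go
// Sketch of "small results in memory, big results on disk".
package proxy

import (
	"bytes"
	"io"
	"os"
)

// spillThreshold is an arbitrary illustrative value, not a chproxy option.
const spillThreshold = 4 << 20 // 4 MiB

// spillBuffer keeps the result in memory until it grows past spillThreshold,
// then spills to a temporary file so large results don't inflate RSS.
type spillBuffer struct {
	mem  bytes.Buffer
	file *os.File
}

func (b *spillBuffer) Write(p []byte) (int, error) {
	if b.file == nil && b.mem.Len()+len(p) <= spillThreshold {
		return b.mem.Write(p) // still small: keep in memory
	}
	if b.file == nil {
		// First write past the threshold: move what we have to disk.
		f, err := os.CreateTemp("", "chproxy-result-*")
		if err != nil {
			return 0, err
		}
		if _, err := f.Write(b.mem.Bytes()); err != nil {
			f.Close()
			return 0, err
		}
		b.mem.Reset()
		b.file = f
	}
	return b.file.Write(p)
}

// Reader returns the buffered result from wherever it ended up.
func (b *spillBuffer) Reader() (io.Reader, error) {
	if b.file != nil {
		_, err := b.file.Seek(0, io.SeekStart)
		return b.file, err
	}
	return bytes.NewReader(b.mem.Bytes()), nil
}
```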

Anyway, it's good to know it's a "known" issue :)

PhantomPhreak avatar Aug 12 '22 08:08 PhantomPhreak

Hi @PhantomPhreak, can you also share how many QPS chproxy handles during the OOMs for queries with roughly 5MB per query result?

gontarzpawel avatar Aug 12 '22 08:08 gontarzpawel

@Garnek20 hi, we got an OOM at 13:31:02; here are the related graphs for this time period: (screenshot attached)

I hope it helps!

PhantomPhreak avatar Aug 12 '22 11:08 PhantomPhreak

Hi, thanks, it helps.

FYI, we took a look at the problem and it seems deeper than expected. The Redis feature (which required a huge refactoring of the code) added a few hidden copies of objects, so there is an amplification between the size of a query result and the amount of memory chproxy needs to handle it. It also slightly impacts the old filesystem cache (there is at least one full copy of the result). It's a big regression that has been present since the 1.15.x versions.
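To illustrate the kind of amplification being described (this is not chproxy's actual code, just a sketch of the pattern): each extra materialization of the result body multiplies peak heap usage, whereas streaming through a single copy avoids it.

```go
// Illustration of result-copy amplification vs. streaming; not chproxy code.
package proxy

import (
	"bytes"
	"io"
)

// copyHeavy materializes the upstream body, then re-copies it twice more
// (e.g. once for a cache entry, once for the response), so a 100MB result
// can briefly require roughly 300MB of live heap.
func copyHeavy(upstream io.Reader) (cached, response []byte, err error) {
	var buf bytes.Buffer
	if _, err = io.Copy(&buf, upstream); err != nil { // copy #1: temporary storage
		return nil, nil, err
	}
	cached = append([]byte(nil), buf.Bytes()...)   // copy #2: cache entry
	response = append([]byte(nil), buf.Bytes()...) // copy #3: response body
	return cached, response, nil
}

// streamOnce avoids the amplification by streaming the body straight to the
// client while teeing it into the cache writer.
func streamOnce(client io.Writer, cache io.Writer, upstream io.Reader) error {
	_, err := io.Copy(io.MultiWriter(client, cache), upstream)
	return err
}
```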

We're going to release a fix ASAP (at least a big improvement, even if it doesn't fix everything), but since the issue runs deep and we're partially off next week, don't expect anything before the next 2 weeks.

mga-chka avatar Aug 12 '22 17:08 mga-chka

Hi, FYI a fix is almost ready; it should be available next week (we will release a new version as soon as the PR is merged). You can read the description of the PR to see what to expect: https://github.com/ContentSquare/chproxy/pull/212

mga-chka avatar Aug 25 '22 23:08 mga-chka

A new version with the fix has been released: https://github.com/ContentSquare/chproxy/releases/tag/v1.17.0 Feel free to use it. I'm closing the issue; if this new version still doesn't solve your memory overhead, feel free to reopen it.

mga-chka avatar Aug 29 '22 18:08 mga-chka