chproxy Amount of keys in redis cache is constantly growing

trafficstars

chproxy v1.16.3

There are redis cache configured as following:

- name: cache-name
  mode: redis
  redis:
    addresses:
    - hostname:6379
  expire: 300s

There are following keys appearing:

799) "26243a2fa2435c33269f75da4e359f3e-transaction"
800) "62fcbcea7329cd2615aa4b5a792b0420-transaction"
801) "063cdf4e211ee0441f15323503515d57-transaction"
802) "802fa878e67d0516bcd41ddf3521e826-transaction"
803) "be3ca83940942e976e55f9c04f8ea44c-transaction"
804) "c8c7144ba9a34b53a6c6069facc3b3d5-transaction"
805) "1b604171cf60b2320b477e440e39cfcc-transaction"
806) "4406a4e641f31e7b91b3e24e59d9c6ef-transaction"
807) "ab841321dadb6921814bad6113b9b0bb-transaction"
808) "9205c903c51f8fcb1a9e9e9fa0a08959-transaction"
...

redis> TTL 9205c903c51f8fcb1a9e9e9fa0a08959-transaction
(integer) -1

redis> get 9205c903c51f8fcb1a9e9e9fa0a08959-transaction
"2"
redis> DEBUG OBJECT 9205c903c51f8fcb1a9e9e9fa0a08959-transaction
Value at:0x7fb41d22cd20 refcount:1 encoding:int serializedlength:2 lru:15865814 lru_seconds_idle:6

Somehow those keys has no TTL set, and it's amount is constantly growing, even if the actual cached result has been removed by the TTL

Despite the small size of those keys, with a large amount of transactions those keys may consume all the memory of the redis instance, and then pushed out by the LRU, when memory will be required. In this case, consumed memory will stay ~constant near the configured maximum for the instance.

Is it expected behavior?

Aug 09 '22 08:08 PhantomPhreak

It looks suspicious. It might come from the given change done 1 month ago that was done in order to be backward compatible with older version of redis servers https://github.com/ContentSquare/chproxy/pull/184 Could you do the same test on v1.16.0 (that doesn't contain this change) and give more details about the redis you're using (clustered, which version)?

Aug 09 '22 12:08 mga-chka

@mga-chka

Could you do the same test on v1.16.0 (that doesn't contain this change)

Tested, results seem to be the same

# ./chproxy --config chproxy-conf.yaml                                                                                                                                               
INFO: 2022/08/10 00:36:52 main.go:46: chproxy ver. 1.16.0, rev. c4aece8f482043ca7662087db864990711f1d9b0, built at 2022-06-24T11:46:12Z                                                                     
INFO: 2022/08/10 00:36:52 main.go:47: Loading config: chproxy-conf.yaml                                                                                                                                     
INFO: 2022/08/10 00:36:52 main.go:280: Loaded config:                                                                                                                                                       
server:                                                                                                                                                                                                     
  http:                                                                                                                                                                                                     
    listen_addr: :88                                                                                                                                                                                        
...
clusters:
- name: local
  scheme: http
  nodes:
  - 127.0.0.1:8123
  users:
  - name: default
    password: XXX
...
users:
- name: test
  password: XXX
  to_cluster: local
  to_user: default
...
  cache: test-redis
log_debug: true
hack_me_please: true
caches:
- mode: redis
  name: test-redis
  expire: 5m
  redis:
    addresses:
    - redis-cache-host:6379
INFO: 2022/08/10 00:36:52 main.go:55: Loading config "chproxy-conf.yaml": successful
INFO: 2022/08/10 00:36:52 main.go:160: Serving http on ":88"

After making some test queries picture is following:

redis-cache-host:6379> keys *
1) "2aa5102467ea68cb127356062a039731-transaction"
2) "039296291c7039831ebb50bb51f2789e-transaction"
3) "ceb4cf07c3edc2e325200c494f971a77-transaction"
4) "d0b49760f8ea57fda09c10850b379bd3-transaction"
5) "ece9ae938180cd203dc92ff2d0035a60-transaction"
redis-cache-host:6379> TTL ece9ae938180cd203dc92ff2d0035a60-transaction
(integer) -1
redis-cache-host:6379> DEBUG OBJECT ece9ae938180cd203dc92ff2d0035a60-transaction
Value at:0x7f7727c03158 refcount:1 encoding:int serializedlength:2 lru:15949883 lru_seconds_idle:1039
redis-cache-host:6379> INFO
# Server                                          
redis_version:6.0.16                  
redis_git_sha1:63c47a09                          
redis_git_dirty:1                                    
redis_build_id:f20f71e10920f6e0                  
redis_mode:standalone                             
os:Linux 4.19.0-8-amd64 x86_64        
arch_bits:64                                     
multiplexing_api:epoll                                                             
atomicvar_api:atomic-builtin                     
gcc_version:7.5.0
...
# Cluster                      
cluster_enabled:0

Aug 10 '22 07:08 PhantomPhreak

There is something weird. I just checked a few records fromour redis in production and the TTL from CHProxy is working well

[adresse of redis]:[port YYY]> TTL 127efbf543a03465d854a60b4fb557ed-transaction (integer) 15 [adresse of redis]:[port YYY]> TTL 127efbf543a03465d854a60b4fb557ed-transaction (integer) 9 [adresse of redis]:[port YYY]> TTL 123c1c91fe87cab63b22d083893106ae-transaction (integer) 12 [adresse of redis]:[port YYY]> TTL 1244629318897d9b7521ae88e79839e9-transaction (integer) 17

The main difference is that we're using redis version:5.0.6 (and chproxy 1.16.2). We will try to do some tests on redis 6 when we have time to see if the bug is linked to the version of redis. But if you have the possibility, could you try with a redis 5 ?

Aug 10 '22 10:08 mga-chka

After looking at the code, the issue might come from your configuration of chproxy.

chproxy stores 2 kinds of objects in Redis:

the transactions that are used to handle concurrent transactions: it prevents chproxy to run twice the same SQL queries on CH when they are not cached yet and used at the same time. They are very small objects whose keys are suffixed by "-transaction". It's the objects whose TTL doesn't work in your case. The TTL is automatically set to the value 2 * max_execution_time (2 minutes per default) or grace_time (which is supposed to be deprecated and you shouldn't use it). If your graceTime is negative or if your maxExecutionTime is set at 0 you could have some TTL issues.
the cached queries (potentially large objects) whose TTL is handled by the "expire" conf. You don't seem to have issues on this one given the result of your "keys *" in redis

So did you modified the max_execution_time or grace_time in your configuration file? If yes, what are they?

Aug 10 '22 10:08 mga-chka

@mga-chka

it prevents chproxy to run twice the same SQL queries on CH when they are not cached yet and used at the same time.

Okay, you're taking about thundering herd, explained in the doc Logic here is clear.

So did you modified the max_execution_time or grace_time in your configuration file? If yes, what are they?

I will show you the exact config i used to reproduce this problem:

# cat chproxy-conf.yaml 
---
log_debug: true
hack_me_please: true
caches:
- name: test-redis
  mode: redis
  redis:
    addresses:
    - redis-cache-host:6379
  expire: 300s
users:
- name: testuser
  password: password
  to_cluster: local
  to_user: default
  cache: test-redis
clusters:
- name: local
  scheme: http
  nodes:
  - 127.0.0.1:8123
  users:
  - name: default
  heartbeat_interval: 5s
server:
  http:
    listen_addr: :88
...

Reported config when chproxy is starting, some defaults has been applied:

# ./chproxy --config chproxy-conf.yaml                                                                                                                                               
INFO: 2022/08/10 04:07:20 main.go:46: chproxy ver. 1.16.0, rev. c4aece8f482043ca7662087db864990711f1d9b0, built at 2022-06-24T11:46:12Z                                                                    
INFO: 2022/08/10 04:07:20 main.go:47: Loading config: chproxy-conf.yaml
INFO: 2022/08/10 04:07:20 main.go:280: Loaded config:
server:
  http:
    listen_addr: :88
    forceautocerthandler: false
    read_timeout: 1m
    write_timeout: 90s
    idle_timeout: 10m
clusters:
- name: local
  scheme: http
  nodes:
  - 127.0.0.1:8123
  users:
  - name: default
    password: XXX
  heartbeat_interval: 5s
  heartbeat:
    interval: 5s
    timeout: 3s
    request: /?query=SELECT%201
    response: |
      1
users:
- name: testuser
  password: XXX
  to_cluster: local
  to_user: default
  max_execution_time: 30s
  cache: test-redis
log_debug: true
hack_me_please: true
caches:
- mode: redis
  name: test-redis
  expire: 5m
  redis:
    addresses:
    - redis-cache-host:6379
INFO: 2022/08/10 04:07:20 main.go:55: Loading config "chproxy-conf.yaml": successful
INFO: 2022/08/10 04:07:20 main.go:160: Serving http on ":88"
...

after connecting with clickhouse-cli and getting the promt i have following keys in the redis:

redis-cache-host:6379> keys *
1) "d0b49760f8ea57fda09c10850b379bd3-transaction"
2) "039296291c7039831ebb50bb51f2789e"
3) "d0b49760f8ea57fda09c10850b379bd3"
4) "039296291c7039831ebb50bb51f2789e-transaction"
redis-cache-host:6379> TTL d0b49760f8ea57fda09c10850b379bd3-transaction
(integer) -1
redis-cache-host:6379> TTL 039296291c7039831ebb50bb51f2789e-transaction
(integer) -1
redis-cache-host:6379>

as you can see, max_execution_time is set to default value 30s, but infite TTL is still set for redis object. grace_time is not used since it's deprecated.

Actually, i was going to make a separated issue about default value for the max_execution_time setting. There are 2 ways to control the maximum amount of time for the specific user in chproxy:

by setting max_execution_time in user config section of chproxy: https://github.com/ContentSquare/chproxy/tree/master/config#user_config By the way, doc is not correct since default value is not 0 now.
In the params group, by the max_execution_time tunable for the ClickHouse itself (not chproxy).

As you can see, there are 2 ways to configure the same parameter with the same variable (same name), and both of them may be applied at the same time! This is very confusing, query execution time limit should be configured only in one place. Before we were using v1.13.2, and unset max_execution_time value was just disabling the limit on the Chproxy side, allowing to set the limit on the ClickHouse side, and it was very usable because we was able to use param groups to set this limit. With a recent versions unset max_execution_time means 30s, which is quite weird because ClickHouse it's OLAP DB, where queries may take hours to finish. And it's necesseary to set explicitly for every user now, which is very annoying.

And it's not possible simply disable the max_execution_time on the chproxy side:

unset max_execution_time variable or 0 leads to default 30s, which is LOW
negative value is not valid
positive values require matching chproxy- and clickhouse-side timeouts which is annoying

For now i just set 86400s as max_execution_time on the chproxy side, which is guaranteed above all the clickhouse-side limits, to make sure chproxy-side limit will never come into play. Ugly hack :(

Aug 10 '22 11:08 PhantomPhreak

Thanks, we'll try to reproduce the issue on our side and fix it. Since the cached query works on your side (since they disappear from Redis cf the listing of keys in your first post where all the keys end will a "-transaction") , it can't be linked to a redis version issue or something related to your infrastructure.

Regarding the max_execution_time, you're not the first one complaining about this new default value, this is why the default value is now at 2 minutes in the last version of chproxy. We made this change because of the way the concurrent query handling worked but I think we will find another way so that the max_execution_time can be back to unlimited.

FYI, if you set 86400s as max_execution_time, the "-transactions" will stay 172800 sec/48 hours in redis (once we find the current issue with the TTL).

Aug 10 '22 12:08 mga-chka

FYI I was able to reproduce the bug. There is an ongoing PR regarding the way we fetch the max_execution_time in the config because it's a bit broken and will not work for many users of chproxy. https://github.com/ContentSquare/chproxy/pull/190. It should be merged soon. Until the next release of chproxy, you can bypass the pb by setting the grace_time in the conf for example here is a working conf similar to the one you tried

log_debug: true hack_me_please: true caches:

name: test-redis mode: redis redis: addresses: - 127.0.0.1:6379 expire: 300s grace_time: 300s users:
name: user password: pwd to_cluster: local to_user: default cache: test-redis max_concurrent_queries: 2 max_execution_time: 30s clusters:
name: local scheme: http nodes:
- 127.0.0.1:8123 users:
- name: default heartbeat_interval: 5s server: http: listen_addr: ":9090" allowed_networks: ["127.0.0.0/24"]

Aug 10 '22 13:08 mga-chka

@mga-chka

We made this change because of the way the concurrent query handling worked but I think we will find another way so that the max_execution_time can be back to unlimited.

Thanks, that would be nice. Actually, i was pretty much happy with per-cache grace_time setting, it was easier to configure and logic behind it was clear.

FYI, if you set 86400s as max_execution_time, the "-transactions" will stay 172800 sec/48 hours in redis (once we find the current issue with the TTL).

That's true, but i don't see so far how it will affect the behavior. For now, TTL is even set to -1 (unlimited) for the the "-transactions", but caching logic still working fine as i tested. Maybe i'm just missing something?

Until the next release of chproxy, you can bypass the pb by setting the grace_time in the conf for example here is a working conf similar to the one you tried

Actually, it seem working fine so far with high max_execution_time value as i mentioned above, except the fact that amount of keys are growing, which is not really noticeable in the real-world applications. Am i missing any other flaw of this approach?

Aug 11 '22 09:08 PhantomPhreak

no your not missing something. Putting the grace_time would just avoid the garbage in Redis from the "-transaction" items. If your redis is configured to automatically remove objects and you don't mind the pollution from those small objects then you can keep it as it is.

Aug 11 '22 17:08 mga-chka

Frankly say, it's not a big issue so far. I hope https://github.com/ContentSquare/chproxy/pull/190 will be merged prior my 1GiB redis instance will be completely hogged by 1-2 bytes records :) But i'll keep your suggestion in mind, thank you!

Aug 12 '22 05:08 PhantomPhreak

since the https://github.com/ContentSquare/chproxy/pull/190 was merged and a new version was released, I'm closing this issue. If you still have issues, feel free to re-open this PR.

Aug 23 '22 14:08 mga-chka

Thanks guys, i've just tried v1.16.4 and TTL for *-transaction keys set correctly in redis: 2xmax_execution_time

Aug 24 '22 08:08 PhantomPhreak

chproxy chproxy copied to clipboard

Amount of keys in redis cache is constantly growing

chproxy
chproxy copied to clipboard