
Problem of memory

Open flionet89 opened this issue 2 years ago • 23 comments

Good afternoon! I ran into a memory leak. I installed Odyssey and routed all of our microservices through it, about 100 pods. As a result, Odyssey does not release memory. I am using this config: conf yandex.txt. The memory graph looks like this: image. The log has this message: Can you tell me what the problem might be?

flionet89 avatar Mar 01 '22 14:03 flionet89

Same issue here with version 1.2

image

cleocir avatar May 23 '22 18:05 cleocir

Oops, sorry, somehow I missed this issue.

Do you have something specific in your workload? So far I know about one leak in error passing from server to client, but do not have a reproduction yet.

x4m avatar May 24 '22 04:05 x4m

I use auth_query and nothing more. Does the system worker keep connections without ever freeing them? At start:

1 2022-06-20T11:13:22Z info [none none] (stats) clients 0
1 2022-06-20T11:13:22Z info [none none] (stats) worker[0]: msg (0 allocated, 0 cached, 3 freed, 0 cache_size), coroutines (1 active, 0 cached), clients_processed: 0
1 2022-06-20T11:13:25Z info [none none] (stats) system worker: msg (5 allocated, 0 cached, 1 freed, 0 cache_size), coroutines (3 active, 0 cached) startup errors 0

in a few hours:

1 2022-06-20T16:11:04Z info [none none] (stats) clients 0
1 2022-06-20T16:11:04Z info [none none] (stats) worker[0]: msg (0 allocated, 0 cached, 5951 freed, 0 cache_size), coroutines (1 active, 0 cached), clients_processed: 0
1 2022-06-20T16:11:07Z info [none none] (stats) system worker: msg (5953 allocated, 0 cached, 1 freed, 0 cache_size), coroutines (3 active, 0 cached) startup errors 0

And in the Kubernetes stats, memory usage increases: photo_2022-06-20_19-20-55
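To follow the trend in the stats lines above over a longer log, the two counters can be pulled out per sample. This is only a sketch: it assumes the log format shown in this thread (timestamp in the second field), and the function name `system_worker_msgs` is made up for illustration. It does not interpret what the counters mean internally; it just makes the "allocated grows, freed stays flat" pattern easy to see.

```shell
#!/bin/sh
# Print "timestamp allocated freed" for every "system worker" stats line.
# Assumption: lines look like the ones quoted in this thread, e.g.
#   1 <timestamp> info [none none] (stats) system worker: msg (5 allocated, ... 1 freed, ...)
# Reads the files given as arguments, or stdin when none are given.
system_worker_msgs() {
    awk '/\(stats\) system worker: msg/ {
        alloc = ""; freed = ""
        for (i = 1; i <= NF; i++) {
            # The number sits in the field just before its label.
            if ($i ~ /^allocated/) { alloc = $(i - 1) }
            if ($i ~ /^freed/)     { freed = $(i - 1) }
        }
        sub(/^\(/, "", alloc)   # drop the "(" glued to the first counter
        print $2, alloc, freed
    }' "$@"
}

# Example (log path taken from the config posted later in this thread):
#   system_worker_msgs /var/log/postgresql/odyssey.log
```

On the two system worker samples quoted above, this would print 5 and 1 at start, and 5953 and 1 a few hours later.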

alexdyukov avatar Jun 20 '22 16:06 alexdyukov

Hello, I am facing the same problem.

memoryleaksodyssey

What kind of diagnostics can be done?

vilyansky avatar Dec 25 '22 13:12 vilyansky

Same problem. Version 1.3 image image

GRouslan avatar Mar 16 '23 09:03 GRouslan

Good afternoon. Is there any understanding of what the problem might be?

flionet89 avatar Apr 17 '23 09:04 flionet89

Yes, there is: the root cause is probably a memory leak in the auth_query implementation. I'm not actively working on this leak right now, because issues 487 and 483 are in my scope.

x4m avatar Apr 17 '23 10:04 x4m

This is still happening. I can reproduce it easily with

while true
do
    sudo -u postgres pgbench -c 500 -r -T10 -j40 -h 127.0.0.1 -U testuser test
done

and then see that RSS memory grows steadily.

evkuzin avatar Jul 25 '23 08:07 evkuzin

In fact, there are leaks without auth_query, especially if you increase stack_size

ilya-maltsev avatar Jul 25 '23 10:07 ilya-maltsev

@x4m, any chance you could look into it in the near future? 🙏

evkuzin avatar Jul 25 '23 13:07 evkuzin

I'll definitely look into this, but it does not reproduce for me with regular pgbench.

x4m avatar Jul 25 '23 13:07 x4m

@x4m Can you show the odyssey.conf that you used in your tests?

ilya-maltsev avatar Jul 25 '23 13:07 ilya-maltsev

Here is mine

daemonize no
pid_file "/var/lib/odyssey/odyssey.pid"
locks_dir "/run/odyssey"
graceful_die_on_errors no
#enable_online_restart yes
bindwith_reuseport yes
log_file "/var/log/postgresql/odyssey.log"
log_format "%p %t %l [%i %s] (%c) %m\n"
log_to_stdout no
log_syslog no
log_syslog_ident "odyssey"
log_syslog_facility "daemon"
log_debug no
log_config no
log_session no
log_query no
log_stats yes
stats_interval 60
promhttp_server_port 7777
workers 5
resolvers 2
readahead 8192
cache_coroutine 5
coroutine_stack_size 8
nodelay yes
keepalive 15
keepalive_keep_interval 75
keepalive_probes 9
keepalive_usr_timeout 0
bindwith_reuseport yes
unix_socket_dir "/var/run/postgresql"
unix_socket_mode "0777"

listen {
    host "*"
    port 5432
    backlog 256
    tls "allow"
    tls_ca_file "/etc/ca-certificates/root.crt"
    tls_key_file "/etc/odyssey/odyssey.key"
    tls_cert_file "/etc/odyssey/odyssey.crt"
    compression no
}

listen {
    port 5432
    backlog 256
    compression no
}


storage "postgres_server" {
    type "remote"
    host "127.0.0.1"
    port 5433
}

storage "postgres_server_unixsock" {
    type "remote"
    port 5433
}

storage "local" {
    type "local"
}

database default {
    user default {
        authentication "md5"
        auth_query "SELECT uname, phash FROM user_lookup($1)"
        auth_query_db "postgres"
        auth_query_user "odyssey"
        storage "postgres_server"
        pool "transaction"
        pool_size 100
        pool_timeout 0
        pool_ttl 60
        pool_discard yes
        pool_cancel yes
        pool_rollback yes
        pool_client_idle_timeout 0
        pool_idle_in_transaction_timeout 0
        client_fwd_error yes
        application_name_add_host yes
        reserve_session_server_connection yes
        server_lifetime 3600
        log_debug no
        quantiles "0.99,0.95,0.5"
    }
    user "odyssey" {
        authentication "none"
        storage "postgres_server_unixsock"
        pool "session"
        pool_routing "internal"
    }
}
database "odyssey_console" {
    user "prometheus" {
        authentication "md5"
        auth_query "SELECT uname, phash FROM user_lookup($1)"
        auth_query_db "postgres"
        auth_query_user "odyssey"
        storage "local"
        pool "session"
        role "stat"
    }
}

evkuzin avatar Jul 25 '23 13:07 evkuzin

Did you manage to reproduce it? May I help somehow with reproducing it?

evkuzin avatar Aug 03 '23 15:08 evkuzin

I'm overwhelmed by the number of tasks, sorry... currently we are working on making auth_query better (support for SCRAM, caching etc), I hope we will track this leak in that project.

x4m avatar Aug 03 '23 18:08 x4m

Hi! How is the work going? When shall we expect the new version?

evkuzin avatar Aug 24 '23 15:08 evkuzin

@x4m in case it could help. Disabling the auth query + turning off all logging allowed running it more or less stable. So it's not exactly an auth query issue, I think. It might be a machinarium issue.

evkuzin avatar Aug 29 '23 16:08 evkuzin

Recently I fixed one leak in query cancelling (#527), but so far there is no further progress on this issue. @evkuzin, is there some specific kind of logging that seems to leak? I do not even have a reproduction. Some folks hinted that lowering pool_ttl might highlight some leaks, though I could not reproduce it yet.

x4m avatar Sep 10 '23 10:09 x4m

Thank you for looking into it!

Currently, it works for me if I disable all possible logging everywhere. image This picture shows the amount of free memory on the node. Both replicas ran out of memory (OOM); then one was restarted with logging disabled and a static user config, and the second (the one eating all the memory) with a config like the one I posted above. Try the config above and watch the RSS of Odyssey:

while true
do
    pgbench -n -P2 -C -c 500 -t100000 -r -j100 -h ODYSSEY_HOST -U test sbtest
done

while true
do
    sudo ps -axo pid,rss | grep $(pgrep odyssey) | awk '{print $2}'; sleep 10
done
8808
24760
25176
25088
25136
25044
25356
25432
25452
25288
25316
25320
25704
25736
25708
25564
25628
25836
25764
25876
25776
25800
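The sampling loop above depends on `pgrep` returning exactly one PID, and `grep` can also match that number inside an RSS value. A slightly more defensive variant might look like the sketch below. The helper names `rss_kb_of_pids` and `rss_growth` are hypothetical, and the `/proc` branch assumes Linux, with `ps -o rss=` as a fallback elsewhere.

```shell
#!/bin/sh
# Sum the resident set size (in KB) of the PIDs passed as arguments.
# Prefers /proc/<pid>/status (Linux); falls back to ps otherwise.
rss_kb_of_pids() {
    total=0
    for pid in "$@"; do
        if [ -r "/proc/$pid/status" ]; then
            kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
        else
            kb=$(ps -o rss= -p "$pid" 2>/dev/null | tr -d ' ')
        fi
        # A PID that vanished or has no RSS counts as 0.
        total=$((total + ${kb:-0}))
    done
    echo "$total"
}

# Read one RSS sample per line on stdin and print "first last delta".
rss_growth() {
    awk 'NR == 1 { first = $1 }
         { last = $1 }
         END { if (NR) print first, last, last - first }'
}

# Example (the process name "odyssey" is an assumption about your setup):
#   while true; do rss_kb_of_pids $(pgrep -x odyssey); sleep 10; done
```

Feeding the recorded samples to `rss_growth` gives a single number for how much RSS grew over the run, which is easier to compare between builds or configs than a raw column of values.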

evkuzin avatar Sep 11 '23 17:09 evkuzin

I'll try to reproduce it tomorrow with a build from master that includes your fix.

evkuzin avatar Sep 11 '23 17:09 evkuzin

I think my problem is that I am somehow building the binary incorrectly. Could that be the case? I opened another issue:

https://github.com/yandex/odyssey/issues/538

evkuzin avatar Oct 31 '23 16:10 evkuzin

And yes, memory has stopped leaking (at least on the test stand where the previous version leaked). Thank you!

evkuzin avatar Oct 31 '23 17:10 evkuzin

I tried disabling logging and prometheus. It still leaks with auth_query (built on "Possible fix for mem leak", https://github.com/yandex/odyssey/pull/685). A static user configuration is too much hassle for my case. I have to restart the poolers roughly once a day because of this.

12 hours of leaking: image

Object905 avatar Sep 19 '24 06:09 Object905