eye Eye severe memory leak

Hi Konstantin

I have Eye 0.9.2 installed on ubuntu 14.04

the monitored processes are:

reDash
  celery_scheduler ................ up  (Nov26, 0%, 74Mb, <1290>)
    child-1303 .................... up  (Nov26, 0%, 67Mb, <1303>)
    child-25635 ................... up  (15:33, 0%, 87Mb, <25635>)
    child-25953 ................... up  (15:37, 0%, 80Mb, <25953>)
  celery_worker ................... up  (Nov26, 1%, 72Mb, <1127>)
    child-25875 ................... up  (15:36, 0%, 85Mb, <25875>)
    child-25946 ................... up  (15:37, 0%, 85Mb, <25946>)
    child-25947 ................... up  (15:37, 0%, 85Mb, <25947>)
    child-25948 ................... up  (15:37, 0%, 80Mb, <25948>)
  flower .......................... up  (Nov26, 0%, 197Mb, <1132>)
  gunicorn ........................ up  (Nov26, 0%, 14Mb, <1304>)
    child-10761 ................... up  (12:30, 0%, 80Mb, <10761>)
    child-12652 ................... up  (12:54, 0%, 85Mb, <12652>)
    child-13920 ................... up  (13:10, 0%, 81Mb, <13920>)
    child-15800 ................... up  (13:34, 0%, 81Mb, <15800>)
    child-23143 ................... up  (15:08, 0%, 79Mb, <23143>)
    child-27819 ................... up  (09:14, 0%, 96Mb, <27819>)
    child-28854 ................... up  (09:28, 0%, 83Mb, <28854>)
    child-29043 ................... up  (09:30, 0%, 81Mb, <29043>)
    child-29362 ................... up  (09:34, 0%, 81Mb, <29362>)
  nginx ........................... up  (Nov26, 0%, 4Mb, <1372>)
    child-1374 .................... up  (Nov26, 0%, 4Mb, <1374>)
    child-1375 .................... up  (Nov26, 0%, 4Mb, <1375>)
    child-1376 .................... up  (Nov26, 0%, 4Mb, <1376>)
    child-1377 .................... up  (Nov26, 0%, 4Mb, <1377>)
  postgresql ...................... up  (15:28, 0%, 188Mb, <25069>)
    child-25071 ................... up  (15:28, 0%, 53Mb, <25071>)
    child-25072 ................... up  (15:28, 0%, 50Mb, <25072>)
    child-25073 ................... up  (15:28, 0%, 1Mb, <25073>)
    child-25074 ................... up  (15:28, 0%, 2Mb, <25074>)
    child-25075 ................... up  (15:28, 0%, 1Mb, <25075>)
    child-25181 ................... up  (15:29, 0%, 9Mb, <25181>)
    child-25200 ................... up  (15:29, 0%, 9Mb, <25200>)
    child-25233 ................... up  (15:30, 0%, 9Mb, <25233>)
    child-25268 ................... up  (15:30, 0%, 11Mb, <25268>)
    child-25371 ................... up  (15:30, 0%, 9Mb, <25371>)
    child-25377 ................... up  (15:30, 0%, 9Mb, <25377>)
    child-25540 ................... up  (15:32, 0%, 9Mb, <25540>)
    child-25541 ................... up  (15:32, 0%, 9Mb, <25541>)
    child-25624 ................... up  (15:33, 0%, 9Mb, <25624>)
    child-25639 ................... up  (15:33, 0%, 14Mb, <25639>)
    child-25876 ................... up  (15:36, 0%, 13Mb, <25876>)
    child-25949 ................... up  (15:37, 0%, 13Mb, <25949>)
    child-25950 ................... up  (15:37, 0%, 13Mb, <25950>)
  redis ........................... up  (Nov26, 128%, 10Mb, <1159>)

they add up to ~1500MB

but Eye is taking 18GB!

root 1087 0.5 54.3 25626424 17915500 ? Sl Nov26 26:40 eye monitoring v0.9.2 [reDash] (in /home/ubuntu)

you can see in this picture that the memory is leaking:

Any Idea how to find this memory leak?

Nov 29 '17 16:11 innovia

show eye x output when it in 18gb

Nov 29 '17 16:11 kostya

Ok i had to restart it - it will happen again

here's the x info now after restart:

about:     Eye v0.9.2 (c) 2012-2016 @kostya
resources: 16:19, 0%, 115Mb, <18658>
ruby:      ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
gems:      ["Celluloid=0.17.3", "Celluloid::IO=0.17.3", "StateMachines=0.5.0", "NIO=2.1.0", "Timers=4.1.2", "Sigar=1.7.0.0"]
logger:    /opt/redash/logs/eye.log
home:      /root
dir:       /var/run/eye
pid_path:  /var/run/eye/pid
sock_path: /var/run/eye/sock
actors:    [["Eye::ChildProcess", 33], ["Eye::Process", 7], ["Eye::Group", 1], ["Eye::Controller", 1], ["Eye::SystemResources::Cache", 1], ["Eye::Server", 1], ["Celluloid::Supervision::Service::Public", 1], ["Celluloid::IncidentReporter", 1], ["Celluloid::Notifications::Fanout", 1], ["Celluloid::Supervision::Service::Root", 1]]

Nov 29 '17 17:11 innovia

btw i not think you need to monitor children for postgres or nginx, its quite strange use case.

Nov 29 '17 18:11 kostya

I just wanted to see how much each of the processes are taking but now it's seems like eye is the cause of that memory leak.

Once we stablize it I'll remove these children.

Nov 29 '17 18:11 innovia

here it is after a day 6GB!

root@reDash:~# eye x
about:     Eye v0.9.2 (c) 2012-2016 @kostya
resources: Nov29, 0%, 6093Mb, <18658>
ruby:      ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
gems:      ["Celluloid=0.17.3", "Celluloid::IO=0.17.3", "StateMachines=0.5.0", "NIO=2.1.0", "Timers=4.1.2", "Sigar=1.7.0.0"]
logger:    /opt/redash/logs/eye.log
home:      /root
dir:       /var/run/eye
pid_path:  /var/run/eye/pid
sock_path: /var/run/eye/sock
actors:    [["Eye::ChildProcess", 32], ["Eye::Process", 7], ["Eye::Group", 1], ["Eye::Controller", 1], ["Eye::SystemResources::Cache", 1], ["Eye::Server", 1], ["Celluloid::Supervision::Service::Public", 1], ["Celluloid::IncidentReporter", 1], ["Celluloid::Notifications::Fanout", 1], ["Celluloid::Supervision::Service::Root", 1]]

Nov 30 '17 22:11 innovia

This is really strange, i not see leaking in actors. So hard to say where it can be. Can you show full config? may be you have some global variables which not cleans. You have only 7 processes. I have eye process running for half year for 156 processes. Also there is difference in ruby, and also in gems StateMachines, NIO, Timers (which potentially can give leak).

about:     Eye v0.9.1 (c) 2012-2016 @kostya
resources: Apr26, 0%, 106Mb, <32291>
ruby:      ruby 1.9.3p484 (2013-11-22) [x86_64-linux]
gems:      ["Celluloid=0.17.3", "Celluloid::IO=0.17.1", "StateMachines=0.4.0", "NIO=1.1.1", "Timers=4.1.1", "Sigar=1.7.0.0"]
logger:    /projects/client/log/eye.log
home:      /home/deploy
dir:       /home/deploy/.eye
pid_path:  /home/deploy/.eye/pid
sock_path: /home/deploy/.eye/sock
actors:    [["Eye::Process", 156], ["Eye::Group", 29], ["Eye::Server", 1], ["Eye::SystemResources::Cache", 1], ["Eye::Controller", 1], ["Celluloid::Supervision::Service::Public", 1], ["Celluloid::IncidentReporter", 1], ["Celluloid::Notifications::Fanout", 1], ["Celluloid::Supervision::Service::Root", 1]]

Nov 30 '17 23:11 kostya

im trying to reinstall the server on 16.04, will let you know

Nov 30 '17 23:11 innovia

i rerun my eye with

ruby:      ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
gems:      ["Celluloid=0.17.3", "Celluloid::IO=0.17.3", "StateMachines=0.5.0", "NIO=2.1.0", "Timers=4.1.2", "Sigar=1.7.0.0"]

so will see, is there any leak or not

Nov 30 '17 23:11 kostya

i not see any leaks with new gems and ruby 2.2.2, so may be this is problem in config

Dec 01 '17 21:12 kostya

here's my config

#!/usr/bin/env ruby
Eye.load '/etc/eye/mailer.rb' # mailer set params (like variables)
Eye.load '/etc/eye/cloudwatch.rb'
Eye.load '/etc/eye/config.rb' # config assign params values

Eye.application :reDash do
  working_dir "/opt/redash/current"

  load_env "/root/.env"

  trigger :flapping, times: 3, within: 1.minute, retry_in: 5.minutes
  notify :by_email, :info
  notify :cloudwatch, :info

  process(:postgresql) do
    pid_file "/var/run/postgresql/9.3-main.pid"
    stdall "/opt/redash/logs/postgresql"
    start_command "service postgresql start"
    stop_command "service postgresql stop"
    restart_command "service postgresql restart"
  end

  process(:nginx) do
    depend_on :gunicorn
    pid_file "/var/run/nginx.pid"
    stdall "/opt/redash/logs/nginx.log"
    start_command "/usr/sbin/nginx"
    stop_signals [:QUIT, 30.seconds, :TERM, 15.seconds, :KILL]
    restart_command "kill -HUP {PID}"
    daemonize true
  end

  process(:redis) do
    pid_file "/var/run/redis.pid"
    stdall "/opt/redash/logs/redis.log"
    start_command "/usr/local/bin/redis-server /etc/redis/6379.conf"
    stop_signals [:TERM, 30.seconds, :QUIT]
    restart_command  "kill -HUP {{PID}}"
    daemonize true
  end

  process(:gunicorn) do
    uid 'redash'
    gid 'nogroup'
    depend_on :redis
    pid_file "/var/run/gunicorn/gunicorn.pid"
    stdall "/opt/redash/logs/gunicorn.log"
    start_command "gunicorn -b unix:///var/run/gunicorn/gunicorn.sock --name redash -w 9 --max-requests 1000 redash.wsgi:app"
    stop_signals [:TERM, 30.seconds, :QUIT]
    restart_command  "kill -HUP {{PID}}"
    daemonize true
    monitor_children do
      stop_command "kill -TERM {PID}"
      check :cpu, :every => 30, :below => 80, :times => 3
      check :memory, :every => 30, :below => 350.megabytes, :times => [3,5]
    end
  end

  process(:flower) do
    uid 'redash'
    gid 'nogroup'
    pid_file "/var/run/celery/flower.pid"
    stdall "/opt/redash/logs/flower.log"
    start_command "celery flower -A redash.worker --logging=debug --broker=redis://127.0.0.1:6379/0 --broker_api=redis://127.0.0.1:6379/0 --address=0.0.0.0 --port=5555 --persistent"
    stop_signals [:TERM, 30.seconds, :QUIT]
    restart_command  "kill -HUP {{PID}}"
    check :cpu, :every => 30, :below => 80, :times => 3
    check :memory, :every => 30, :below => 250.megabytes, :times => [3,5]
    daemonize true
  end

  process(:celery_scheduler) do
    uid 'redash'
    gid 'nogroup'
    pid_file "/var/run/celery/celery_schedule_worker.pid"
    stdall "/opt/redash/logs/celery_schedule_worker.log"
    start_command "celery worker --app=redash.worker --beat --concurrency=4 --queues=queries,celery --maxtasksperchild=100 --events -Ofair --autoscale=10,4 -n redash_celery_scheduled@%h"
    stop_signals [:TERM, 30.seconds, :QUIT]
    restart_command  "kill -HUP {{PID}}"
    daemonize true
    monitor_children do
      stop_command "kill -TERM {PID}"
      check :cpu, :every => 30, :below => 80, :times => 3
      check :memory, :every => 30, :below => 512.megabytes, :times => [3,5]
    end
  end

   process(:celery_worker) do
    uid 'redash'
    gid 'nogroup'
    pid_file "/var/run/celery/celery_worker.pid"
    stdall "/opt/redash/logs/celery_worker.log"
    start_command "celery worker --app=redash.worker --concurrency=4 --queues=scheduled_queries --maxtasksperchild=100 --events -Ofair  --autoscale=10,4 -n redash_celery_worker@%h"
    stop_signals [:TERM, 30.seconds, :QUIT]
    restart_command  "kill -HUP {{PID}}"
    daemonize true
    monitor_children do
      stop_command "kill -TERM {PID}"
      check :cpu, :every => 30, :below => 80, :times => 3
      check :memory, :every => 30, :below => 400.megabytes, :times => [3,5]
    end
  end
end

Dec 04 '17 10:12 innovia

not see anything bad in config: i think we should try 3 things:

me for try ruby 2.4.1 (cause i try only 2.2.2)
you try remove monitor_children and see if leak still here.
you try remove depend_on and see if leak still here.

btw using restart_command and daemonize true quite strange. for example unicorn also used restart_command, but it run with daemonize false, because it potentially can create some problems with restarts.

Dec 04 '17 10:12 kostya

Im trying to downgrade to 2.2.2

i'll do what you suggested

Dec 04 '17 11:12 innovia

downgrading to 2.2.2 and disabling all of these dependancies and children stabilized it at 100Mb

i'll try adding back the children monitoring

Dec 06 '17 07:12 innovia

how it is?

Dec 07 '17 15:12 kostya

its growing

now at 700MB

eye x about: Eye v0.9.2 (c) 2012-2016 @kostya resources: Dec04, 0%, 776Mb, <1089> ruby: ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux] gems: ["Celluloid=0.17.3", "Celluloid::IO=0.17.3", "StateMachines=0.5.0", "NIO=2.1.0", "Timers=4.1.2", "Sigar=1.7.0.0"] logger: /opt/redash/logs/eye.log home: /home/ubuntu dir: /var/run/eye pid_path: /var/run/eye/pid sock_path: /var/run/eye/sock actors: [["Eye::ChildProcess", 18], ["Eye::Process", 7], ["Eye::Notify::Mail", 1], ["Eye::Group", 1], ["Eye::Controller", 1], ["Eye::SystemResources::Cache", 1], ["Eye::Server", 1], ["Celluloid::Supervision::Service::Public", 1], ["Celluloid::IncidentReporter", 1], ["Celluloid::Notifications::Fanout", 1], ["Celluloid::Supervision::Service::Root", 1]]

Dec 07 '17 18:12 innovia

ok, can you test separately monitor_children and depend_on, to see where bug is.

Dec 07 '17 20:12 kostya

I didnt add depend_on at all

Dec 07 '17 20:12 innovia

so bug in monitor_children, it remove all of it, it just works without leaks.

Dec 07 '17 21:12 kostya

confirm that leak in monitor_children, i'll try to fix it.

Dec 08 '17 10:12 kostya