
Extremely high memory utilization running locally on certain linux distributions

Open e0d opened this issue 2 years ago • 5 comments

Bug description: On Red Hat variants, containers consume an unreasonably large amount of memory when running

tutor local start

The issue is especially acute with the mysql container, but it also affects the lms and cms containers.

image

Memory is consumed until the OOM killer terminates the container, which is then immediately restarted.

How to reproduce

Everything is vanilla.

Simply using

tutor local start

Environment OS: 5.17.7-200.fc35.x86_64 Tutor: tutor, version 13.2.2

Additional context: This can be (and has been) resolved by adding the following to a few service definitions in the local version of docker-compose.yml:

mysql:
    blah: foo
    ...
    ulimits:
      nproc: 65535
      nofile:
        soft: 26677
        hard: 46677

If you are amenable, I can submit a PR to add this to the template with config variables. Sane limits seem generally good. I would consider adding them to mysql, mongo, elasticsearch, lms, and cms.

After picture looks like so:

image
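To make the idea concrete: if ulimits were added to the template with config variables, the mysql service in Tutor's local docker-compose.yml template might look roughly like the sketch below. The `MYSQL_ULIMIT_*` variable names are purely illustrative and are not existing Tutor settings; only `DOCKER_IMAGE_MYSQL` is a real Tutor configuration value.

```yaml
# Sketch only: hypothetical template fragment, not an actual Tutor change.
mysql:
  image: {{ DOCKER_IMAGE_MYSQL }}
  ulimits:
    nproc: {{ MYSQL_ULIMIT_NPROC }}        # e.g. 65535
    nofile:
      soft: {{ MYSQL_ULIMIT_NOFILE_SOFT }} # e.g. 26677
      hard: {{ MYSQL_ULIMIT_NOFILE_HARD }} # e.g. 46677
```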

e0d avatar May 19 '22 13:05 e0d

I need to investigate this further. A quick search has yielded the following results:

  • https://github.com/docker-library/mysql/issues/579
  • https://bugzilla.redhat.com/show_bug.cgi?id=1708115

regisb avatar May 19 '22 13:05 regisb

I think it could be a bug between Docker and your distro/kernel. I use a Swarm cluster in production, where I add a resources config to the docker-compose file. Running Tutor locally, I've never had this problem.

Example:

services:
  service:
    image: nginx
    deploy:
      resources:
        limits:
          cpus: "0.50"
          memory: 512M
        reservations:
          cpus: "0.25"
          memory: 128M

You can do the same using customizing-the-deployed-services.

erickhgm avatar May 19 '22 14:05 erickhgm

Docker sometimes behaves strangely and is hard to debug when it does. For another customer running Docker on an old Fedora installation (basically a thumbnailer built on Puppeteer and puppeteer-cluster) with Yelp's dumb-init, after 2-3 days the whole Docker systemd process was nuked, with no helpful entry in the logs. I switched to tini (thanks to Tutor for the inspiration), and the container has now been up and running for more than a week without any problem!

insad avatar May 19 '22 17:05 insad

I'm not a big fan of setting ulimits in stone for the Open edX containers. It might help us resolve this particular bug, but other issues are sure to appear later when some people need to exceed these limits. I suspect that there is an underlying bug not related to Tutor. Can you investigate this issue further @e0d? In particular, what version of Docker are you running?

regisb avatar May 23 '22 14:05 regisb

Using overrides is an acceptable solution, but personally I prefer setting sane limits because the failure mode is better. If a container cannot start because it needs more memory than it is allowed to have, that is a fast failure, and the error message is likely to be clear. When a container takes all of your free RAM, the system becomes nearly unresponsive and the OOM killer ends up fighting Docker Compose; it's a bit of a mess.

I suspect that there's a system-wide limit on macOS and Ubuntu that prevents MySQL from using as much memory as it can. From what I can tell, this would be an issue on Red Hat variants and Arch.

I'm using Docker version 20.10.16.

If nobody else ever sees the issue, it may not be worth the effort.

e0d avatar May 23 '22 14:05 e0d

Adding a +1 to this since I experienced it as well for the first time today. I fixed it by following these instructions.

keithgg avatar Mar 09 '23 13:03 keithgg

Thanks for commenting @keithgg. Can you please revert your fix so that we can diagnose a little more precisely? What's your OS?

What's the output of the following commands?

$ cat /proc/$(pgrep dockerd)/limits
$ systemctl cat docker.service | grep LimitNOFILE
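The first command above can also be narrowed to just the file-descriptor line; a minimal sketch that assumes only a Linux /proc filesystem:

```shell
# Print only the "Max open files" row for the current shell; replace "self"
# with the dockerd PID (e.g. $(pgrep dockerd)) to inspect the daemon instead.
grep "Max open files" /proc/self/limits
```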

regisb avatar Mar 09 '23 13:03 regisb

@regisb

Can you please revert your fix so that we can diagnose a little more precisely?

Sure, just let me know what you need. I took this screenshot before making the fix: the LMS, CMS, and MySQL containers are all using as much memory as they can. FWIW, I don't think this is necessarily a Tutor issue. There's more discussion happening here.

Screenshot_20230309_115442

What's your OS?

I'm running EndeavourOS (which is basically Arch). My limits have always been high, because JavaScript. I didn't change any of them while trying to fix this issue; I just made the fix mentioned above.

$ cat /proc/$(pgrep dockerd)/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             unlimited            unlimited            processes 
Max open files            1073741816           1073741816           files     
Max locked memory         8388608              8388608              bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       192261               192261               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
$ systemctl cat docker.service | grep LimitNOFILE
LimitNOFILE=infinity

keithgg avatar Mar 09 '23 13:03 keithgg

OK, I think I now have a better understanding of what this issue is.

For me, the fact that the issue affects multiple containers (LMS, CMS, mysql) is a confirmation that we should not hardcode ulimits for everyone.

Still, users on RedHat/Fedora/Arch are bound to face this issue. So at the very least we should add a section to the troubleshooting docs: https://docs.tutor.overhang.io/troubleshooting.html
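For such a troubleshooting section, the usual workaround on affected distros is to stop dockerd from passing `LimitNOFILE=infinity` down to containers, either via `default-ulimits` in daemon.json or via a systemd drop-in. A sketch of the drop-in variant; the cap of 1048576 is an illustrative value, not an official recommendation:

```ini
# /etc/systemd/system/docker.service.d/nofile.conf (hypothetical drop-in)
[Service]
LimitNOFILE=1048576
```

Then reload and restart the daemon with `sudo systemctl daemon-reload && sudo systemctl restart docker`.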

@keithgg would you like to open a PR or should I do it?

regisb avatar Mar 09 '23 15:03 regisb

I didn't have time to dig deeply into this, so my solution is rather ham-fisted.

In .local/share/tutor/env/local/docker-compose-override.yml I have added:

services:
    mysql:
      deploy:
        resources:
          limits:
            memory: 2G      
      ulimits:
        nproc: 65535
        nofile:
          soft: 26677
          hard: 46677
    credentials:
      deploy:
        resources:
          limits:
            memory: 2G 
    discovery:
      deploy:
        resources:
          limits:
            memory: 2G 
    lms:
      deploy:
        resources:
          limits:
            memory: 2G 
    cms:
      deploy:
        resources:
          limits:
            memory: 2G 
    ecommerce:
      deploy:
        resources:
          limits:
            memory: 2G
    mongodb:
      deploy:
        resources:
          limits:
            memory: 2G 
    redis:
      deploy:
        resources:
          limits:
            memory: 2G 
    elasticsearch:
      deploy:
        resources:
          limits:
            memory: 2G
    lms-worker:
      deploy:
        resources:
          limits:
            memory: 2G
    cms-worker:
      deploy:
        resources:
          limits:
            memory: 2G
    ecommerce-worker:
      deploy:
        resources:
          limits:
            memory: 2G

e0d avatar Mar 09 '23 15:03 e0d

@keithgg would you like to open a PR or should I do it?

@regisb I'll leave this one to you :slightly_smiling_face:. Just a note: in my daemon.json, my limits are 256000 instead of the 64000 in the link. When they were lower, I got Go panics of the form "panic: runtime error: index out of range [159] with length 145" while building images for dev.

{
        "default-ulimits": {
                "nofile": {
                        "Hard": 256000,
                        "Name": "nofile",
                        "Soft": 256000
                }
        }
}

keithgg avatar Mar 10 '23 08:03 keithgg