Atop service may not start on high core count boxes

Open ryanbowen opened this issue 1 year ago • 1 comments

This was a fun one to debug... It seems that atop holds roughly 2 * $CPUS + 30 open file descriptors when run as root. When atop runs as a service on high core count boxes, this can push it over systemd's default limit of 1024, which prevents it from starting.

It tends to present as a failure to open /proc/loadavg, which I'm guessing is the first place where a failed open is fatal:

May 10 12:27:01 host01 systemd[1]: Starting Atop advanced performance monitor...
May 10 12:27:01 host01 systemd[1]: Started Atop advanced performance monitor.
May 10 12:27:01 host01 sh[1533798]: can not open /proc/loadavg
May 10 12:27:01 host01 systemd[1]: atop.service: Main process exited, code=exited, status=53/n/a
May 10 12:27:01 host01 systemd[1]: atop.service: Failed with result 'exit-code'.

For reference, on this host it breaks:

root@host01:~# lscpu | grep '^CPU(s)'
CPU(s):              512
root@host01(toa):~# lsof -p `pgrep -x atop` | wc -l
1055

On this one it's fine:

root@host02:~# lscpu | grep '^CPU(s):'
CPU(s):              64
root@host02(psc|qa):~$ lsof -p `pgrep -x atop --newest` | wc -l
159
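
As a rough sanity check of the "2 * $CPUS + 30(ish)" estimate against the two hosts above, here is a minimal Python sketch. The function name `estimated_fds` and the constant names are hypothetical, and the overhead of 30 is the approximate figure from this report, not a value taken from the atop source. Note that `lsof | wc -l` also counts the header line and non-descriptor rows (cwd, txt, mem), so the measured numbers only approximate the descriptor count.

```python
# Illustrative arithmetic only; the 2-per-CPU and ~30 figures come from
# the observation in this issue, not from the atop source code.
DEFAULT_SOFT_LIMIT = 1024  # default limit cited in the report


def estimated_fds(cpus: int, overhead: int = 30) -> int:
    """Estimate atop's open descriptors: two per CPU plus a fixed overhead."""
    return 2 * cpus + overhead


if __name__ == "__main__":
    for host, cpus in (("host01", 512), ("host02", 64)):
        need = estimated_fds(cpus)
        verdict = "exceeds" if need > DEFAULT_SOFT_LIMIT else "fits within"
        print(f"{host}: ~{need} descriptors, {verdict} {DEFAULT_SOFT_LIMIT}")
```

Under this estimate, the 512-CPU host needs about 1054 descriptors and the 64-CPU host about 158, which lines up with the lsof counts of 1055 and 159.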

ryanbowen avatar May 10 '24 22:05 ryanbowen

We can update the systemd configuration in https://github.com/Atoptool/atop/blob/master/atop.service to include:

LimitNOFILE=4096

sreerajkksd avatar May 16 '24 17:05 sreerajkksd

Great debugging! Atop now automatically raises its open-file limit to the value needed for the current number of CPUs. This solution is preferred over setting a fixed number in the atop.service file, which would just introduce another (higher) limit. Besides, a limit in the service file would not solve this issue for an interactive run.
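
For readers curious what "automatically raise the limit" can look like, here is a minimal Python sketch of the general technique using the standard `resource` module. It is not atop's actual implementation (atop is written in C), the helper names `needed_fds` and `raise_nofile_limit` are hypothetical, and the 2-per-CPU plus 30 estimate is the approximate figure from this issue. The sketch only raises the soft limit up to the existing hard limit; raising the hard limit itself would require privileges and is left out.

```python
import os
import resource


def needed_fds(cpus: int, overhead: int = 30) -> int:
    """Approximate descriptor need reported in this issue (an assumption)."""
    return 2 * cpus + overhead


def raise_nofile_limit(needed: int) -> int:
    """Raise the soft RLIMIT_NOFILE toward `needed`; return the new soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited or soft >= needed:
        return soft  # already sufficient, nothing to do
    # The soft limit may be raised up to the hard limit without privileges.
    if hard == unlimited or hard >= needed:
        target = needed
    else:
        target = hard  # best effort: stay within the hard limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    cpus = os.cpu_count() or 1  # cpu_count() may return None
    need = needed_fds(cpus)
    print(f"{cpus} CPUs, need ~{need}, soft limit now {raise_nofile_limit(need)}")
```

Because the limit is adjusted inside the process, the same logic applies whether the program is started by systemd or from an interactive shell, which is the advantage described above.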

Atoptool avatar Jun 15 '24 19:06 Atoptool