runc icon indicating copy to clipboard operation
runc copied to clipboard

runc running very slow

Open ydcool opened this issue 3 years ago • 5 comments

runc --systemd-cgroup run test takes 10 minutes long to execute, but no error or warning.

here're some system info:

root@compute20:~/dominic/testrunc# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
root@compute20:~/dominic/testrunc#
root@compute20:~/dominic/testrunc# uname -a
Linux compute20 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
root@compute20:~/dominic/testrunc#
root@compute20:~/dominic/testrunc#
root@compute20:~/dominic/testrunc# runc -v
runc version 1.0.0-rc8
commit: 425e105d5a03fabd737a126ad93d62a9eeede87f
spec: 1.0.1-dev
root@compute20:~/dominic/testrunc#

I modified the source code of runc , added some trace codes, and got the go trace file: block

and found the syscall were blocked: image

I'm not familiar with the dbus system, does anybody knows what was happened on my OS? or how to debug more deeply?

ydcool avatar May 02 '22 12:05 ydcool

What you found means runc waits for systemd (to provide a response over dbus, I guess).

Not sure why it's taking so long -- something is wrong with your systemd?

If there's a way to reproduce this from scratch -- please provide detailed steps.

kolyshkin avatar May 03 '22 00:05 kolyshkin

@kolyshkin I haven't found the way to reproduce this, but this is the only server out of dozens in my k8s cluster that occurred twice in the past six months. I'll try to find it out.

ydcool avatar May 04 '22 11:05 ydcool

update: Today it happened again, we found there're too many multipathes (up to 12k) , while the multipathd using too many linux fd. And there're lots of timeout error in dbus journal log. So I think it many be the root cause of slowly runc, but didn't found more direct evidence that the leaked multipathes blocked the dbus.

As for the multipath leak, it's something wrong with the OpenStack running on that host.

root@compute20:~#
root@compute20:~# lsof -n 2>/dev/null |awk '{print $2}'|sort|uniq -c|sort -nr | head -n 10
1040438 147156
 279533 105889
  45087 61523
  39130 57905
  25595 2369946
   9398 61515
   5805 985191
   5805 3424373
   5805 3307072
   5805 3050402
root@compute20:~#
root@compute20:~# ps -ef | grep 147156
root      147156       1  0 Mar08 ?        13:31:31 /sbin/multipathd -d -s
root     3054681 2989740  0 16:08 pts/4    00:00:00 grep --color=auto 147156
root@compute20:~#
root@compute20:~#
root@compute20:~# multipath -ll | wc -l
8965
root@compute20:~#
root@compute20:~#
root@compute20:~# systemctl status dbus
● dbus.service - D-Bus System Message Bus
   Loaded: loaded (/lib/systemd/system/dbus.service; static; vendor preset: enabled)
   Active: active (running) since Tue 2022-03-08 22:19:15 CST; 2 months 5 days ago
     Docs: man:dbus-daemon(1)
 Main PID: 55898 (dbus-daemon)
    Tasks: 1 (limit: 599999)
   CGroup: /system.slice/dbus.service
           └─55898 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only

May 14 16:00:11 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:00:36 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:01:01 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:01:27 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:01:52 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:02:17 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:02:42 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:03:07 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:03:32 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:03:57 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
root@compute20:~#

After we cleaned up all fault multipathes, dbus syscall still blocking.

TBC...

ydcool avatar May 14 '22 08:05 ydcool

Team,

Any progress on this ? We are seeing a similar issue occurring randomly on the production nodes.

Regards, Jas

jasvinder1107 avatar Aug 16 '23 00:08 jasvinder1107

There were similar issues with systemd communication via dbus caused by the way runc was protecting the executable against attacks. #3931 should have fixed this issue -- you can test with that code to see if it resolves the issue.

cyphar avatar Aug 16 '23 06:08 cyphar

There were similar issues with systemd communication via dbus caused by the way runc was protecting the executable against attacks. #3931 should have fixed this issue -- you can test with that code to see if it resolves the issue.

Thank you for your reply, I tried with this fix on another node which has similar issue, and seems works well. I'll close this issue.

ydcool avatar Apr 04 '24 09:04 ydcool