runc running very slow
runc --systemd-cgroup run test takes 10 minutes long to execute, but no error or warning.
here're some system info:
root@compute20:~/dominic/testrunc# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
root@compute20:~/dominic/testrunc#
root@compute20:~/dominic/testrunc# uname -a
Linux compute20 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
root@compute20:~/dominic/testrunc#
root@compute20:~/dominic/testrunc#
root@compute20:~/dominic/testrunc# runc -v
runc version 1.0.0-rc8
commit: 425e105d5a03fabd737a126ad93d62a9eeede87f
spec: 1.0.1-dev
root@compute20:~/dominic/testrunc#
I modified the source code of runc , added some trace codes, and got the go trace file:
and found the syscall were blocked:

I'm not familiar with the dbus system, does anybody knows what was happened on my OS? or how to debug more deeply?
What you found means runc waits for systemd (to provide a response over dbus, I guess).
Not sure why it's taking so long -- something is wrong with your systemd?
If there's a way to reproduce this from scratch -- please provide detailed steps.
@kolyshkin I haven't found the way to reproduce this, but this is the only server out of dozens in my k8s cluster that occurred twice in the past six months. I'll try to find it out.
update: Today it happened again, we found there're too many multipathes (up to 12k) , while the multipathd using too many linux fd. And there're lots of timeout error in dbus journal log. So I think it many be the root cause of slowly runc, but didn't found more direct evidence that the leaked multipathes blocked the dbus.
As for the multipath leak, it's something wrong with the OpenStack running on that host.
root@compute20:~#
root@compute20:~# lsof -n 2>/dev/null |awk '{print $2}'|sort|uniq -c|sort -nr | head -n 10
1040438 147156
279533 105889
45087 61523
39130 57905
25595 2369946
9398 61515
5805 985191
5805 3424373
5805 3307072
5805 3050402
root@compute20:~#
root@compute20:~# ps -ef | grep 147156
root 147156 1 0 Mar08 ? 13:31:31 /sbin/multipathd -d -s
root 3054681 2989740 0 16:08 pts/4 00:00:00 grep --color=auto 147156
root@compute20:~#
root@compute20:~#
root@compute20:~# multipath -ll | wc -l
8965
root@compute20:~#
root@compute20:~#
root@compute20:~# systemctl status dbus
● dbus.service - D-Bus System Message Bus
Loaded: loaded (/lib/systemd/system/dbus.service; static; vendor preset: enabled)
Active: active (running) since Tue 2022-03-08 22:19:15 CST; 2 months 5 days ago
Docs: man:dbus-daemon(1)
Main PID: 55898 (dbus-daemon)
Tasks: 1 (limit: 599999)
CGroup: /system.slice/dbus.service
└─55898 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
May 14 16:00:11 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:00:36 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:01:01 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:01:27 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:01:52 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:02:17 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:02:42 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:03:07 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:03:32 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
May 14 16:03:57 compute20 dbus-daemon[55898]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
root@compute20:~#
After we cleaned up all fault multipathes, dbus syscall still blocking.
TBC...
Team,
Any progress on this ? We are seeing a similar issue occurring randomly on the production nodes.
Regards, Jas
There were similar issues with systemd communication via dbus caused by the way runc was protecting the executable against attacks. #3931 should have fixed this issue -- you can test with that code to see if it resolves the issue.
There were similar issues with systemd communication via dbus caused by the way runc was protecting the executable against attacks. #3931 should have fixed this issue -- you can test with that code to see if it resolves the issue.
Thank you for your reply, I tried with this fix on another node which has similar issue, and seems works well. I'll close this issue.