When systemd is busy, runc init hangs and cannot exit, eventually ending up in the D state.
Description
When a single node concurrently deploys more than 100 containers, the call chain is kubelet -> containerd -> containerd-shim-runc-v2 -> runc create -> runc init -> dbus -> systemd -> cgroup. However, systemd is single-threaded; when it is busy it continuously occupies one core, so the runc main process cannot get a response in a timely manner. Because there is no timeout mechanism, this ends in a deadlock. The related flame graph is as follows:
As shown, the main goroutine is blocked waiting for a message from a channel, while another goroutine, blocked in ReadMsgUnix, never receives a response, so the main goroutine blocks indefinitely.
The primary issue is that runc init, as a client, has no timeout mechanism. A potential solution is to introduce one, so that runc can actively disconnect in abnormal scenarios such as systemd being overly busy. Concretely, we could give the startUnit function in libcontainer/cgroups/systemd/common.go a context with a timeout. Currently startUnit uses context.TODO(), which never expires; replacing it with a context created by context.WithTimeout(context.Background(), 30*time.Second) would ensure that runc does not wait indefinitely for an unresponsive component.
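For illustration, a minimal sketch of what that could look like (the helper name here is hypothetical and the 30-second value is only a suggestion; the maintainer's patch against main is quoted further down the thread):

package systemd

import (
	"context"
	"time"

	systemdDbus "github.com/coreos/go-systemd/v22/dbus"
)

// startTransientUnitWithTimeout is a hypothetical helper showing the proposed
// fix: give the D-Bus call a deadline so runc fails fast instead of blocking
// forever when systemd is too busy to answer.
func startTransientUnitWithTimeout(c *systemdDbus.Conn, unitName string,
	properties []systemdDbus.Property, statusChan chan string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel() // release the timer as soon as the call returns

	_, err := c.StartTransientUnitContext(ctx, unitName, "replace", properties, statusChan)
	return err
}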
Steps to reproduce the issue
- a single node deploys more than 100 pods at the same time with Kubernetes
Describe the results you received and expected
- Received: runc init hangs, cannot exit, and eventually ends up in a D (uninterruptible sleep) state.
- Expected: the process does not end up in a D state.
What version of runc are you using?
1.1.2
Host OS information
x86
Host kernel information
Linux master1 5.10.0-60.18.0.50.h665.eulerosv2r11.x86_64 #1 SMP Fri Dec 23 16:12:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
runc init is there for a reason -- you can use runc create and runc start separately, and in between these two invocations runc init is the process that keeps the created container in place.
OTOH it makes sense to pass a context with timeout to c.StartTransientUnitContext.
Can you check if something like this fixes your issue? The patch is against runc main branch.
diff --git a/libcontainer/cgroups/systemd/common.go b/libcontainer/cgroups/systemd/common.go
index c577a22b..c970e941 100644
--- a/libcontainer/cgroups/systemd/common.go
+++ b/libcontainer/cgroups/systemd/common.go
@@ -130,7 +130,9 @@ func startUnit(cm *dbusConnManager, unitName string, properties []systemdDbus.Pr
 retry:
 	err := cm.retryOnDisconnect(func(c *systemdDbus.Conn) error {
-		_, err := c.StartTransientUnitContext(context.TODO(), unitName, "replace", properties, statusChan)
+		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+		defer cancel()
+		_, err := c.StartTransientUnitContext(ctx, unitName, "replace", properties, statusChan)
 		return err
 	})
 	if err != nil {
@kolyshkin
After studying the source code, I found that with cgroup v2 the Set method no longer needs to freeze and thaw. However, even on cgroup v2 some code paths still freeze and thaw: for instance, signalAllProcesses freezes and thaws regardless of whether cgroup v1 or v2 is in use.

I'm considering switching to cgroup v2 and dropping all freeze/thaw logic. The motivation is an extreme scenario: if a runc task freezes runc init and then the runc task itself hangs, nothing is left to thaw runc init. runc init then ends up in the D state and, since it gets reparented to systemd, becomes an orphan process. If I stop executing all freeze/thaw logic on cgroup v2, except when a freeze state is explicitly requested on the command line, would that approach cause any issues? (A rough sketch of this idea follows the code listings below.)
libcontainer/cgroups/systemd/v2.go
func (m *UnifiedManager) Set(r *configs.Resources) error {
	if r == nil {
		return nil
	}
	properties, err := genV2ResourcesProperties(m.fsMgr.Path(""), r, m.dbus)
	if err != nil {
		return err
	}
	if err := setUnitProperties(m.dbus, getUnitName(m.cgroups), properties...); err != nil {
		return fmt.Errorf("unable to set unit properties: %w", err)
	}
	return m.fsMgr.Set(r)
}
libcontainer/init_linux.go
// signalAllProcesses freezes then iterates over all the processes inside the
// manager's cgroups sending the signal s to them.
// If s is SIGKILL then it will wait for each process to exit.
// For all other signals it will check if the process is ready to report its
// exit status and only if it is will a wait be performed.
func signalAllProcesses(m cgroups.Manager, s os.Signal) error {
	var procs []*os.Process
	if err := m.Freeze(configs.Frozen); err != nil {
		logrus.Warn(err)
	}
	pids, err := m.GetAllPids()
	if err != nil {
		if err := m.Freeze(configs.Thawed); err != nil {
			logrus.Warn(err)
		}
		return err
	}
	for _, pid := range pids {
		p, err := os.FindProcess(pid)
		if err != nil {
			logrus.Warn(err)
			continue
		}
		procs = append(procs, p)
		if err := p.Signal(s); err != nil {
			logrus.Warn(err)
		}
	}
	if err := m.Freeze(configs.Thawed); err != nil {
		logrus.Warn(err)
	}

	subreaper, err := system.GetSubreaper()
	if err != nil {
		// The error here means that PR_GET_CHILD_SUBREAPER is not
		// supported because this code might run on a kernel older
		// than 3.4. We don't want to throw an error in that case,
		// and we simplify things, considering there is no subreaper
		// set.
		subreaper = 0
	}

	for _, p := range procs {
		if s != unix.SIGKILL {
			if ok, err := isWaitable(p.Pid); err != nil {
				if !errors.Is(err, unix.ECHILD) {
					logrus.Warn("signalAllProcesses: ", p.Pid, err)
				}
				continue
			} else if !ok {
				// Not ready to report so don't wait
				continue
			}
		}

		// In case a subreaper has been setup, this code must not
		// wait for the process. Otherwise, we cannot be sure the
		// current process will be reaped by the subreaper, while
		// the subreaper might be waiting for this process in order
		// to retrieve its exit code.
		if subreaper == 0 {
			if _, err := p.Wait(); err != nil {
				if !errors.Is(err, unix.ECHILD) {
					logrus.Warn("wait: ", err)
				}
			}
		}
	}
	return nil
}
In the signalAllProcesses method, it's unclear why there is a need for freeze and thaw operations. In most scenarios, the invocations of m.GetAllPids() and p.Signal(s) don't require freeze and thaw. So why does this particular method demand a freeze operation?
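For reference, here is a rough sketch of the idea floated above (skipping freeze/thaw entirely on cgroup v2). This is not how runc behaves today; it reuses the imports from the listing above (cgroups, configs, logrus, os), and the waiting logic of the original function is omitted for brevity:

// signalAllProcessesNoFreezeSketch is a hypothetical variant that only
// freezes/thaws around the signal loop on cgroup v1 and skips it entirely
// on cgroup v2 (unless a frozen state was explicitly requested elsewhere).
func signalAllProcessesNoFreezeSketch(m cgroups.Manager, s os.Signal) error {
	useFreeze := !cgroups.IsCgroup2UnifiedMode()

	if useFreeze {
		if err := m.Freeze(configs.Frozen); err != nil {
			logrus.Warn(err)
		}
		// Always thaw again, even if we bail out early below.
		defer func() {
			if err := m.Freeze(configs.Thawed); err != nil {
				logrus.Warn(err)
			}
		}()
	}

	pids, err := m.GetAllPids()
	if err != nil {
		return err
	}
	for _, pid := range pids {
		p, err := os.FindProcess(pid)
		if err != nil {
			logrus.Warn(err)
			continue
		}
		if err := p.Signal(s); err != nil {
			logrus.Warn(err)
		}
	}
	return nil
}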
@113xiaoji I'm sorry, we were discussing the runc hang caused by non-responding systemd in this issue, and now it's about freezing. Maybe you added a comment to the wrong issue, and in fact you mean https://github.com/opencontainers/runc/issues/3803? Please clarify
As for using signalAllProcesses, its logic (and the logic of its users) was changed in https://github.com/opencontainers/runc/pull/3825.
@kolyshkin Thank you for your response. In this issue we are focusing on the runc hang caused by systemd not responding. I believe every API that talks to systemd, such as startUnit, stopUnit, resetFailedUnit, getUnitTypeProperty, and setUnitProperties, should get a timeout, because when systemd is busy we cannot predict at which step runc will hang. In our scenario a large number of Pods are being deployed, and since systemd watches /proc/self/mountinfo and resynchronizes on every change, a large number of mount points makes systemd extremely busy. Related issues: https://github.com/opencontainers/runc/issues/2532 and https://github.com/systemd/systemd/issues/15464.
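Purely as an illustration, a minimal sketch of one way to do that uniformly (withDbusTimeout and the 30-second constant are hypothetical names, not existing runc code); each call site would then pass the deadline-bound ctx to the corresponding *Context method on the go-systemd connection, such as StartTransientUnitContext or StopUnitContext:

package systemd

import (
	"context"
	"time"
)

// systemdDbusTimeout bounds how long we wait for any reply from systemd.
const systemdDbusTimeout = 30 * time.Second

// withDbusTimeout runs fn with a context that expires after systemdDbusTimeout,
// so a busy or unresponsive systemd makes the call fail instead of hanging runc.
func withDbusTimeout(fn func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), systemdDbusTimeout)
	defer cancel()
	return fn(ctx)
}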
I will discuss the issue of runc init becoming an orphan process in #3803 and #3825.
I wouldn't mind adding timeouts, but note that this will not make containers start without issue -- it will make containers fail to start in cases where we need to talk to systemd to start them. In the case of Kubernetes, I would expect Kubernetes to try to start the container again, which would probably not help with the system-load issue. Note also that the original issue that caused the systemd busyness (#2532) has been fixed.