'unable to get systemd version' when using systemd cgroup driver
Description
We've been using the systemd cgroup driver for years, like so:
Command::new("runc")
    .arg("--systemd-cgroup")
We're experiencing a consistent but low-frequency container startup error, and we'd like guidance on how to eliminate it or be robust to it.
running container: creating container: cannot set up cgroup for root: error parsing systemd version: unable to get systemd version
This is coming from: https://github.com/google/gvisor/blob/973b2f23e56686780f85560c1ec37fe6a0bc4c9e/runsc/cgroup/systemd.go#L268
func systemdVersion(conn *systemdDbus.Conn) (int, error) {
	vStr, err := conn.GetManagerProperty("Version")
	if err != nil {
		return -1, errors.New("unable to get systemd version")
	}
	// ...
}
Based on host metrics observed when this failure happens, it seems related to load on the host, although our standard metrics (CPU, RAM, PSI) look fairly normal.
To make progress on this issue, we're considering:
- Logging the actual err instead of the hardcoded message.
- Adding a fallback to an environment read, or allowing that as an override.
We don't really want to retry the container creation in this situation. We'd prefer a solution that retries internally, if that's necessary.
Steps to reproduce
This is a sporadic error so we can't provide a reproduction.
runsc version
-version
runsc version fb842aab7730
spec: 1.2.0
docker version (if using docker)
uname
Linux ip-10-110-45-137.sa-east-1.compute.internal 5.15.0-309.180.4.el9uek.x86_64 #2 SMP Wed May 21 06:56:22 PDT 2025 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
n/a
repo state (if built from source)
No response
runsc debug logs (if available)
Have you tried adding retries or using an env variable as an override, and had success with this in practice? My guess would be that if this dbus request fails, retries and later dbus requests will fail as well.
My guess would be that if this dbus request fails...
I think that's likely too.
We haven't tried any intervention yet. We can start with:
Logging the actual err instead of the hardcoded message.
Hi! We (@22aronl and I) are students at UT currently taking a virtualization course, and we'd like to take this on.
We want to confirm the plan mentioned in the discussion. We would replace the hardcoded "unable to get systemd version" error with one that wraps the actual underlying D-Bus error, so users can diagnose intermittent failures. Beyond that, we could add an optional retry (with a small, bounded backoff), since host load might correlate with this lookup occasionally failing.
Let us know if this direction aligns with what you expect. If so, I’ll proceed with a patch.
Thanks!