unable to set memory limit to 20971520 (current usage: 21401600, peak usage: 21536768): unknown
Description
With the logic from #3931, bindfd was discarded and memfd adopted. The pod has two containers: a main container and a sidecar. The request for the sidecar container is set to 10MB and the limit to 20MB. When I delete the pod and it is rebuilt, I hit the error shown in the results below.
Steps to reproduce the issue
1. Create a container with a memory limit set to 20MB.
2. Start it using the memfd method.
3. Check the value in memory.usage_in_bytes (a minimal check sketch follows below).
Alternatively, when used with Kubernetes: the pod has two containers, a primary container and a sidecar. The request for the sidecar container is set to 10MB and the limit to 20MB. Delete the pod and wait for it to be rebuilt.
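For step 3, a minimal sketch of the check (assuming cgroup v1 mounted at /sys/fs/cgroup/memory; the container cgroup path below is hypothetical and must be adjusted):
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Hypothetical cgroup v1 path of the container; adjust to your setup.
	cg := "/sys/fs/cgroup/memory/kubepods/pod<uid>/<container-id>"

	// Print the limit, current usage and peak usage, i.e. the same values
	// that appear in the runc error message.
	for _, name := range []string{
		"memory.limit_in_bytes",
		"memory.usage_in_bytes",
		"memory.max_usage_in_bytes",
	} {
		data, err := os.ReadFile(filepath.Join(cg, name))
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s = %s\n", name, strings.TrimSpace(string(data)))
	}
}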
Describe the results you received and expected
Error Log:
Message: failed to create containerd task: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to set memory limit to 20971520 (current usage: 21401600, peak usage: 21536768): unknown
The error message is produced by runc's setMemory:
func setMemory(path string, val int64) error {
	if val == 0 {
		return nil
	}
	err := cgroups.WriteFile(path, cgroupMemoryLimit, strconv.FormatInt(val, 10))
	if !errors.Is(err, unix.EBUSY) {
		return err
	}
	// EBUSY means the kernel can't set new limit as it's too low
	// (lower than the current usage). Return more specific error.
	usage, err := fscommon.GetCgroupParamUint(path, cgroupMemoryUsage)
	if err != nil {
		return err
	}
	max, err := fscommon.GetCgroupParamUint(path, cgroupMemoryMaxUsage)
	if err != nil {
		return err
	}
	return fmt.Errorf("unable to set memory limit to %d (current usage: %d, peak usage: %d)", val, usage, max)
}
The failing Set call is in the procHooks case of runc's init handling:
case procHooks:
	// Setup cgroup before prestart hook, so that the prestart hook could apply cgroup permissions.
	if err := p.manager.Set(p.config.Config.Cgroups.Resources); err != nil {
		return fmt.Errorf("error setting cgroup config for procHooks process: %w", err)
	}
	if p.intelRdtManager != nil {
		if err := p.intelRdtManager.Set(p.config.Config); err != nil {
			return fmt.Errorf("error setting Intel RDT config for procHooks process: %w", err)
		}
	}
	if len(p.config.Config.Hooks) != 0 {
		s, err := p.container.currentOCIState()
		if err != nil {
			return err
		}
		// initProcessStartTime hasn't been set yet.
		s.Pid = p.cmd.Process.Pid
		s.Status = specs.StateCreating
		hooks := p.config.Config.Hooks
		if err := hooks[configs.Prestart].RunHooks(s); err != nil {
			return err
		}
		if err := hooks[configs.CreateRuntime].RunHooks(s); err != nil {
			return err
		}
	}
	// Sync with child.
	if err := writeSync(p.messageSockPair.parent, procResume); err != nil {
		return err
	}
	sentResume = true
I checked move_charge_at_immigrate: it is not enabled, and I am on cgroup v1.
Examining the kernel 4.18 source code:
static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
{
	struct cgroup_subsys_state *css;
	struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */
	struct mem_cgroup *from;
	struct task_struct *leader, *p;
	struct mm_struct *mm;
	unsigned long move_flags;
	int ret = 0;

	/* charge immigration isn't supported on the default hierarchy */
	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return 0;

	/*
	 * Multi-process migrations only happen on the default hierarchy
	 * where charge immigration is not used. Perform charge
	 * immigration if @tset contains a leader and whine if there are
	 * multiple.
	 */
	p = NULL;
	cgroup_taskset_for_each_leader(leader, css, tset) {
		WARN_ON_ONCE(p);
		p = leader;
		memcg = mem_cgroup_from_css(css);
	}
	if (!p)
		return 0;

	/*
	 * We are now commited to this value whatever it is. Changes in this
	 * tunable will only affect upcoming migrations, not the current one.
	 * So we need to save it, and keep it going.
	 */
	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
	if (!move_flags)
		return 0;

	from = mem_cgroup_from_task(p);
	VM_BUG_ON(from == memcg);

	mm = get_task_mm(p);
	if (!mm)
		return 0;
	/* We move charges only when we move a owner of the mm */
	if (mm->owner == p) {
		VM_BUG_ON(mc.from);
		VM_BUG_ON(mc.to);
		VM_BUG_ON(mc.precharge);
		VM_BUG_ON(mc.moved_charge);
		VM_BUG_ON(mc.moved_swap);

		spin_lock(&mc.lock);
		mc.mm = mm;
		mc.from = from;
		mc.to = memcg;
		mc.flags = move_flags;
		spin_unlock(&mc.lock);
		/* We set mc.moving_task later */

		ret = mem_cgroup_precharge_mc(mm);
		if (ret)
			mem_cgroup_clear_mc();
	} else {
		mmput(mm);
	}
	return ret;
}
For cgroup v2 (the default hierarchy), the code returns 0 immediately:
	/* charge immigration isn't supported on the default hierarchy */
	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
		return 0;
If move_charge_at_immigrate is 0, it also returns 0 right away:
	/*
	 * We are now commited to this value whatever it is. Changes in this
	 * tunable will only affect upcoming migrations, not the current one.
	 * So we need to save it, and keep it going.
	 */
	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
	if (!move_flags)
		return 0;
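For reference, a minimal sketch of how this setting can be checked from Go (assuming cgroup v1; the path below is the memory controller root, and a child cgroup works the same way):
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// cgroup v1 memory controller root; replace with a specific cgroup path if needed.
	const f = "/sys/fs/cgroup/memory/memory.move_charge_at_immigrate"

	data, err := os.ReadFile(f)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// "0" means charge migration on attach is disabled: existing charges stay
	// with the old cgroup when a task is moved, matching the kernel code above.
	fmt.Printf("move_charge_at_immigrate = %s\n", strings.TrimSpace(string(data)))
}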
The issue disappears when I switch back to runc 1.1.2 or use the memfd-bind binary.
Question 1: At that time, what was consuming the memory? memfd shouldn't consume the container's memory.
@lifubang @cyphar
What version of runc are you using?
master
Host OS information
NAME="EulerOS" VERSION="2.0 (SP10x86_64)" ID="euleros" VERSION_ID="2.0" PRETTY_NAME="EulerOS 2.0 (SP10x86_64)" ANSI_COLOR="0;31"
Host kernel information
Linux PaaSOM-1 4.18.0-147.5.2.14.h1050.eulerosv2r10.x86_64 #1 SMP Sun Oct 16 18:12:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
At that time, what was consuming the memory?
memfd shouldn't consume the container's memory.
~~I think runc init was consuming the memory.~~
~~After writing the runc binary to the memfd, it used runc init's memory.~~
~~You can test the example in https://man7.org/linux/man-pages/man2/memfd_create.2.html.~~
Sorry, the above description is wrong.
Question 1: At that time, what was consuming the memory?
memfd shouldn't consume the container's memory.
I have done some tests. It may depend on the Go version. If you use Go 1.20, the issue may disappear.
Which version of golang did you use when you saw this problem?
Question 1: At that time, what was consuming the memory?
memfd shouldn't consume the container's memory.
I have done some tests. It may depend on the Go version. If you use Go 1.20, the issue may disappear.
Which version of golang did you use when you saw this problem?
I'm using Go version 1.19.6. Let me switch to version 1.20+ and try again. Thank you.
I switched the Go version to 1.20.7, but the problem still persists.
Warning FailedStart 7s kubelet Error: failed to create containerd task: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error setting cgroup config for procHooks process: unable to set memory limit to 20971520 (current usage: 23396352, peak usage: 23408640): unknown
@lifubang
I'm not sure if the cgroup configuration for procHooks is actually necessary -- we already set up the cgroups far earlier. IIRC the reason for this being added in the first place is that folks wanted to configure cgroups in hooks and we used to do a Set afterwards (which would overwrite those changes), but we removed that because it was completely unnecessary -- and I suspect this one is unnecessary as well (@kolyshkin wdyt?).
What pre-start or create-runtime hooks are you running?
It should be noted that the cgroup configuration didn't fail when the process was first configured, so there's something weird going on here. It should also be noted that the memory limit doesn't make sense -- we are just setting the same limit twice and yet the container is using more memory than the limit? Is the original limit a soft limit somehow?
It should be noted that the cgroup configuration didn't fail when the process was first configured, so there's something weird going on here. It should also be noted that the memory limit doesn't make sense -- we are just setting the same limit twice and yet the container is using more memory than the limit? Is the original limit a soft limit somehow?
How can I check that? I don't think it is a soft limit.
What pre-start or create-runtime hooks are you running?
I checked the config.json; there is a prestart hook:
"hooks": {
"prestart": [{
"path": "/var/lib/docker/hooks/remount_sys.sh",
"args": ["remount_sys.sh"]
}]
},
The script configures devices.allow, and I'm not sure whether that's related to this issue.
@cyphar
I'm not sure if the cgroup configuration for procHooks is actually necessary -- we already set up the cgroups far earlier. IIRC the reason for this being added in the first place is that folks wanted to configure cgroups in hooks and we used to do a Set afterwards (which would overwrite those changes), but we removed that because it was completely unnecessary -- and I suspect this one is unnecessary as well (@kolyshkin wdyt?).
What pre-start or create-runtime hooks are you running?
It should be noted that the cgroup configuration didn't fail when the process was first configured, so there's something weird going on here. It should also be noted that the memory limit doesn't make sense -- we are just setting the same limit twice and yet the container is using more memory than the limit? Is the original limit a soft limit somehow?
It should be noted that the cgroup configuration didn't fail when the process was first configured, so there's something weird going on here. It should also be noted that the memory limit doesn't make sense -- we are just setting the same limit twice and yet the container is using more memory than the limit? Is the original limit a soft limit somehow?
This point also confuses me. How can we confirm whether the original limit is using a soft limit? I think it's highly unlikely that it is.
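One way to check would be to read the soft limit file for the container's cgroup and compare it with the hard limit; a minimal sketch, assuming cgroup v1 and a hypothetical cgroup path:
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// Hypothetical cgroup v1 path of the failing container; adjust as needed.
	const f = "/sys/fs/cgroup/memory/kubepods/pod<uid>/<container-id>/memory.soft_limit_in_bytes"

	data, err := os.ReadFile(f)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// The soft limit defaults to a very large value ("unlimited"). runc writes
	// the hard limit to memory.limit_in_bytes (see setMemory above), so if the
	// 20MB value showed up here instead, it would have been applied as a soft limit.
	fmt.Printf("memory.soft_limit_in_bytes = %s\n", strings.TrimSpace(string(data)))
}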
I'm not sure if the cgroup configuration for procHooks is actually necessary -- we already set up the cgroups far earlier. IIRC the reason for this being added in the first place is that folks wanted to configure cgroups in hooks and we used to do a Set afterwards (which would overwrite those changes), but we removed that because it was completely unnecessary -- and I suspect this one is unnecessary as well (@kolyshkin wdyt?).
The cgroups setup is split between Apply (which no longer does Set, but merely creates a cgroup and adds a pid to it) and Set (which actually sets the limits). From a cursory look at the code, we do need to call Set here.
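Roughly, the split looks like this (a simplified illustrative sketch; the names are not runc's exact API):
// Package sketch illustrates the Apply/Set split described above; it is a
// simplified model, not runc's actual cgroup manager interface.
package sketch

// Resources stands in for the resource limits (e.g. memory) to be applied.
type Resources struct {
	MemoryLimit int64 // bytes; written to memory.limit_in_bytes on cgroup v1
}

// Manager models the split: Apply only creates the cgroup and adds a pid to
// it; Set is what actually writes the limits, and is therefore still needed
// before the prestart/CreateRuntime hooks run.
type Manager interface {
	Apply(pid int) error
	Set(r *Resources) error
}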
OTOH it looks like we call Set (and run CreateRuntime hook) twice in case we're running in host mount namespace, and I can't figure out why. Alas, I see no integration tests related to host mntns. Anyway, this is orthogonal to this issue.
OTOH it looks like we call Set (and run CreateRuntime hook) twice in case we're running in host mount namespace, and I can't figure out why. Alas, I see no integration tests related to host mntns. Anyway, this is orthogonal to this issue.
Addressed by #3996.
Do you need me to provide any additional information?
Well, it is clear what's happening -- higher memory usage due to switching from bindfd to memfd.
This is being addressed in #3987. If you want to use current runc HEAD, the workaround is to raise the memory limits.
Well, it is clear what's happening -- higher memory usage due to switching from bindfd to memfd.
This is being addressed in #3987. If you want to use current runc HEAD, the workaround is to raise the memory limits.
I think there are still some points we haven't analyzed clearly. According to the previous analysis, memfd should only consume the host's memory, not the container's memory. So, in theory, if host memory is sufficient, switching from bindfd to memfd should not cause the container to fail to start.
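To check where the memfd pages actually get charged, here is a small experiment sketch (the cgroup path is hypothetical; run the program from inside the container's cgroup and compare memory.usage_in_bytes before and after the write):
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// readUsage returns the raw contents of a memory.usage_in_bytes file.
func readUsage(path string) string {
	data, err := os.ReadFile(path)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(data)
}

func main() {
	// Hypothetical path; point it at the cgroup this process runs in.
	const usageFile = "/sys/fs/cgroup/memory/<container-cgroup>/memory.usage_in_bytes"

	fmt.Print("before: ", readUsage(usageFile))

	// Create an anonymous memfd and write ~16MB into it, roughly the size of
	// a copy of the runc binary.
	fd, err := unix.MemfdCreate("memfd-charge-test", unix.MFD_CLOEXEC)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	f := os.NewFile(uintptr(fd), "memfd")
	defer f.Close()

	chunk := make([]byte, 1<<20) // 1MB
	for i := 0; i < 16; i++ {
		if _, err := f.Write(chunk); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}

	// memfd pages are tmpfs-backed, so they should be charged to the memory
	// cgroup of the process that writes them; the delta shows whether that is
	// the container's cgroup or the host's.
	fmt.Print("after:  ", readUsage(usageFile))
}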
@113xiaoji #3987 has been merged, could you please test it to see whether you can reproduce your issue in the main branch or not. Thanks.
OK, I will try to reproduce it later.
@113xiaoji #3987 has been merged, could you please test it to see whether you can reproduce your issue in the main branch or not. Thanks.
I have tested it with the latest main branch, without adding the runc_nodmz tag, and the issue wasn't reproduced. However, what still confuses me is why the earlier memfd approach did consume the container's memory.
without adding the runc_nodmz tag, and the issue wasn’t reproduced
✌️ Thanks.
what still confuses me is why the earlier memfd approach did consume the container's memory
Yes, this question still has no answer at this time.