runc icon indicating copy to clipboard operation
runc copied to clipboard

Set temporary single CPU affinity before cgroup cpuset transition.

Open cclerget opened this issue 2 years ago • 23 comments

This handles a corner case when joining a container having all the processes running exclusively on isolated CPU cores to force the kernel to schedule runc process on the first CPU core within the cgroups cpuset.

The introduction of the kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 has affected this deterministic scheduling behavior by distributing tasks across CPU cores within the cgroups cpuset. Some intensive real-time application are relying on this deterministic behavior and use the first CPU core to run a slow thread while other CPU cores are fully used by real-time threads with SCHED_FIFO policy. Such applications prevents runc process from joining a container when the runc process is randomly scheduled on a CPU core owned by a real-time thread.

  • Fixes #3922

cclerget avatar Jun 30 '23 13:06 cclerget

Thanks for working on this. I changed it to draft until all the issues with the code and test cases are fixed, and left some minor comments.

kolyshkin avatar Jun 30 '23 20:06 kolyshkin

Thanks @kolyshkin , it should be ready for review now

cclerget avatar Jul 03 '23 14:07 cclerget

@cclerget Please rebase

lifubang avatar Jul 07 '23 09:07 lifubang

@lifubang Done

cclerget avatar Jul 11 '23 08:07 cclerget

@cclerget I think this patch is helpful to solve the issue #3922 .

I also think that the existence of this patch has its meaning. Maybe in most situation, people may need the behavior of this kernel's patch, but after your patch, people have no chance to let the kernel distribute tasks to different cpus, because your patch masks the kernel's patch.

lifubang avatar Jul 16 '23 00:07 lifubang

@lifubang I understand your concerns, what if I restrict the scope of the patch by calling getEligibleCPU only when kernel.sched_rt_runtime_us has -1 value ? When set to -1, it gives zero room for kernel to switch in kernel space on a CPU where a real-time application is running on (under scheduler real-time policies) and do not make any context switch.

This setting value is part of the problem that I'm attempting to address, when all tasks (except the slow-thread) are running under a scheduler real-time policy within a cgroup cpuset, the kernel patch will arbitrary pin runc process to one of the CPU during cgroup transition, if runc is not pinned to the CPU running the slow-thread (usually running on the first CPU of the cpuset for a deterministic placement), runc will get stuck and won't be able to enter in container with the side effect of locking down the system by acquiring a lock in cgroup subsystem during transition, the lock is not released until the runc process has been placed in scheduler CPU queue, which do not happen in this case until the real-time thread exit or make a context switch (not the case).

So it might make sense to also check this value and only call getEligibleCPU for this specific case. This should drastically reduce the scope of the patch, as many setup won't use this setting, and people using it are aware of consequences, so I think this patch will be required to restore the previous deterministic task placement of the kernel in such setup where this kernel patch is not appropriate, it is useful in most situations but not this one where random task placement has bad consequences

cclerget avatar Jul 16 '23 04:07 cclerget

what if I restrict the scope of the patch by calling getEligibleCPU only when kernel.sched_rt_runtime_us has -1 value ?

I think it's a good idea.

lifubang avatar Jul 16 '23 05:07 lifubang

Changes applied

cclerget avatar Jul 16 '23 06:07 cclerget

Changes applied

@cclerget Thanks. But sorry, you should rebase to resolve conflicts.

lifubang avatar Jul 17 '23 14:07 lifubang

@lifubang done

cclerget avatar Jul 17 '23 16:07 cclerget

@lifubang I realized that I forgot to remove the print line at https://github.com/opencontainers/runc/commit/a3c5576990cd33c7fc0fa6500f30c771af8ff501#diff-1cdede3fe918430ca0b7e4056abc49abca8a2925c6770ec1951befa6091f7a94R2386, I just removed it with this change https://github.com/opencontainers/runc/compare/a3c5576990cd33c7fc0fa6500f30c771af8ff501..00b21903d5c44c0b5220ca95375f675a2a08891a, sorry about that

cclerget avatar Jul 19 '23 07:07 cclerget

Thanks, @kolyshkin PTAL

lifubang avatar Jul 19 '23 09:07 lifubang

Need rebase @cclerget

lifubang avatar Aug 05 '23 09:08 lifubang

@opencontainers/runc-maintainers PTAL @cclerget has rebased many times. I think this patch is very important for some real-time nodes.

lifubang avatar Aug 05 '23 09:08 lifubang

@cclerget I have no time to dive into the code design, do you know there is a way to move all your code from c to golang?

lifubang avatar Sep 07 '23 01:09 lifubang

@lifubang Will look if I can move that into setnsProcess.start

cclerget avatar Sep 07 '23 13:09 cclerget

@lifubang I pushed change for a go only version https://github.com/opencontainers/runc/pull/3923/commits/06e7e3ba76fdf37f83d9e9e5e15f8bebcadbf8cb, I left previous commit just in case, notable change is to set CPU affinity right before executing the runc init process and use a "garbage" os thread for that purpose to simplify the logic

cclerget avatar Sep 07 '23 16:09 cclerget

Thanks, overall still LGTM. @cclerget Could you please add an integration test to read and check cpuAffinity value in the container with some mock data in https://github.com/opencontainers/runc/blob/main/libcontainer/integration/exec_test.go? @AkihiroSuda Could you review this solution to check whether this is still LGTY or not before @cclerget squashing all the commits.

lifubang avatar Sep 09 '23 00:09 lifubang

Can you explain more what does "temporary" mean here ? If we called sched_setaffinity manually to the first CPU i suppose we are working around the random distribution of tasks and the affinity to first CPU is permanent, am i understanding this correctly ?

dqminh avatar Sep 11 '23 14:09 dqminh

@lifubang

Could you please add an integration test to read and check cpuAffinity value in the container with some mock data in https://github.com/opencontainers/runc/blob/main/libcontainer/integration/exec_test.go?

Now the question is, do you just want the case where there is no eligible CPU to ensure to not mess up with affinity for most case, or do you also want to test the case for an eligible CPU by mocking /sys/devices/system/cpu/isolated and /proc/sys/kernel/sched_rt_runtime_us ? For the latter case I could add an fs.FS to libcontainer.Process, if it sounds reasonable ?

@dqminh The call to sched_setaffinity just set the CPU affinity mask for a short period hence "temporary", when the runc process is placed in the container cgroup it is joining, cgroup set (or reset) the CPU affinity mask for the process to cpuset.cpus value, the process still run on the first CPU set until the kernel scheduler decide to move it to another CPU of cpuset.cpus list if necessary and possible, this restores the same behavior as before the kernel patch.

cclerget avatar Sep 12 '23 06:09 cclerget

For the latter case I could add an fs.FS to libcontainer.Process, if it sounds reasonable ?

I have a strange idea:

  1. mount -o bind ./sched_rt_runtime_us /proc/sys/kernel/sched_rt_runtime_us
  2. mount -o bind ./isolated /sys/devices/system/cpu/isolated
  3. integration test
  4. umount /proc/sys/kernel/sched_rt_runtime_us
  5. umount /sys/devices/system/cpu/isolated

lifubang avatar Sep 13 '23 08:09 lifubang

@lifubang the hacky way yeah, that should do it, hoping to not interfere with other tests running in parallel in the future, there is few chance but we never know

cclerget avatar Sep 13 '23 10:09 cclerget

One other observation -- you add some code to the first commit, when remove most of it in the second one. Perhaps you can rework this to avoid doing that, as otherwise it's hard to review commit by commit.

Sorry about that, so I squashed the two commits and I'm pushing separate commits for each review rounds.

An additional note, my PR includes a similar fix to https://github.com/opencontainers/runc/pull/4041 which should work for all kernel, I don't know when the patch referred here to reset the CPU affinity is going to be merged in kernel upstream, but even if it is, it won't fill the gap for a range of kernel version since 6.2

cclerget avatar Feb 16 '24 12:02 cclerget

Hello @kolyshkin @lifubang @AkihiroSuda , any chance to get it reviewed please ?

cclerget avatar Mar 13 '24 14:03 cclerget

@AkihiroSuda : Any idea when this PR will be merged and released to use it ?

arun2arunraj avatar Mar 13 '24 14:03 arun2arunraj

This PR is still marked as a draft, so this is not considered ready to be merged

AkihiroSuda avatar Mar 13 '24 15:03 AkihiroSuda

One other observation -- you add some code to the first commit, when remove most of it in the second one. Perhaps you can rework this to avoid doing that, as otherwise it's hard to review commit by commit.

@kolyshkin @AkihiroSuda @cclerget Sorry for delay reply, this is becuase https://github.com/opencontainers/runc/pull/3923#issuecomment-1709337514 . @cclerget wanted us to choose which one is better? C or Go?

lifubang avatar Mar 16 '24 01:03 lifubang

Hello @kolyshkin @lifubang @AkihiroSuda, I marked the PR as ready for review, the code is actually well covered by various tests, the feature is disabled by default and can be enabled only with this new kernel param such that it doesn't impact the current behavior.

I would also like to reiterate, there is also another important fix similar to https://github.com/opencontainers/runc/pull/4041 but covering all kernel since kernel 6.2. To give you an idea of the issue, you can add this test to the main branch and run the tests on RHEL 9 or on a system with a kernel >= 6.2 and see the container process assigned to the first CPU only instead of all when runc is executed under specific CPU affinity.

I hope we could move forward quickly on this PR once you are all agree.

cclerget avatar Mar 20 '24 11:03 cclerget

Please squash the commits when this PR is ready to be merged

AkihiroSuda avatar Mar 27 '24 05:03 AkihiroSuda

Ok I've pushed the requested changes, I haven't applied changes related to your comments https://github.com/opencontainers/runc/pull/3923#pullrequestreview-1962335750 and https://github.com/opencontainers/runc/pull/3923#pullrequestreview-1962330692, please let me know if you disagree with my comments, otherwise I'll go ahead and squash commits

cclerget avatar Mar 27 '24 10:03 cclerget