libct: fix resetting CPU affinity
unix.CPUSet is limited to 1024 CPUs. Calling unix.SchedSetaffinity(pid, cpuset) removes every CPU from 1024 upwards from pid's allowed CPUs, even if cpuset is all ones. Because runc now tries to reset CPU affinity by default, the consequence is that it prevents all containers from using those CPUs.
This change is a quick fix that restores runc's v1.3.0 behavior on systems with more than 1024 CPUs. The real fix requires calling sched_setaffinity with a cpusetsize large enough for all CPUs in the system, which cannot be done with the current unix.SchedSetaffinity.
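To illustrate the limitation, here is a minimal sketch (not part of this change), assuming a machine with more than 1024 CPUs; it only demonstrates that unix.CPUSet cannot represent CPUs >= 1024, so even an "all ones" reset mask excludes them:

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// unix.CPUSet is a fixed-size array; on 64-bit Linux that is 16 uint64
	// words, i.e. 1024 bits, so CPUs >= 1024 cannot be represented at all.
	var set unix.CPUSet
	for cpu := 0; cpu < len(set)*64; cpu++ {
		set.Set(cpu)
	}
	fmt.Println("CPUs representable in unix.CPUSet:", set.Count()) // 1024

	// On a 1200-CPU machine (an assumed example) this "all ones" reset still
	// tells the kernel that CPUs 1024-1199 are not allowed for pid 0, i.e.
	// the calling thread.
	if err := unix.SchedSetaffinity(0, &set); err != nil {
		fmt.Println("sched_setaffinity:", err)
	}
}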
Fixes: #5023
@askervin how about:
if runtime.NumCPU() > 1024 {
	return
}
NumCPU() returns the number of CPUs usable by the current process. The purpose of tryResetCPUAffinity() is to make that number bigger, in case an external entity has made it smaller than the number of enabled CPUs in the whole system. Now assume the external entity has restricted the process to cpuset 1023-1122, so NumCPU() == 100: the logic would proceed to SchedSetaffinity(pid, cpuset(0-1023)) and allow using only one CPU, namely 1023 (see the sketch below).
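To make the arithmetic of that counter-example concrete, a tiny self-contained sketch (pure range arithmetic, no syscalls); the 1023-1122 range and the 1024 threshold are taken from the example above:

package main

import "fmt"

func main() {
	// CPUs the external entity left allowed: 1023-1122.
	allowedLo, allowedHi := 1023, 1122
	numCPU := allowedHi - allowedLo + 1 // what runtime.NumCPU() would report
	fmt.Println("NumCPU():", numCPU, "guard triggers:", numCPU > 1024) // 100 false

	// CPUs requested by the reset: 0-1023, all that unix.CPUSet can express.
	resetLo, resetHi := 0, 1023

	// The process is left with the overlap of the two ranges.
	lo, hi := allowedLo, resetHi
	if resetLo > lo {
		lo = resetLo
	}
	if allowedHi < hi {
		hi = allowedHi
	}
	fmt.Println("CPUs usable after reset:", hi-lo+1, "(only CPU", lo, ")") // 1
}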
Besides, I would avoid adding any magic values (like 1024) to this logic. If, for instance, unix.CPUSet were changed to [64]uint64 instead of the current [16]uint64, the current workaround calling SchedGetaffinity() would start working on 4096-CPU systems; or if unix.SchedGetaffinity/SchedSetaffinity were updated to work with dynamic (large enough) cpuset sizes, then this fix would work as is. With magic numbers we would only have introduced a new place that needs to be fixed at some point.
I'm not sure this completely fixes #5023 -- yes, it stops the reset issue, but it still leaves you with the same problem that #4858 was trying to solve. If you explicitly request CPUs >= 1024, you will still not be able to get them AFAICS, because we still use unix.SchedSetaffinity, which doesn't support CPUs >= 1024 (this appears to be what #5023 is talking about, but the reset case could also cause problems if you are forcing the affinity using cgroups with cpuset).
I think we should just be calling sched_setaffinity directly. We can either call it directly for this one case (i.e., pass an array of 0xFF bytes that is "long enough" -- the current upstream kernel maximum is 8192 CPUs, which would be a 128-element uint64 array), or we can copy the code from golang.org/x/sys and make it use slices instead of a fixed-size array so that we can support larger CPU numbers for the explicit CPU affinity configuration.
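A hedged sketch of what calling sched_setaffinity directly could look like; setAffinityAll is a hypothetical helper (not code from this PR), and the 128-word mask follows the 8192-CPU kernel maximum mentioned above:

package main

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

// setAffinityAll passes an all-ones mask sized for 8192 CPUs (8192/64 = 128
// uint64 words) straight to the raw syscall, so the 1024-bit limit of
// unix.CPUSet does not apply.
func setAffinityAll(pid int) error {
	mask := make([]uint64, 128)
	for i := range mask {
		mask[i] = ^uint64(0)
	}
	_, _, errno := unix.Syscall(
		unix.SYS_SCHED_SETAFFINITY,
		uintptr(pid),
		uintptr(len(mask)*8), // cpusetsize, in bytes
		uintptr(unsafe.Pointer(&mask[0])),
	)
	if errno != 0 {
		return errno
	}
	return nil
}

func main() {
	_ = setAffinityAll(0) // 0 = the calling thread
}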
For what it's worth, I think even the current behaviour of resetting to use the first 1024 CPUs by default is better than regressing #4858.
@cyphar, I think you're right: why make a quick fix when the proper fix is not really that much harder.
Updated. Playing it as safe as the latest Go runtime with regard to the size of the CPU mask.
Hmm, should we try to fix unix.CPUSet instead?
I'm afraid trying to make unix.CPUSet dynamic will break lots of code. At least I can't come up with any nice change that would make it just work without modifications in applications. The code that uses it is already broken on 1024+ CPU systems, but trivial changes that would make it dynamic can easily break it on smaller systems, too.
One possibility could be adding unix.CPUSetS, following the spirit of the C library macros CPU_SET_S, CPU_CLEAR_S, CPU_ZERO_S, ..., which are dynamically sized variants of the simple CPU_SET, CPU_CLEAR, CPU_ZERO, ... macros. Maybe we could implement CPUSetS with an interface as similar to CPUSet as possible, and use it (see the sketch after this comment).
We could perhaps prove it out internally in runc/libcontainer first, and then propose it as unix.CPUSetS once we are happy with it and runc works fine.
How would that sound to you?
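For illustration only, a rough sketch of what a slice-backed CPUSetS could look like; the type, constructor, and method names here are hypothetical and are not taken from the prototype linked below:

package main

import "fmt"

// CPUSetS is a CPU mask whose size is chosen at construction time, so it is
// not capped at 1024 CPUs like the fixed-size unix.CPUSet.
type CPUSetS []uint64

// NewCPUSetS returns a zeroed mask large enough to hold ncpus CPUs.
func NewCPUSetS(ncpus int) CPUSetS {
	return make(CPUSetS, (ncpus+63)/64)
}

// Set marks cpu as a member of the set.
func (s CPUSetS) Set(cpu int) { s[cpu/64] |= 1 << (uint(cpu) % 64) }

// IsSet reports whether cpu is a member of the set.
func (s CPUSetS) IsSet(cpu int) bool { return s[cpu/64]&(1<<(uint(cpu)%64)) != 0 }

// Zero clears the whole set.
func (s CPUSetS) Zero() {
	for i := range s {
		s[i] = 0
	}
}

func main() {
	s := NewCPUSetS(2048) // beyond the 1024-CPU limit of unix.CPUSet
	s.Set(1500)
	fmt.Println(s.IsSet(1500), s.IsSet(1501)) // true false
}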
A prototype of CPUSetS is in the topmost commit (WIP: dynamic cpuset) in this branch: https://github.com/askervin/runc/tree/5eU-cpuset