
balloons: add support for explicit CPUs

Open · fmuyassarov opened this issue 1 year ago · 9 comments

Introduce a new parameter (preferCpus) in the Balloons policy for specifying a preferred CPU set for each balloon type. When preferIsolCpus is combined with preferCpus and the two sets differ, isolated CPUs will take priority over the user-specified set.
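A hypothetical sketch of what this could look like in the policy configuration. The preferCpus field is the proposed (never merged) parameter, and the exact apiVersion/field names around it are assumptions modeled on the existing balloons policy config style:

```yaml
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: default
spec:
  balloonTypes:
  - name: low-latency
    preferIsolCpus: true
    # Proposed parameter (hypothetical, not implemented):
    # prefer these CPUs when allocating CPUs for balloons of
    # this type. If preferIsolCpus is also set and the sets
    # differ, isolated CPUs would take priority.
    preferCpus: "4-7,12-15"
```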

fmuyassarov avatar Jul 10 '24 09:07 fmuyassarov

> Is this a good idea, or something we both want and need? So far we have been deliberate and careful not to allow explicitly setting CPUs by ID for workloads.

I was expecting this question :smile: To be honest, I don't know whether it is a good or bad idea to allow users to set CPUs explicitly. I just had this task on my list. Perhaps @askervin has something to add.

fmuyassarov avatar Jul 10 '24 10:07 fmuyassarov

So @askervin knows about this and is okay with it? I'll wait for his review then. I'd rather not open this Pandora's box if we can avoid it in any way.

Usually the user does not really want any specific CPU or cpuset by ID(s). Instead, they want their workload to be topologically close to some memory controller or to one or more PCI devices, and that is why they ask for the ability to set specific cpusets for workloads. However, it is almost always better to provide a way for the user/workload to describe what they need ("this workload needs to be close to a PCI device of a particular vendor and class", or "this workload needs to be close to that other workload") and then let the policy in use implement how that is achieved. So, on UI-/UX-visible interfaces (such as workload annotations or policy configuration) we usually prefer things to be described in terms of what is needed, not in terms of how that could be achieved on some particular piece of hardware, especially since the latter might not be portable across cluster nodes. Of course, there are exceptions to every rule...

klihub avatar Jul 10 '24 11:07 klihub

> Is this a good idea or something we both want and need? So far, we have been deliberate and careful to not allow setting explicitly CPUs by ID for workloads.

Based on the requests and wishes seen so far regarding CPU affinity, I'm leaning towards handing the gun to the user with all the big fat warnings on it (many picked from your inputs above). Use this only in case of emergency. You will lose all the nice features. The warranty is void if this seal is broken. Not for production use.

The reason is that I'm sure there will also be special cases in the future where balloons simply is not smart enough out of the box. There could be a special process, feature, or whatnot running/watching/tracing/enabled exactly on a certain core, for instance. Or it might be that (once again) balloons is not smart enough to fulfill a requirement because the precedence of its preferences is not configurable, or something similar. And because there is no easy alternative for handling such special cases, it would be sad if the affinity of a single difficult workload prevented using the balloons policy (and naturally the topology-aware policy, too). For such purposes I'd like users to have this parameter up their sleeve if everything else fails and there is no time to wait until balloons has proper support for their request.

But to be honest, I'm not leaning towards this direction without hesitation. I see it as a risk if anyone builds a production system that uses this option. I also see it as a risk that we might not get a feature request if someone can hack around their problem by using this feature. These are big risks. Maybe we could add to the list of warnings: "never use this parameter without telling the resource policy developers what they should implement in this policy so that you could stop using this parameter".

If you think it better, we can postpone this feature and let @fmuyassarov work on P/E cores, which are definitely worth supporting. And adding that support immediately removes one reason to use preferCpus. :D

askervin avatar Jul 10 '24 17:07 askervin

@askervin If you tell me that we need this, and that we can paint it with big enough "dontcha shoot yourself in the foot" warnings, then I trust your judgement and am fine with it, too.

klihub avatar Jul 12 '24 10:07 klihub

Let's "sleep over" this configuration option. I agree that giving an exact cpuset to the user might be usable in some rare scenarios, but it would also lead to many corner cases: e.g. multiple balloons with overlapping cpusets, or some CPUs being allocated to other balloons in the normal policy way if this balloon can be shrunk to 0, etc. A potential "middle ground" would be an option for a balloon to specify multiple cpusets in priority order, so the normal CPU allocator would use them as hints rather than an explicit list... but again, that might cause many other corner cases. So, let's get back to the "drawing board" with this for the near future?
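A hypothetical sketch of that middle ground; preferredCpuSets is an invented name, and none of this exists in the policy:

```yaml
balloonTypes:
- name: example
  # Hypothetical: cpusets in descending priority order. The
  # normal CPU allocator would treat these as hints, falling
  # back to its usual topology-based choices when the listed
  # sets are exhausted or conflict with other balloons.
  preferredCpuSets:
  - "8-11"
  - "12-15"
```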

kad avatar Jul 12 '24 11:07 kad

I am totally on board with that, if that is good enough for @askervin.

klihub avatar Jul 12 '24 12:07 klihub

Let's mark this as a draft, while we sleep/sit on it.

klihub avatar Jul 16 '24 10:07 klihub

> But to be honest, I'm not leaning towards this direction without hesitation. I see it as a risk if anyone builds a production system that uses this option. I see it also as a risk that we might not get a feature request if someone can hack around their problem by using this feature. These are big risks. Maybe we could add to the list of warnings: "never use this parameter without telling the resource policy developers what they should implement to this policy so that you could stop using this parameter".

Hi, I have a use case where we limit certain non-k8s workloads on our clusters (systemd services like kube-api, kube-scheduler, and other "infra" services) to only use a certain CPU set. The goal is to have "infra" and "workload" completely separated using balloons, and I don't see a way to create that separation without this feature. Would this be something the balloons policy is able to do? Or do you have an idea for a less dangerous feature that would allow us to do this (I don't mind coding it, but I don't have any other idea on how to solve it)?

Optionally, would an "excludeCPUs" option work, listing CPUs to exclude from balloon scheduling for host workloads that are not pods?

tl;dr - We have CPU cores where we don't want workloads running (ideally we would allow certain workloads to run there and disallow others).

P.S. This is my first time commenting; please tell me if this comment doesn't fit here and where I should ask instead.

NoamNakash avatar Jul 17 '24 07:07 NoamNakash

@NoamNakash Sorry for the radio silence; most of the people were, and still are, on vacation, hence the lack of response. Thank you for sharing your use case. I'm still leaning towards having this option available, but I can't decide on my own since this is not my project. I will wait for the others to come back from vacation and decide what to do with this feature and potential user requests like yours.

fmuyassarov avatar Jul 29 '24 06:07 fmuyassarov

@noamnakash, sorry about the delay. After quite a few discussions we ended up with a decision not to implement this feature, at least not at this point. I will elaborate on this in a separate comment.

For your particular use case, that is, separating non-k8s processes onto dedicated CPUs, there is an option called availableResources. It was definitely not easy to find, as it was not documented among the other balloons options, so I added it in PR #355. This option is not a perfect match, as it lists the CPUs available to the policy rather than the CPUs excluded from it.
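A minimal sketch of how availableResources could serve this separation; the apiVersion, cpuset notation, and CPU ranges here are assumptions for illustration, so verify them against the documentation added in PR #355:

```yaml
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: default
spec:
  # Hand only CPUs 4-15 to the balloons policy. CPUs 0-3 stay
  # outside its control, for non-k8s "infra" services such as
  # kube-apiserver and kube-scheduler (pinned there by other
  # means, e.g. systemd CPUAffinity=0-3).
  availableResources:
    cpu: cpuset:4-15
  reservedResources:
    cpu: 750m
```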

Do you still think it works for you?

askervin avatar Aug 23 '24 11:08 askervin

We have decided not to implement this feature for now.

The reason is that preferring explicitly listed CPUs invites users to misuse this policy in non-scalable and error-prone ways that will lead to suboptimal CPU affinity. This option would also break our design principle by turning the policy configuration from descriptive (what kind of CPUs are needed) into imperative (which CPUs are needed).

This said, we acknowledge that the policy will never offer all possible options for telling what kind of CPUs are needed. If you find an option missing, we invite you to share your use case and possibly propose options that would help solve it.

Until the missing option is integrated, or if you wish to use the balloons or the topology-aware policy to manage your Kubernetes workloads except for a few special workloads that you want to run on explicitly listed CPUs, we advise the following workaround.

  1. Do not let the policy manage the explicitly listed CPUs: exclude those CPUs from availableResources in the policy configuration.
  2. Annotate your special workloads so that the policy will not touch their CPU or memory pinning. Use the *.preserve.* annotations, which are honored by both the balloons and topology-aware policies.
  3. Use an NRI plugin that sets the CPU affinity of your special workloads to this special set of CPUs. You can write your own NRI plugin for this, or use the memory-qos plugin to pass cpuset.cpus values, among other unified annotations, directly through to cgroups v2. (Using the memory-qos plugin will not work with cgroups v1.)
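The steps above could be sketched roughly as follows. The CPU ranges are for illustration only, and the exact annotation keys (especially the memory-qos unified annotation) are assumptions; check the nri-plugins documentation for your version before relying on them:

```yaml
# Step 1: keep the special CPUs (here 14-15) out of the
# policy's hands.
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: default
spec:
  availableResources:
    cpu: cpuset:0-13
---
# Steps 2-3: annotate the special pod so the policy preserves
# its pinning, and pin it to the special CPUs via cgroups v2.
apiVersion: v1
kind: Pod
metadata:
  name: special-workload
  annotations:
    # *.preserve.* annotations honored by both policies:
    cpu.preserve.resource-policy.nri.io: "true"
    memory.preserve.resource-policy.nri.io: "true"
    # memory-qos plugin pass-through of a unified cgroup v2
    # value (annotation key sketched, verify against the docs):
    unified.memory-qos.nri.io: |
      cpuset.cpus: 14-15
spec:
  containers:
  - name: app
    image: example.com/app:latest
```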

askervin avatar Aug 26 '24 08:08 askervin