sriov-network-device-plugin Flag to not advertise NUMA information

Open blackgold opened this issue 3 years ago • 11 comments

What would you like to be added?

Flag to not advertise NUMA information

What is the use case for this feature / enhancement?

The logic that generates placement hints in the Topology Manager is exponential in the number of NUMA nodes. On a machine with 8 NUMA nodes it takes a really long time.
We are not using NUMA information for RDMA, so it would help if the device plugin did not send it (configurable by a flag).

Jan 29 '21 21:01 blackgold
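To make the scaling concern concrete, here is a minimal Go sketch of why hint generation grows exponentially: each candidate NUMA affinity is a non-empty subset of the NUMA nodes, so there are 2^n - 1 of them. This is an illustration only, not the kubelet code; the function name `candidateMasks` is hypothetical.

```go
package main

import "fmt"

// candidateMasks returns every non-empty subset of numaNodes NUMA nodes,
// encoded as a bitmask (bit i set means NUMA node i is part of the affinity).
// There are 2^numaNodes - 1 such subsets.
// Hypothetical helper for illustration; it is not taken from kubelet.
func candidateMasks(numaNodes int) []uint64 {
	total := uint64(1) << uint(numaNodes)
	masks := make([]uint64, 0, total-1)
	for m := uint64(1); m < total; m++ {
		masks = append(masks, m)
	}
	return masks
}

func main() {
	for _, n := range []int{2, 4, 8} {
		fmt.Printf("%d NUMA nodes -> %d candidate masks\n", n, len(candidateMasks(n)))
	}
}
```

For 8 NUMA nodes this yields 255 candidate masks per resource, before any merging across resources takes place.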

@blackgold This is really interesting - have you got numbers for how long the TM calculation is taking? Does it impact your container startup time? It would be really helpful to understand the impact of the Topology calculation.

Jan 31 '21 13:01 killianmuldoon

@blackgold I assume you also have other device plugin instances running in the same cluster that require NUMA advertising, correct? So disabling the NUMA policy in kubelet is not an option here.

Feb 01 '21 01:02 zshi-redhat

> @blackgold This is really interesting - have you got numbers for how long the TM calculation is taking? Does it impact your container startup time? It would be really helpful to understand the impact of the Topology calculation.

It takes more than 20 minutes. The job controller kills jobs that stay in the Pending state for more than 20 minutes after binding to a node. I will try to add some logs in kubelet to time it.

Feb 01 '21 16:02 blackgold

> @blackgold I assume you also have other device plugin instances running in the same cluster that requires NUMA advertising, correct? so disabling NUMA policy in kubelet is not an option here.

Ack. We have a GPU device plugin advertising topology information, so we cannot disable it in kubelet. Jobs requiring fewer than 8 GPUs don't request RDMA resources, so we need it enabled in kubelet for this case.

Feb 01 '21 16:02 blackgold

@blackgold Is this an 8 NUMA zone node? I didn't realize the Topology Manager calculation could take so long - any extra information on the set up and config would be great.

@zshi-redhat this seems like a must-have for these sorts of situations. Do you think it would work as a cmd flag, i.e. daemonset-wide (but not necessarily cluster-wide), or would it be better to have it as a per-pool config (which would allow TM to be active for SRIOV on some pools but not on others)?

Feb 01 '21 17:02 killianmuldoon

Yup, it's an 8 NUMA zone node with 8 GPUs, 8 RDMA devices and 255 CPUs.

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/topologymanager/policy.go#L142 Here the size of allProviderHints is 10x239.

When I timed it, it took 220 seconds to generate the permutations from [8,0] to [8,96], which is about 22944 function calls. 220 seconds seems like a lot for that many function calls. I need to debug more.

Feb 01 '21 18:02 blackgold
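As a rough illustration of why the merge step blows up, the sketch below counts how many combinations are visited when one hint is picked from every entry: the count is the product of the per-entry hint counts. The counts used here are made up for the example, and the helper name `permutationCount` is hypothetical. The sketch also assumes that an entry carrying no topology information contributes a single "don't care" hint, which is my understanding of kubelet's behaviour rather than something confirmed in this thread.

```go
package main

import (
	"fmt"
	"math/big"
)

// permutationCount returns how many combinations a merge step visits when it
// picks exactly one hint from every entry: the product of the hint counts.
// Hypothetical helper for illustration; it is not taken from kubelet.
func permutationCount(hintsPerEntry []int) *big.Int {
	total := big.NewInt(1)
	for _, n := range hintsPerEntry {
		total.Mul(total, big.NewInt(int64(n)))
	}
	return total
}

func main() {
	// Made-up example: three resources, each advertising 255 hints.
	withTopology := []int{255, 255, 255}
	// Same setup, but one resource sends no NUMA information and is
	// assumed to collapse to a single "don't care" hint.
	oneWithout := []int{255, 255, 1}

	fmt.Println("all with topology:", permutationCount(withTopology)) // 16581375
	fmt.Println("one without:      ", permutationCount(oneWithout))   // 65025
}
```

Under these assumptions, dropping topology information for a single resource removes a whole factor from the product, which is the effect the requested flag is after.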

> @zshi-redhat this seems like a must-have for these sorts of situations. Do you think it would work as a cmd flag i.e. daemonset wide(but not necessarily cluster wide) , or would it be better to have it as a per-pool config (would allow TM active for SRIOV on some pools but not on others)

I think having a per-pool config would allow more flexibility and ultimately solve any relevant issues. For example, one device plugin instance could serve several resource pools, with NUMA enabled for some pools but not for others. If we only have a CLI option, the user would need to run multiple instances of the device plugin, each with a different NUMA CLI config.

For this particular case, my understanding is GPU is advertised by a different device plugin (may not be sriov), so having a cli option would be enough.

Feb 02 '21 01:02 zshi-redhat
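A per-pool opt-out along these lines could look roughly like the Go sketch below. Everything in it is an assumption made for illustration: the `excludeTopology` field name, the JSON layout, and the `Device`/`TopologyInfo` types, which are local stand-ins loosely modelled on the device plugin API rather than the real imports.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ResourcePool is a hypothetical per-pool config entry.
// ExcludeTopology is an assumed opt-out switch (default false = advertise NUMA).
type ResourcePool struct {
	ResourceName    string `json:"resourceName"`
	ExcludeTopology bool   `json:"excludeTopology"`
}

// Local stand-ins for the device plugin API types (not the real package).
type NUMANode struct{ ID int64 }
type TopologyInfo struct{ Nodes []NUMANode }
type Device struct {
	ID       string
	Topology *TopologyInfo // nil means "no NUMA information advertised"
}

// parsePools decodes a config document with an assumed "resourceList" key.
func parsePools(raw []byte) ([]ResourcePool, error) {
	var cfg struct {
		ResourceList []ResourcePool `json:"resourceList"`
	}
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	return cfg.ResourceList, nil
}

// buildDevice attaches NUMA topology only when the pool has not opted out.
func buildDevice(pool ResourcePool, id string, numaNode int64) Device {
	dev := Device{ID: id}
	if !pool.ExcludeTopology {
		dev.Topology = &TopologyInfo{Nodes: []NUMANode{{ID: numaNode}}}
	}
	return dev
}

func main() {
	raw := []byte(`{"resourceList": [
		{"resourceName": "rdma_pool", "excludeTopology": true},
		{"resourceName": "sriov_pool"}
	]}`)
	pools, err := parsePools(raw)
	if err != nil {
		panic(err)
	}
	for _, p := range pools {
		dev := buildDevice(p, "0000:3b:00.1", 1)
		fmt.Printf("%s: topology advertised = %v\n", p.ResourceName, dev.Topology != nil)
	}
}
```

With this shape a single plugin instance can serve both pools, which is the flexibility argued for above; a CLI-only switch would apply to every pool the instance manages.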

Was an issue filed against the Topology Manager? Maybe the algorithm can be improved.

Feb 02 '21 07:02 adrianchiris

> was an issue filed against topology manager ? maybe the algorithm can be improved

Not yet. @klueska

If you guys think it's reasonable to control this using a CLI option, I can send out an MR.

Feb 03 '21 01:02 blackgold

> > was an issue filed against topology manager ? maybe the algorithm can be improved
>
> not yet. @klueska
>
> If you guys think its reasonable to control this using a cli option i can send out a mr

I'm fine with using a cli option, this is aligned with the discussion we had in https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/issues/320 and resource mgmt meeting - to have a featureGate for features that may need to be enabled/disabled. I think numa could be one example of such.

/cc @killianmuldoon @ahalim-intel @adrianchiris @martinkennelly

Feb 03 '21 02:02 zshi-redhat
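For the CLI variant, a minimal sketch using Go's standard `flag` package might look like this. The flag name `-advertise-numa` and the `options` struct are hypothetical, chosen only for the example; the default keeps NUMA advertising on so that existing deployments behave as before.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options holds the parsed command-line settings (hypothetical struct).
type options struct {
	advertiseNUMA bool
}

// parseOptions parses args with a dedicated FlagSet so it can be tested
// without touching the process-wide flag state.
func parseOptions(args []string) (options, error) {
	fs := flag.NewFlagSet("device-plugin", flag.ContinueOnError)
	var opts options
	// Hypothetical flag name; defaults to true (advertise NUMA information).
	fs.BoolVar(&opts.advertiseNUMA, "advertise-numa", true,
		"advertise NUMA topology information for devices")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	return opts, nil
}

func main() {
	opts, err := parseOptions(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("advertise NUMA topology: %v\n", opts.advertiseNUMA)
}
```

Running the binary with `-advertise-numa=false` would switch advertising off for the whole plugin instance, which matches the daemonset-wide scope discussed earlier in the thread.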

@zshi-redhat I think a feature gate is a good idea here for sure, but we should think about implementing per-pool NUMA awareness (default on, opt out for a specific pool) for advanced cases where SRIOV topology may not be important (one NIC per node, multi-resource NUMA constraints).

Feb 03 '21 11:02 killianmuldoon
