sriov-network-device-plugin

Flag to not advertise NUMA information

Open blackgold opened this issue 3 years ago • 11 comments

What would you like to be added?

Flag to not advertise NUMA information

What is the use case for this feature / enhancement?

The logic that generates placement hints in the Topology Manager is exponential in the number of NUMA nodes. When we have 8 nodes it takes a really long time.
We are not using NUMA information for RDMA, so it would be helpful if the device plugin did not send it (configurable by a flag).

blackgold avatar Jan 29 '21 21:01 blackgold
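To illustrate why the hint calculation grows so quickly, here is a minimal Go sketch. It assumes that a hint provider may emit one candidate affinity mask per non-empty subset of NUMA nodes for each resource, and that the merge step considers one hint per resource in every combination. The function names `maxHints` and `worstCase` are made up for this sketch and are not kubelet APIs.

```go
package main

import (
	"fmt"
	"math/big"
)

// maxHints returns 2^n - 1, the number of non-empty subsets of n NUMA nodes.
// Assumption: this is an upper bound on the hints emitted for one resource.
func maxHints(numaNodes uint) uint64 {
	return (uint64(1) << numaNodes) - 1
}

// worstCase returns maxHints(numaNodes)^resources, the number of ways to pick
// one hint per resource when every resource has the maximum number of hints.
// big.Int is used because the value overflows uint64 quickly.
func worstCase(numaNodes uint, resources int64) *big.Int {
	base := new(big.Int).SetUint64(maxHints(numaNodes))
	return new(big.Int).Exp(base, big.NewInt(resources), nil)
}

func main() {
	for _, n := range []uint{1, 2, 4, 8} {
		fmt.Printf("%d NUMA nodes: up to %d masks per resource, %s combinations for 3 resources\n",
			n, maxHints(n), worstCase(n, 3).String())
	}
}
```

Every resource that stops advertising NUMA information removes one factor from that product, which is why an opt-out in the device plugin could help here.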

@blackgold This is really interesting - have you got numbers for how long the TM calculation is taking? Does it impact your container startup time? It would be really helpful to understand the impact of the Topology calculation.

killianmuldoon avatar Jan 31 '21 13:01 killianmuldoon

@blackgold I assume you also have other device plugin instances running in the same cluster that require NUMA advertising, correct? So disabling the NUMA policy in kubelet is not an option here.

zshi-redhat avatar Feb 01 '21 01:02 zshi-redhat

@blackgold This is really interesting - have you got numbers for how long the TM calculation is taking? Does it impact your container startup time? It would be really helpful to understand the impact of the Topology calculation.

It takes more than 20 minutes. The job controller kills jobs that stay in the pending state for more than 20 minutes after binding to a node. I will try to add some logs in kubelet to time it.

blackgold avatar Feb 01 '21 16:02 blackgold

@blackgold I assume you also have other device plugin instances running in the same cluster that require NUMA advertising, correct? So disabling the NUMA policy in kubelet is not an option here.

Ack. We have a GPU device plugin advertising topology information, so we cannot disable it in kubelet. Jobs requiring fewer than 8 GPUs don't request RDMA resources, so we need it enabled in kubelet for this case.

blackgold avatar Feb 01 '21 16:02 blackgold

@blackgold Is this an 8 NUMA zone node? I didn't realize the Topology Manager calculation could take so long - any extra information on the set up and config would be great.

@zshi-redhat this seems like a must-have for these sorts of situations. Do you think it would work as a cmd flag, i.e. daemonset wide (but not necessarily cluster wide), or would it be better to have it as a per-pool config (which would allow TM to be active for SRIOV on some pools but not on others)?

killianmuldoon avatar Feb 01 '21 17:02 killianmuldoon

Yup, it's an 8 NUMA zone node with 8 GPUs, 8 RDMA devices and 255 CPUs.

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/topologymanager/policy.go#L142 Here the size of allProviderHints is 10x239.

When I timed it, it took 220 seconds to generate the permutations from [8,0] to [8,96], which is roughly 22944 function calls. 220 seconds seems like a lot for that many function calls. I need to debug more.

blackgold avatar Feb 01 '21 18:02 blackgold
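The permutation step being timed above can be pictured with the following simplified Go sketch. It is not the kubelet code: `hint`, `iterate`, `merge` and `countPermutations` are hypothetical names, and the assumption is that the merge picks one hint per resource and combines the picked affinity masks with a bitwise AND.

```go
package main

import "fmt"

// hint is a simplified stand-in for a topology hint: bit i set means NUMA node i.
type hint uint64

// iterate calls visit once for every way of picking one hint from each list.
// visit must not retain the slice, because its backing array is reused.
func iterate(lists [][]hint, picked []hint, visit func([]hint)) {
	if len(lists) == 0 {
		visit(picked)
		return
	}
	for _, h := range lists[0] {
		iterate(lists[1:], append(picked, h), visit)
	}
}

// merge ANDs the picked masks together. A zero result means that no NUMA
// node satisfies every picked hint.
func merge(picked []hint) hint {
	m := ^hint(0)
	for _, h := range picked {
		m &= h
	}
	return m
}

// countPermutations reports how many permutations were visited and how many
// of them merged to a non-empty mask.
func countPermutations(lists [][]hint) (total, feasible int) {
	iterate(lists, nil, func(p []hint) {
		total++
		if merge(p) != 0 {
			feasible++
		}
	})
	return total, feasible
}

func main() {
	// Three resources on a 2-node machine; masks are written in hex.
	lists := [][]hint{
		{0x1, 0x2, 0x3},
		{0x1, 0x3},
		{0x2, 0x3},
	}
	total, feasible := countPermutations(lists)
	fmt.Printf("visited %d permutations, %d with a non-empty merged mask\n", total, feasible)
}
```

The number of visits is the product of the list lengths, so lists with a couple of hundred hints each multiply out very quickly.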

@zshi-redhat this seems like a must-have for these sorts of situations. Do you think it would work as a cmd flag, i.e. daemonset wide (but not necessarily cluster wide), or would it be better to have it as a per-pool config (which would allow TM to be active for SRIOV on some pools but not on others)?

I think having a per-pool config would allow more flexibility and ultimately solve any relevant issues. For example, running one device plugin instance would be possible for several resource pools, with NUMA enabled for some pools but not the others. If we only have a CLI option, then the user would need to run multiple instances of the device plugin, each using a different NUMA CLI config.

For this particular case, my understanding is GPU is advertised by a different device plugin (may not be sriov), so having a cli option would be enough.

zshi-redhat avatar Feb 02 '21 01:02 zshi-redhat
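As a rough sketch of the per-pool idea discussed above, the Go snippet below shows one way a pool-level opt-out could work. Everything in it is an assumption made for illustration: the `excludeTopology` field name, the `resourcePool` and `device` types, and the `advertise` helper are hypothetical and are not taken from the plugin's code or config schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resourcePool is a hypothetical per-pool config entry.
// excludeTopology is an assumed field name; it defaults to false (NUMA on).
type resourcePool struct {
	ResourceName    string `json:"resourceName"`
	ExcludeTopology bool   `json:"excludeTopology"`
}

// device is a simplified stand-in for what a device plugin reports to kubelet.
// An empty NUMANodes slice means no topology information is advertised.
type device struct {
	ID        string
	NUMANodes []int64
}

// advertise builds the device entry for a pool, dropping the NUMA node when
// the pool opts out or when the node is unknown (negative).
func advertise(pool resourcePool, id string, numaNode int64) device {
	d := device{ID: id}
	if !pool.ExcludeTopology && numaNode >= 0 {
		d.NUMANodes = []int64{numaNode}
	}
	return d
}

// parsePools decodes a JSON list of pool entries.
func parsePools(raw []byte) ([]resourcePool, error) {
	var pools []resourcePool
	err := json.Unmarshal(raw, &pools)
	return pools, err
}

func main() {
	raw := []byte(`[
		{"resourceName": "rdma_pool", "excludeTopology": true},
		{"resourceName": "dpdk_pool"}
	]`)
	pools, err := parsePools(raw)
	if err != nil {
		panic(err)
	}
	for _, p := range pools {
		fmt.Printf("%s -> %+v\n", p.ResourceName, advertise(p, "0000:3b:00.1", 1))
	}
}
```

With this shape a single plugin instance can serve both pools, whereas a process-wide CLI flag would apply the same choice to every pool that instance manages.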

Was an issue filed against the Topology Manager? Maybe the algorithm can be improved.

adrianchiris avatar Feb 02 '21 07:02 adrianchiris

Was an issue filed against the Topology Manager? Maybe the algorithm can be improved.

Not yet. @klueska

If you guys think it's reasonable to control this using a CLI option, I can send out an MR.

blackgold avatar Feb 03 '21 01:02 blackgold

Was an issue filed against the Topology Manager? Maybe the algorithm can be improved.

Not yet. @klueska

If you guys think it's reasonable to control this using a CLI option, I can send out an MR.

I'm fine with using a CLI option. This is aligned with the discussion we had in https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/issues/320 and in the resource mgmt meeting: to have a featureGate for features that may need to be enabled/disabled. I think NUMA could be one example of such a feature.

/cc @killianmuldoon @ahalim-intel @adrianchiris @martinkennelly

zshi-redhat avatar Feb 03 '21 02:02 zshi-redhat

@zshi-redhat I think a feature gate is a good idea here for sure, but we should think about implementing per-pool NUMA awareness (default on, opt out for a specific pool) for advanced cases where SRIOV topology may not be important (one NIC per node, multi-resource NUMA constraints).

killianmuldoon avatar Feb 03 '21 11:02 killianmuldoon