
Support for allocating all VFs from a single PF (bin packing)

Open sseetharaman6 opened this issue 4 years ago • 15 comments

What would you like to be added?

If I have multiple PFs configured for SR-IOV and advertised as the same resource pool (sriov_foo), is it possible to enforce allocation of all VFs from a single PF before VFs from other PFs are allocated? It seems like pluginapi.AllocateRequest is picking device IDs at random, so I am not sure if this is possible / can be supported.

What is the use case for this feature / enhancement?

sseetharaman6 avatar Jul 21 '20 06:07 sseetharaman6

@sseetharaman6 You're right that kubelet randomly chooses healthy devices from the advertised pool (sriov_foo), so if the VFs from all PFs are grouped as one pool, there is no guarantee which PF an allocated VF comes from. You might want to group the VFs from a single PF as one pool and request devices directly from that pool.

zshi-redhat avatar Jul 21 '20 06:07 zshi-redhat
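As a concrete illustration of the one-pool-per-PF suggestion above, the device plugin's config map could select each PF into its own resource via the `pfNames` selector (the interface names `ens1f0`/`ens1f1` and the resource names are examples, not part of anyone's actual setup):

```json
{
  "resourceList": [
    {
      "resourceName": "sriov_foo_pf0",
      "selectors": { "pfNames": ["ens1f0"] }
    },
    {
      "resourceName": "sriov_foo_pf1",
      "selectors": { "pfNames": ["ens1f1"] }
    }
  ]
}
```

A pod can then request `sriov_foo_pf0` directly and is guaranteed to get VFs from that PF only.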

Yes, but say I have 2 VFs per PF and request 3 VFs in the pod spec; advertising each PF as its own resource will make this pod unschedulable. In order to allocate all VFs from one PF before moving on to the next, the DP has to support some kind of resource ordering or preferential allocation (can something like https://github.com/kubernetes/enhancements/pull/1121 be used?)

sseetharaman6 avatar Jul 22 '20 20:07 sseetharaman6

Yes, but say I have 2 VFs per PF and request 3 VFs in the pod spec; advertising each PF as its own resource will make this pod unschedulable.

In this case, you will need to put two resource requests in the pod spec: the first requesting 2 VF resources, the second requesting 1. I understand this may not be exactly what you asked for.
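The two-request workaround above would look roughly like this in a pod spec (the resource names and the default `intel.com` prefix are assumptions; they depend on how the pools were actually configured):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sriov-pod
spec:
  containers:
  - name: app
    image: busybox
    resources:
      requests:
        intel.com/sriov_foo_pf0: "2"   # both VFs from the first PF's pool
        intel.com/sriov_foo_pf1: "1"   # one VF from the second PF's pool
      limits:
        intel.com/sriov_foo_pf0: "2"
        intel.com/sriov_foo_pf1: "1"
```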

In order to allocate all VFs from one PF before moving on to the next, the DP has to support some kind of resource ordering or preferential allocation (can something like kubernetes/enhancements#1121 be used?)

Thanks for linking the reference! First of all, I think we should update the device plugin to support the new GetPreferredAllocation interface. Regarding how the device plugin should decide the preferred allocation, my understanding is that it may differ per use case. For example, sometimes a user may want to distribute workloads across different PFs to balance the load on each interface; in other cases, like the one you mentioned, it may be preferable to consume all resources from a single PF before using the next one. It looks to me like we may not have a unified solution for how the device plugin should decide the preferred allocation, but maybe it is possible to define several preferred-allocation policies and let the user choose which one to apply when launching the device plugin.

zshi-redhat avatar Jul 22 '20 23:07 zshi-redhat
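A minimal sketch of the "consume one PF first" (bin-packing) policy discussed above, expressed as a plain function. The device IDs, interface names, and the VF-to-PF mapping here are illustrative; in the real plugin the mapping would come from sysfs, and the function would back a GetPreferredAllocation handler rather than stand alone:

```go
package main

import (
	"fmt"
	"sort"
)

// preferPacked picks `need` device IDs from the available set, exhausting
// one PF before moving to the next. pfOf maps a VF device ID to its parent
// PF name (hypothetical input; the real plugin would discover this itself).
func preferPacked(available []string, pfOf map[string]string, need int) []string {
	// Group available VFs by their parent PF.
	byPF := map[string][]string{}
	for _, id := range available {
		byPF[pfOf[id]] = append(byPF[pfOf[id]], id)
	}
	// Visit PFs with the most free VFs first, so a single PF can satisfy
	// the request whenever possible; break ties by name for determinism.
	pfs := make([]string, 0, len(byPF))
	for pf := range byPF {
		pfs = append(pfs, pf)
	}
	sort.Slice(pfs, func(i, j int) bool {
		if len(byPF[pfs[i]]) != len(byPF[pfs[j]]) {
			return len(byPF[pfs[i]]) > len(byPF[pfs[j]])
		}
		return pfs[i] < pfs[j]
	})
	picked := []string{}
	for _, pf := range pfs {
		for _, id := range byPF[pf] {
			if len(picked) == need {
				return picked
			}
			picked = append(picked, id)
		}
	}
	return picked
}

func main() {
	pfOf := map[string]string{
		"0000:03:02.0": "ens1f0", "0000:03:02.1": "ens1f0",
		"0000:03:0a.0": "ens1f1", "0000:03:0a.1": "ens1f1",
	}
	avail := []string{"0000:03:02.0", "0000:03:0a.0", "0000:03:02.1", "0000:03:0a.1"}
	// Requesting 3 VFs drains ens1f0 completely before touching ens1f1.
	fmt.Println(preferPacked(avail, pfOf, 3))
}
```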

Facing the same issue, +1

RahulG115 avatar Aug 06 '20 09:08 RahulG115

@zshi-redhat we should be able to implement this at the per-pool level, with some device pools marked as "packers" and others as "spreaders". Is there anything else the preferred allocation could be used for that might fit in, or be even more relevant?

killianmuldoon avatar Aug 06 '20 09:08 killianmuldoon

@zshi-redhat we should be able to implement this at the per-pool level, with some device pools marked as "packers" and others as "spreaders". Is there anything else the preferred allocation could be used for that might fit in, or be even more relevant?

@killianmuldoon I think we could have two policies, as you already mentioned: one for allocating VFs evenly across multiple PFs (in the same pool), the other for allocating all VFs from one PF until it is exhausted, then moving to the next PF.

zshi-redhat avatar Aug 06 '20 12:08 zshi-redhat
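For comparison with the packing policy, the "spreader" policy mentioned above could be sketched as a round-robin over PFs. Again, the names and the VF-to-PF map are illustrative only, not the plugin's actual implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// preferSpread picks `need` device IDs round-robin across PFs, so the load
// is balanced over uplinks — the counterpart of a "packed" policy.
func preferSpread(available []string, pfOf map[string]string, need int) []string {
	// Group available VFs by their parent PF.
	byPF := map[string][]string{}
	for _, id := range available {
		byPF[pfOf[id]] = append(byPF[pfOf[id]], id)
	}
	pfs := make([]string, 0, len(byPF))
	for pf := range byPF {
		pfs = append(pfs, pf)
	}
	sort.Strings(pfs) // deterministic PF order for the sketch

	picked := []string{}
	for len(picked) < need {
		progress := false
		for _, pf := range pfs {
			if len(byPF[pf]) == 0 {
				continue // this PF is exhausted
			}
			picked = append(picked, byPF[pf][0])
			byPF[pf] = byPF[pf][1:]
			progress = true
			if len(picked) == need {
				break
			}
		}
		if !progress {
			break // fewer devices available than requested
		}
	}
	return picked
}

func main() {
	pfOf := map[string]string{
		"vf-a0": "pf-a", "vf-a1": "pf-a",
		"vf-b0": "pf-b", "vf-b1": "pf-b",
	}
	// Requesting 3 VFs alternates PFs: one from pf-a, one from pf-b, then pf-a again.
	fmt.Println(preferSpread([]string{"vf-a0", "vf-a1", "vf-b0", "vf-b1"}, pfOf, 3))
}
```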

@zshi-redhat - this approach makes sense to me. Is there work underway to add the GetPreferredAllocation interface?

sseetharaman6 avatar Aug 10 '20 18:08 sseetharaman6

@zshi-redhat - this approach makes sense to me. Is there work underway to add the GetPreferredAllocation interface?

I do not think anyone is working on this. It will be discussed at the next network and resource management meeting.

martinkennelly avatar Aug 11 '20 13:08 martinkennelly

@zshi-redhat - this approach makes sense to me. Is there work underway to add the GetPreferredAllocation interface?

I do not think anyone is working on this. It will be discussed at the next network and resource management meeting.

Update: this was discussed at Monday's meeting, and we agreed to support this new API in the SR-IOV device plugin. However, it is not currently assigned to anyone; please feel free to take it if you are interested in working on it.

zshi-redhat avatar Aug 19 '20 08:08 zshi-redhat

@sseetharaman6 FYI, this feature was added via PR #267 if you'd like to do some testing or have any suggestions.

zshi-redhat avatar Sep 10 '20 01:09 zshi-redhat

First scenario: I have two PFs (PF-A, PF-B) and I define two resources (R-A, R-B). Then I create a pod requesting both resources (R-A: 1, R-B: 1).

Second scenario: I have two PFs (PF-A, PF-B) and I define one resource (R). Then I create a pod requesting the resource (R: 2), and the kubelet allocates the two VFs from a single PF, either PF-A or PF-B.

I would like to know whether there is any difference between these two scenarios for pod networking. For example, which one is best for deep learning (TensorFlow, PyTorch, and so on)? Thanks!

qingshanyinyin avatar Sep 02 '21 00:09 qingshanyinyin

I would like to know whether there is any difference between these two scenarios for pod networking.

If you need two additional network interfaces for the pod, configured by a supporting CNI plugin, then IIRC only the second scenario will work.

If you just want two VFs allocated to the pod (with no CNI config required), then sending traffic from different PFs (different uplinks) would probably be faster.

There is also another consideration that affects performance: NUMA alignment of memory, CPU, and PCI. In this case you would want all of them to be aligned.

adrianchiris avatar Sep 02 '21 08:09 adrianchiris

If you need two additional network interfaces for the pod, configured by a supporting CNI plugin, then IIRC only the second scenario will work.

For the first scenario, couldn't you just define two NADs (net-a, net-b) with associated DP selectors (pfNames), each selecting an individual PF? Then put net-a and net-b in your network request annotation. You get a VF from each PF then. What am I missing?

martinkennelly avatar Sep 02 '21 10:09 martinkennelly
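The two-NAD approach above might look roughly like the following (the names, resource, and IPAM settings are illustrative assumptions; net-b would be analogous but annotated with the second PF's resource):

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: net-a
  annotations:
    # Ties this attachment to the device-plugin resource backed by one PF
    k8s.v1.cni.cncf.io/resourceName: intel.com/sriov_foo_pf0
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "sriov",
    "ipam": { "type": "host-local", "subnet": "10.56.0.0/24" }
  }'
```

The pod would then request both attachments with `k8s.v1.cni.cncf.io/networks: net-a, net-b` and the matching resource requests, receiving one VF from each PF.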

If you need two additional network interfaces for the pod, configured by a supporting CNI plugin, then IIRC only the second scenario will work.

Correction, I meant the first scenario. Having two network-attachment-definitions, each associated with a different resource, will work. Having both network-attachment-definitions associated with the same resource will not work (I think),

since Multus would need to provide each attachment with a different DeviceID from the same resource on the CmdAdd call (i.e., pass the first device ID to the delegate CNI on the first call and the second device ID on the second call).

adrianchiris avatar Sep 02 '21 10:09 adrianchiris

I have solved the first scenario, thanks! @adrianchiris @martinkennelly Now I need to do another task. I will define only one resource across many PFs (8 or more), and I want the kubelet to allocate VFs from each of the PFs. For example: request sriov-resource: 1, allocate VFs: 8 (if there are 8 PFs on the node, with the 8 VFs coming from different PFs). I would like to know whether this will work if I only modify the sriov-device-plugin and do not modify Multus.

qingshanyinyin avatar Sep 09 '21 00:09 qingshanyinyin