Add an end-to-end solution for RDMA devices

Open ferris-cx opened this issue 4 months ago • 0 comments

Ⅰ. Describe what this PR does

Adds end-to-end support for RDMA devices, covering device discovery, device registration, node resource updates, scheduling, and allocation.

Ⅱ. Does this pull request fix one issue?

  1. Fix the missing functionality in node-side discovery and reporting of RDMA devices

  2. Fix the missing busId of a PF after allocation

  3. Fix the problem with injecting PF/VF devices into containers

Ⅲ. Describe how to verify it

  1. In the k8s cluster, prepare one or more servers with RDMA-capable NICs as cluster nodes. Install the new version of the koordlet component on each node, then check the node status. The reported resource quantity should equal the actual number of RDMA NICs on the node.

  2. Write a Pod spec that requests RDMA NIC resources, run kubectl apply -f pod.yaml, and check the scheduling result: in the Pod's annotations, look up the key scheduling.koordinator.sh/device-allocated. Its value will contain the RDMA allocation result.
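
The two checks above can be scripted. The following is a minimal sketch, not part of this PR: it assumes the RDMA resource is reported under the name koordinator.sh/rdma and that the annotation value is a JSON object with an "rdma" list. Neither detail is stated in this description, so adjust both to match what your cluster actually reports. The sample node and Pod objects are made up for illustration.

```python
import json
from typing import Optional

# ASSUMPTION: resource name and JSON key are not given in this PR description.
RDMA_RESOURCE = "koordinator.sh/rdma"
RDMA_KEY = "rdma"
# Annotation key named in the verification steps.
DEVICE_ALLOCATED_KEY = "scheduling.koordinator.sh/device-allocated"


def rdma_allocatable(node: dict) -> Optional[str]:
    """Return the RDMA quantity from node.status.allocatable, or None if absent."""
    return node.get("status", {}).get("allocatable", {}).get(RDMA_RESOURCE)


def rdma_allocations(pod: dict) -> list:
    """Return the RDMA entries from the device-allocated annotation, or []."""
    annotations = pod.get("metadata", {}).get("annotations") or {}
    raw = annotations.get(DEVICE_ALLOCATED_KEY)
    if raw is None:
        return []
    # The annotation value is itself a JSON string.
    return json.loads(raw).get(RDMA_KEY, [])


if __name__ == "__main__":
    # Hypothetical objects shaped like `kubectl get ... -o json` output.
    node = {"status": {"allocatable": {RDMA_RESOURCE: "2"}}}
    pod = {
        "metadata": {
            "annotations": {
                DEVICE_ALLOCATED_KEY: json.dumps(
                    {RDMA_KEY: [{"minor": 0, "resources": {RDMA_RESOURCE: "1"}}]}
                )
            }
        }
    }
    print("node RDMA allocatable:", rdma_allocatable(node))
    print("pod RDMA allocations:", rdma_allocations(pod))
```

To run it against a real cluster, save the output of kubectl get node NODE_NAME -o json and kubectl get pod POD_NAME -o json to files, load them with json.load, and pass the resulting dicts to the two functions.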

Ⅳ. Special notes for reviews

A complete end-to-end run also requires the cooperation of the multus-cni plugin, which follows the CNI specification and supports multi-NIC allocation. Assigning a PF/VF of the RDMA NIC to the Pod requires the device ID to be injected into that component; otherwise it will not work. That change will be maintained separately, either in the multus-cni project or in another PR.

Ⅴ. Checklist

  • [ ] I have written necessary docs and comments
  • [ ] I have added necessary unit tests and integration tests
  • [ ] All checks passed in make test

ferris-cx · Oct 10 '24 06:10