volcano RFVE: Support GPU topology

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature
/area scheduling
/priority important-soon

Description:

GPU topology is important to the performance of running tasks, so it is necessary to improve both the scheduler and the kubelet to support GPU topology.

Dec 19 '19 02:12 k82cn

/cc @carmark @Jeffwan

Dec 19 '19 02:12 k82cn

I found one relevant project in this year's Google Summer of Code. https://summerofcode.withgoogle.com/archive/2019/projects/6336863634194432/

Dec 20 '19 08:12 Rui-Tang

@k82cn @Rui-Tang You may refer to this project as an example.

Dec 20 '19 08:12 carmark

@carmark , thanks very much for the info :) I'd like to build something based on that example :)

Dec 24 '19 08:12 k82cn

/kind rfve

Dec 24 '19 08:12 k82cn

I was taking days off recently and didn't get a chance to attend the meeting.

Topology Manager integration with device plugins is in the latest Kubernetes. Do we plan to leverage this or do something different? https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-integration-with-the-topology-manager

Dec 26 '19 22:12 Jeffwan

@Jeffwan The official Topology Manager for device plugins only covers devices with socket information. For different devices and topologies, we may need more information and a more detailed scheduler to get better performance. For GPUs, for example: should the two GPUs be selected within a socket, within a PIX, or within a PHB?
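To illustrate the kind of decision described here, the following is a minimal sketch and not an existing volcano or Kubernetes API. It assumes a hypothetical pairwise link table for a 4-GPU node, labeled with the link types printed by `nvidia-smi topo -m` (PIX, PXB, PHB, NODE, SYS), and picks the group of free GPUs whose worst pairwise link is the closest. `TOPOLOGY`, `link`, and `best_group` are illustrative names only.

```python
from itertools import combinations

# Link types as printed by `nvidia-smi topo -m`, ordered from closest to farthest.
LINK_COST = {"PIX": 1, "PXB": 2, "PHB": 3, "NODE": 4, "SYS": 5}

# Hypothetical 4-GPU node (an assumption for this sketch): GPUs 0/1 share a
# PCIe switch, as do GPUs 2/3; the two pairs sit on different sockets.
TOPOLOGY = {
    (0, 1): "PIX",
    (2, 3): "PIX",
    (0, 2): "SYS", (0, 3): "SYS", (1, 2): "SYS", (1, 3): "SYS",
}


def link(a, b):
    """Look up the link type between two GPU indices, in either order."""
    return TOPOLOGY[(min(a, b), max(a, b))]


def best_group(free_gpus, count):
    """Return the `count` free GPUs whose worst pairwise link is closest.

    Returns None when fewer than `count` GPUs are free.
    """
    best, best_cost = None, None
    for group in combinations(sorted(free_gpus), count):
        cost = max(
            (LINK_COST[link(a, b)] for a, b in combinations(group, 2)),
            default=0,
        )
        if best_cost is None or cost < best_cost:
            best, best_cost = group, cost
    return best
```

With GPU 1 already taken, `best_group({0, 2, 3}, 2)` prefers the PIX-connected pair over any pair that crosses sockets. The exhaustive search is only reasonable for the small GPU counts found on a single node.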

Dec 27 '19 01:12 carmark

Yeah. It makes sense.

Dec 27 '19 04:12 Jeffwan

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

Aug 18 '20 06:08 stale[bot]

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

Oct 17 '20 07:10 stale[bot]

/feature

Apr 16 '21 07:04 Thor-wl

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

Jul 15 '21 09:07 stale[bot]

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

Oct 13 '21 12:10 stale[bot]

How is this issue going?

Dec 22 '21 04:12 tianzichenone

How about using nvml.DeviceGetTopologyCommonAncestor to build a GPU tree?
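A minimal sketch of how such a tree could be assembled from pairwise common-ancestor levels. It does not call NVML: `ancestor` is a hypothetical callback standing in for a per-pair query such as nvml.DeviceGetTopologyCommonAncestor, which would have to be run on the node itself. The level constants are meant to mirror NVML's `nvmlGpuTopologyLevel_t` values, but only their ordering matters here. `groups_at`, `build_tree`, and `fake_ancestor` are illustrative names.

```python
# Common-ancestor levels, ordered from closest to farthest. The numbers are
# meant to mirror NVML's nvmlGpuTopologyLevel_t; only the ordering is used.
SINGLE, MULTIPLE, HOSTBRIDGE, NODE, SYSTEM = 10, 20, 30, 40, 50
LEVELS = [SINGLE, MULTIPLE, HOSTBRIDGE, NODE, SYSTEM]


def groups_at(gpus, ancestor, level):
    """Partition `gpus` into groups linked by pairs whose ancestor is <= level."""
    parent = {g: g for g in gpus}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, a in enumerate(gpus):
        for b in gpus[i + 1:]:
            if ancestor(a, b) <= level:
                parent[find(a)] = find(b)

    groups = {}
    for g in gpus:
        groups.setdefault(find(g), []).append(g)
    return sorted(groups.values())


def build_tree(gpus, ancestor, depth=len(LEVELS) - 1):
    """Build a nested dict, splitting each node's GPUs at the next closer level."""
    node = {"level": LEVELS[depth], "gpus": list(gpus), "children": []}
    if depth > 0 and len(gpus) > 1:
        for group in groups_at(list(gpus), ancestor, LEVELS[depth - 1]):
            node["children"].append(build_tree(group, ancestor, depth - 1))
    return node


def fake_ancestor(a, b):
    """Hypothetical stand-in for the NVML query: GPUs 0/1 and 2/3 share a switch."""
    return SINGLE if a // 2 == b // 2 else SYSTEM
```

Calling `build_tree([0, 1, 2, 3], fake_ancestor)` yields a root at the SYSTEM level with two children, one per switch. A scheduler could then walk the tree to find the smallest subtree that still has enough free GPUs.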

Dec 28 '21 10:12 jiangxiaobin96

Also, I don't think using the device plugin to build the GPU tree is a good approach. Although the device plugin can work together with volcano, we still need to check in the predicate function whether the node has enough GPU resources. So why not build the GPU tree in volcano directly, so that the device plugin only needs to bind a specific GPU?
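As a rough illustration of the predicate check mentioned here, the sketch below is written in Python for brevity and is not volcano's actual predicate interface. It assumes the scheduler already knows which GPUs on a node are free and how they are grouped by topology. `node_fits` and its parameters are hypothetical names.

```python
def node_fits(free_gpus, topology_groups, requested, same_group=False):
    """Return True if the node can satisfy a request for `requested` GPUs.

    free_gpus:       set of free GPU indices on the node
    topology_groups: list of GPU index lists that share a close link
                     (for example, the same PCIe switch)
    same_group:      if True, all requested GPUs must come from one group
    """
    if len(free_gpus) < requested:
        return False  # Not enough GPUs at all, regardless of topology.
    if not same_group:
        return True
    # Topology-aware case: some single group must hold enough free GPUs.
    return any(
        len(free_gpus & set(group)) >= requested for group in topology_groups
    )
```

For a node with groups `[[0, 1], [2, 3]]` and free GPUs `{1, 2}`, a plain request for two GPUs fits, while a request that requires both GPUs in the same group does not.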

Dec 28 '21 10:12 jiangxiaobin96

nvml needs to run on a specific node in order to get the topology info, so how would we build a topology tree in volcano directly?

Dec 28 '21 10:12 shinytang6

Regarding how this issue is going: not much progress. https://github.com/volcano-sh/volcano/pull/1779 does some research on GPU topology, but there is no specific plan.

Dec 28 '21 10:12 shinytang6

Or we could provide the GPU topology as a string.
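The thread does not define a string format, so the following is only a sketch under an assumed one: comma-separated entries of the form `A-B:LINK`, where `A` and `B` are GPU indices and `LINK` is a link label such as PIX or SYS. Both the format and the `parse_topology` name are hypothetical.

```python
def parse_topology(spec):
    """Parse a string like "0-1:PIX,2-3:PIX,0-2:SYS" into a pairwise link table.

    The format is an assumption for this sketch. Keys are (low, high) GPU
    index pairs; values are upper-cased link labels.
    """
    links = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue  # Tolerate trailing commas and blank entries.
        pair, kind = entry.split(":")
        a, b = (int(x) for x in pair.split("-"))
        links[(min(a, b), max(a, b))] = kind.strip().upper()
    return links
```

Such a string could be supplied per node, for example through a node annotation or a configuration file, so that the scheduler does not need to query the hardware itself.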

Dec 28 '21 10:12 jiangxiaobin96

I don't quite understand what you mean. Do you mean configuring the GPU topology of each node through configuration in volcano?

Dec 28 '21 10:12 shinytang6

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

Mar 30 '22 04:03 stale[bot]

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

Jul 30 '22 18:07 stale[bot]

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

Oct 01 '22 00:10 stale[bot]
