RFVE: Support GPU topology

Open k82cn opened this issue 4 years ago • 22 comments

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature /area scheduling /priority important-soon

Description:

GPU topology is important to the performance of running tasks; it is necessary to improve both the scheduler and the kubelet for GPU topology.

k82cn avatar Dec 19 '19 02:12 k82cn

/cc @carmark @Jeffwan

k82cn avatar Dec 19 '19 02:12 k82cn

I found one relevant project in this year's Google Summer of Code: https://summerofcode.withgoogle.com/archive/2019/projects/6336863634194432/

Rui-Tang avatar Dec 20 '19 08:12 Rui-Tang

@k82cn @Rui-Tang You may refer to this project as an example.

carmark avatar Dec 20 '19 08:12 carmark

@carmark , thanks very much for the info :) I'd like to build something based on that example :)

k82cn avatar Dec 24 '19 08:12 k82cn

/kind rfve

k82cn avatar Dec 24 '19 08:12 k82cn

I was taking days off recently and didn't get a chance to attend the meeting.

Topology Manager integration with device plugins is in the latest Kubernetes. Do we plan to leverage this or do something different? https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-integration-with-the-topology-manager

Jeffwan avatar Dec 26 '19 22:12 Jeffwan

@Jeffwan The official Topology Manager for device plugins only handles devices with socket information. For other devices and topologies, we may need more information and a more detailed scheduler to get better performance. For GPUs, for example: should we select two GPUs in the same socket, under the same PIX, or under the same PHB?

image
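
To make the question above concrete, here is a minimal sketch of picking the closest pair of free GPUs from a pairwise topology table. The labels follow the legend printed by `nvidia-smi topo -m`; the numeric ranking, the `closest_pair` helper, and the sample node are assumptions made for illustration and are not part of Volcano or any device plugin.

```python
from itertools import combinations

# Labels as printed by `nvidia-smi topo -m`, ordered from closest to farthest.
# The numeric ranks are an assumption made for this sketch only.
LEVEL_RANK = {"PIX": 0, "PXB": 1, "PHB": 2, "NODE": 3, "SYS": 4}


def closest_pair(free_gpus, topo):
    """Return the pair of free GPUs with the lowest (closest) rank, or None."""
    best = None
    for a, b in combinations(sorted(free_gpus), 2):
        rank = LEVEL_RANK[topo[frozenset((a, b))]]
        if best is None or rank < best[0]:
            best = (rank, (a, b))
    return None if best is None else best[1]


# Hypothetical 4-GPU node: GPUs 0/1 share one PCIe switch, GPUs 2/3 share
# another, and the two switches sit on different sockets.
SAMPLE_TOPO = {
    frozenset((0, 1)): "PIX",
    frozenset((2, 3)): "PIX",
    frozenset((0, 2)): "SYS",
    frozenset((0, 3)): "SYS",
    frozenset((1, 2)): "SYS",
    frozenset((1, 3)): "SYS",
}

print(closest_pair({0, 1, 2, 3}, SAMPLE_TOPO))  # (0, 1)
```

A socket-only view would treat GPUs 0 and 1 the same as any other pair on that socket, whereas a ranking like this can tell PIX apart from PHB.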

carmark avatar Dec 27 '19 01:12 carmark

Yeah. It makes sense.

Jeffwan avatar Dec 27 '19 04:12 Jeffwan

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Aug 18 '20 06:08 stale[bot]

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

stale[bot] avatar Oct 17 '20 07:10 stale[bot]

/feature

Thor-wl avatar Apr 16 '21 07:04 Thor-wl

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Jul 15 '21 09:07 stale[bot]

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Oct 13 '21 12:10 stale[bot]

How is this issue going?

tianzichenone avatar Dec 22 '21 04:12 tianzichenone

How about using nvml.DeviceGetTopologyCommonAncestor to build a GPU tree?
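
As a rough illustration of this idea, the sketch below nests GPUs into a tree from a table of pairwise common-ancestor levels. On a real node those levels would come from the NVML call named above; here they are hard-coded. The level constants are intended to mirror NVML's topology-level enum and should be checked against the NVML headers, and `groups_at`, `build_tree`, and the 5-GPU sample are hypothetical names and data.

```python
# Topology levels, intended to mirror NVML's nvmlGpuTopologyLevel_t enum
# (verify the exact numbers against the NVML headers before relying on them).
SINGLE, MULTIPLE, HOSTBRIDGE, NODE, SYSTEM = 10, 20, 30, 40, 50
LEVELS = [SINGLE, MULTIPLE, HOSTBRIDGE, NODE, SYSTEM]


def groups_at(gpus, ancestor, level):
    """Partition sorted `gpus` into groups connected at or below `level`."""
    parent = {g: g for g in gpus}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path halving
            g = parent[g]
        return g

    for i, a in enumerate(gpus):
        for b in gpus[i + 1:]:
            if ancestor[(a, b)] <= level:
                parent[find(b)] = find(a)

    groups = {}
    for g in gpus:
        groups.setdefault(find(g), []).append(g)
    return sorted(groups.values())


def build_tree(gpus, ancestor, levels=LEVELS):
    """Split GPUs at the widest level first, then recurse into each group."""
    if len(gpus) == 1 or not levels:
        return gpus
    *finer, level = levels
    groups = groups_at(gpus, ancestor, level)
    if len(groups) == 1:
        # Every GPU is connected at this level; look for a finer split.
        return build_tree(gpus, ancestor, finer)
    return [build_tree(group, ancestor, finer) for group in groups]


# Hypothetical node: GPUs 0/1 and 2/3 each share a single PCIe switch, all
# four share a host bridge, and GPU 4 is only reachable across the system.
ANCESTOR = {}
for a in range(5):
    for b in range(a + 1, 5):
        if b == 4:
            ANCESTOR[(a, b)] = SYSTEM
        elif (a, b) in ((0, 1), (2, 3)):
            ANCESTOR[(a, b)] = SINGLE
        else:
            ANCESTOR[(a, b)] = HOSTBRIDGE

print(build_tree([0, 1, 2, 3, 4], ANCESTOR))  # [[[0, 1], [2, 3]], [4]]
```

The nested lists drop the level labels for brevity; a scheduler would likely want to keep the level on each internal node.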

jiangxiaobin96 avatar Dec 28 '21 10:12 jiangxiaobin96

Also, I don't think using the device plugin to build the GPU tree is a good approach. Although the device plugin can work together with Volcano, we still need to check in the predicate function whether the node has enough GPU resources. So why not build the GPU tree in Volcano directly, so that the device plugin only needs to bind one specific GPU?

jiangxiaobin96 avatar Dec 28 '21 10:12 jiangxiaobin96

NVML needs to run on a specific node to get the topology info, so how would we build a topology tree in Volcano directly?

shinytang6 avatar Dec 28 '21 10:12 shinytang6

Regarding the status of this issue: not much progress so far. https://github.com/volcano-sh/volcano/pull/1779 does some research on GPU topology, but there is no specific plan.

shinytang6 avatar Dec 28 '21 10:12 shinytang6

Regarding how to build the topology tree in Volcano directly: perhaps we could supply the GPU topology as a string.
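
One possible reading of this suggestion, sketched under heavy assumptions: each node's topology is written as a compact string and parsed into a pairwise table. The `a,b:LABEL;...` format and the `parse_topology` helper are invented here for illustration; nothing in the thread or in Volcano defines such a format.

```python
def parse_topology(spec):
    """Parse 'a,b:LABEL;...' into {frozenset({a, b}): LABEL}.

    The string format is hypothetical and exists only for this sketch.
    """
    topo = {}
    for entry in filter(None, (part.strip() for part in spec.split(";"))):
        pair, _, label = entry.partition(":")
        label = label.strip().upper()
        ids = [int(x) for x in pair.split(",")]  # raises ValueError on non-integers
        if len(ids) != 2 or ids[0] == ids[1] or not label:
            raise ValueError(f"malformed entry: {entry!r}")
        topo[frozenset(ids)] = label
    return topo


print(parse_topology("0,1:PIX; 2,3:PIX; 0,2:SYS"))
```

Such a string would still have to be produced on each node, since the topology is only visible there.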

jiangxiaobin96 avatar Dec 28 '21 10:12 jiangxiaobin96

I don't quite understand what you mean. Do you mean configuring the GPU topology of each node through configuration in Volcano?

shinytang6 avatar Dec 28 '21 10:12 shinytang6

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Mar 30 '22 04:03 stale[bot]

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Jul 30 '22 18:07 stale[bot]

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

stale[bot] avatar Oct 01 '22 00:10 stale[bot]