
ListAndWatch fails when managing large-memory GPUs such as the NVIDIA Tesla V100

Open zzr93 opened this issue 4 years ago • 16 comments

This issue is an extension of #18

What happened: After applying volcano-device-plugin on a server with 8 * V100 GPUs, `kubectl describe nodes` reports volcano.sh/gpu-memory: 0:

(screenshot: node description showing volcano.sh/gpu-memory: 0)

The same situation did not occur with T4 or P4 cards. Tracing the kubelet logs, I found the following error message:

(screenshot: kubelet log showing the gRPC error)

It seems the sync message is too large.

What caused this bug: volcano-device-plugin mocks each GPU into a device list (every device in this list represents a 1MB memory block), so that different workloads can share one GPU through the Kubernetes device plugin mechanism. With a large-memory GPU such as the V100, the size of the device list exceeds the bound, and ListAndWatch fails as a result.
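A minimal sketch of the failure mode, assuming 32GB V100s and the default 4MB gRPC message limit; `mockDevices` and the device ID format are illustrative assumptions, not the plugin's actual code:

```go
package main

import "fmt"

// mockDevices returns one pseudo-device ID per memory block of a GPU,
// mirroring how the plugin advertises shareable memory to the kubelet.
// The helper name and ID format are assumptions for illustration.
func mockDevices(gpuID string, memMB, blockMB int) []string {
	devices := make([]string, 0, memMB/blockMB)
	for i := 0; i < memMB/blockMB; i++ {
		devices = append(devices, fmt.Sprintf("%s-block-%d", gpuID, i))
	}
	return devices
}

func main() {
	// 8 x V100 (32GB each) at 1MB per block: 8 * 32768 = 262144 devices.
	// Serialized into a single ListAndWatch response, a list this large
	// can exceed the default 4MB gRPC message size and fail.
	total := 0
	for gpu := 0; gpu < 8; gpu++ {
		total += len(mockDevices(fmt.Sprintf("gpu-%d", gpu), 32*1024, 1))
	}
	fmt.Println("devices advertised:", total) // 262144
}
```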

Solution: The key is to minimize the size of the device list, so we can treat each device as a 10MB memory block and rework the whole bookkeeping process around this assumption. This granularity is accurate enough for almost all production environments.
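A quick sketch of how the block size drives the list size, again assuming a 32GB card; the numbers are plain arithmetic, not measured output from the plugin:

```go
package main

import "fmt"

func main() {
	const gpuMemMB = 32 * 1024 // one 32GB V100
	for _, blockMB := range []int{1, 10, 100} {
		fmt.Printf("block = %3dMB -> %5d devices per GPU\n",
			blockMB, gpuMemMB/blockMB)
	}
	// block =   1MB -> 32768 devices per GPU
	// block =  10MB ->  3276 devices per GPU
	// block = 100MB ->   327 devices per GPU
}
```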

zzr93 avatar Nov 26 '21 09:11 zzr93

Thanks for your report and debugging. The analysis is helpful and we will fix it as soon as possible.

Thor-wl avatar Nov 26 '21 09:11 Thor-wl

Requesting more input on how large a block should be (the default is 1MB) so that the value suits all supported GPU cards.

Thor-wl avatar Nov 30 '21 10:11 Thor-wl

100MB per block should work fine. Inference services usually use hundreds to thousands of MB of memory (training services usually use much more), so we really do not care about memory fragments smaller than 100MB.
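A minimal sketch of the rounding that a 100MB block implies; `blocksNeeded` is a hypothetical helper, not part of the plugin:

```go
package main

import "fmt"

// blocksNeeded rounds a memory request up to whole blocks, so any
// fragment smaller than one block is simply absorbed by the last block.
func blocksNeeded(requestMB, blockMB int) int {
	return (requestMB + blockMB - 1) / blockMB
}

func main() {
	// A 1550MB inference service gets 16 blocks (1600MB): at most
	// 100MB is "wasted", which is negligible at this scale.
	fmt.Println(blocksNeeded(1550, 100)) // 16
}
```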

zzr93 avatar Dec 01 '21 07:12 zzr93

> 100MB per block should work fine. Inference services usually use hundreds to thousands of MB of memory (training services usually use much more), so we really do not care about memory fragments smaller than 100MB.

I see. I'll take this issue to the weekly meeting for discussion. Would you like to share your ideas in the meeting?

Thor-wl avatar Dec 02 '21 01:12 Thor-wl

> > 100MB per block should work fine. Inference services usually use hundreds to thousands of MB of memory (training services usually use much more), so we really do not care about memory fragments smaller than 100MB.
>
> I see. I'll take this issue to the weekly meeting for discussion. Would you like to share your ideas in the meeting?

My pleasure, I'll be there.

zzr93 avatar Dec 02 '21 07:12 zzr93

See you at 15:00.

Thor-wl avatar Dec 03 '21 06:12 Thor-wl

> See you at 15:00.

Awww that's sweet.🥺

jasonliu747 avatar Dec 03 '21 06:12 jasonliu747

Is this issue resolved at present?

lakerhu999 avatar Jan 17 '22 01:01 lakerhu999

> Is this issue resolved at present?

Not yet. We are considering a graceful way to make the fix without modifying the gRPC message size directly.
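For context, a hedged sketch of what "modifying the gRPC directly" would mean on the receiving side: raising the client's max receive message size above the 4MB default. The socket path below is hypothetical, and this is exactly the change we want to avoid, since the ListAndWatch client is the kubelet rather than the plugin:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	// Raise the client-side receive limit from the 4MB default to 16MB so
	// an oversized ListAndWatch response would fit. Shown for context only.
	conn, err := grpc.Dial(
		"unix:///var/lib/kubelet/device-plugins/volcano.sock", // hypothetical socket path
		grpc.WithInsecure(),
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(16*1024*1024)),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```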

Thor-wl avatar Jan 18 '22 09:01 Thor-wl

Any update on this issue?

lakerhu999 avatar Feb 15 '22 03:02 lakerhu999

> Any update on this issue?

Not yet. Sorry, I have been busy developing another feature recently. Will fix it ASAP.

Thor-wl avatar Feb 16 '22 01:02 Thor-wl

Our product still has the same bug as this issue. Once it is fixed, please close this issue.

lakerhu999 avatar Mar 01 '22 08:03 lakerhu999

> Our product still has the same bug as this issue. Once it is fixed, please close this issue.

OK, it's still on the way. I'll close the issue after the bug is fixed.

Thor-wl avatar Mar 01 '22 11:03 Thor-wl

How is this going?

pauky avatar Apr 11 '22 10:04 pauky

https://github.com/volcano-sh/devices/pull/22 may resolve this issue

shinytang6 avatar May 03 '22 09:05 shinytang6

Has our latest image been published on the public network? @shinytang6

XueleiQiao avatar Jul 05 '22 02:07 XueleiQiao