HAMi

Published 20 hours ago

The allocated GPU memory does not match the actual one.

Open chaunceyjiang opened this issue 6 months ago • 2 comments

1. Issue or feature description

I requested only 1024 MiB of GPU memory for the pod, but it can actually use up to 30480 MiB.

I eventually traced this to another pod on the same node that had requested 30480 MiB of GPU memory and happened to be restarting at the time.
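
For anyone who wants to check for the same mismatch, here is a rough sketch of a comparison that could be run inside the affected container. It assumes that `nvidia-smi` in the container reports the memory limit applied to the pod, which may not hold in every setup. The helper names (`parse_total_mib`, `exceeds_request`, `query_visible_total_mib`) are made up for illustration and are not part of HAMi.

```python
import subprocess


def parse_total_mib(smi_output: str) -> list[int]:
    """Parse one integer MiB value per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def exceeds_request(visible_mib: list[int], requested_mib: int) -> bool:
    """Return True if any visible GPU exposes more memory than was requested."""
    return any(value > requested_mib for value in visible_mib)


def query_visible_total_mib() -> list[int]:
    """Ask nvidia-smi for the total memory of each visible GPU, in MiB.

    Assumption: this runs inside the pod's container and nvidia-smi is on PATH.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_total_mib(result.stdout)
```

With the numbers from this report, `exceeds_request([30480], 1024)` is true, which is the symptom described above. Calling `exceeds_request(query_visible_total_mib(), 1024)` in the container would run the same check against live output.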

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

Common error checking:

[ ] The output of nvidia-smi -a on your host
[ ] Your docker or containerd configuration file (e.g. /etc/docker/daemon.json)
[ ] The vgpu-device-plugin container logs
[ ] The vgpu-scheduler container logs
[ ] The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

[ ] Docker version from docker version
[ ] Docker command, image and tag used
[ ] Kernel version from uname -a
[ ] Any relevant kernel output lines from dmesg

Jul 29 '24 09:07 chaunceyjiang