vgpu not restricting memory in the container
What happened:
When running the vGPU example provided in the docs with a vGPU memory limit set, the container does not respect the limit: nvidia-smi inside the container still shows the full 32GB of the V100.
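For reference, this is how the reported memory was checked (a sketch; it assumes the example pod name gpu-pod12 used later in this thread):
# query the total memory the driver reports inside the container
kubectl exec gpu-pod12 -- nvidia-smi --query-gpu=memory.total --format=csv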
What you expected to happen:
The memory visible inside the container should be limited by the volcano.sh/vgpu-memory configuration.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?: nvidia-smi version 545.23.08, MIG M.: N/A
Environment:
- Volcano Version: 1.8.x
- Kubernetes version (use kubectl version): v1.28.x
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
/assign @archlitchi
Hey @archlitchi, Can you suggest something for this?
could you provide the following information:
- The vgpu-task yaml you submitted?
- "env" result inside container
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod12
spec:
  schedulerName: volcano
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          volcano.sh/vgpu-number: 1
          volcano.sh/vgpu-memory: 200
  nodeSelector: ...
  tolerations: ...
EOF
The nodeSelector and tolerations are private, so I can't show them here. Let me know if these properties can also affect the behavior of vGPU.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod12
spec:
  schedulerName: volcano
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          volcano.sh/vgpu-number: 1
          volcano.sh/vgpu-memory: 3000
  nodeSelector: ...
  tolerations: ...
EOF
Could you provide the 'env' result inside container?
I won't be able to copy the complete output. If you are looking for a particular property, I should be able to get that for you.
Okay, please list the env variables that contain the keyword 'CUDA' or 'NVIDIA'.
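For example (a sketch; assumes the pod name gpu-pod12 from the yaml above):
kubectl exec gpu-pod12 -- env | grep -E 'CUDA|NVIDIA'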
I did not print the output of NVIDIA_REQUIRE_CUDA because it's too long to type. Please bear with me.
NVIDIA_VISIBLE_DEVICES=GPU-c571e691-40c8-ee08-1ebc-2b28c2258b76
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
NV_CUDA_CUDART_VERSION=11.8.89-1
CUDA_VERSION=11.8.0
NVCUDA_LIB_VERSION=11.8.0-1
CUDA_DEVICE_MEMORY_LIMIT_0=200m
CUDA_DEVICE_MEMORY_SHARED_CACHE=/tmp/vgpu/<hash>.cache
Hmm... is this container the one in the yaml file? You allocated 3G in your yaml, but here it only gets 200M. Besides, this is probably a CUDA image, not a typical ubuntu:18.04.
Sorry, I ran a different yaml; everything else is the same except the memory is 200m. I updated the earlier comment as well.
Please check whether the following files exist inside the container, AND that the size of each file is NOT 0 (a check command is sketched after the list):
- /usr/local/vgpu/libvgpu.so
- /etc/ld.so.preload
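One way to check both (a sketch, assuming the pod name gpu-pod12 from the yaml above):
# prints the size for files that exist; ls reports an error for any missing file
kubectl exec gpu-pod12 -- ls -l /usr/local/vgpu/libvgpu.so /etc/ld.so.preload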
- /usr/local/vgpu/libvgpu.so -> exists with non-zero size
- /etc/ld.so.preload -> does not exist
Okay, I got it. Please use the image volcanosh/volcano-vgpu-device-plugin:dev-vgpu-1219 instead in volcano-vgpu-device-plugin.yml.
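For reference, the change would touch roughly these fields in volcano-vgpu-device-plugin.yml (an excerpt-style sketch only; the surrounding DaemonSet layout and the container name are assumptions, not copied from the actual manifest):
apiVersion: apps/v1
kind: DaemonSet
spec:
  template:
    spec:
      containers:
        - name: volcano-device-plugin   # container name is an assumption
          image: volcanosh/volcano-vgpu-device-plugin:dev-vgpu-1219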
okay, let me try this!!!
Hey @archlitchi, the mentioned error occurs on that same image (volcanosh/volcano-vgpu-device-plugin:dev-vgpu-1219). It was deployed a month ago; has anything changed since then?
Hey @archlitchi, any other suggestions to fix this?
Hi @archlitchi, I am also facing the same issue with the volcano vGPU feature. Could you guide me on enabling this feature? Thanks in advance.
@kunal642
OK, I'm looking into it now; sorry I didn't see your replies over the last two weeks.
@EswarS @kunal642 please use the image projecthami/volcano-vgpu-device-plugin:v1.9.0 instead; the volcano image will no longer provide hard device isolation among containers due to community policies.
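If the plugin is already deployed, one way to switch images is to update the DaemonSet in place (a sketch; the namespace, DaemonSet name, and container name below are assumptions, adjust them to match your volcano-vgpu-device-plugin.yml):
kubectl -n kube-system set image daemonset/volcano-device-plugin \
  volcano-device-plugin=projecthami/volcano-vgpu-device-plugin:v1.9.0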
@archlitchi is the usage same for vgpu-memory and vgpu-number configurations?
Is this device plugin compatible with the volcano 1.8.2 release package?
I deployed the device plugin and am facing the following errors:
Initializing .....
Fail to open shrreg ***.cache (errorno:11)
Fail to init shrreg ****.cache (errorno:9)
Fail to write shrreg ***.cache (errorno:9)
Fail to reseek shrreg ***.cache (errorno:9)
Fail to lock shrreg ***.cache (errorno:9)
Yes, the usage is the same. Can you run your task now?
The vgpu-device-plugin mounts your hostPath "/tmp/vgpu/containers/{containerUID}_{ctrName}" into the containerPath "/tmp/vgpu"; please check whether the corresponding hostPath exists.
volumeMounts:
  - mountPath: /var/lib/kubelet/device-plugins
    name: device-plugin
  - mountPath: /usr/local/vgpu
    name: lib
  - mountPath: /tmp
    name: hosttmp
The above are the volume mounts configured in the device-plugin daemon. Do I need to make any changes?
@EswarS No, I mean, after you submit a vgpu task to volcano, please check the following (quick commands are sketched after this list):
- Does the corresponding folder "/tmp/vgpu/containers/{containerUID}_{ctrName}" exist on your corresponding GPU node?
- Does the folder "/tmp/vgpu" exist inside the vgpu-task container?
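Roughly (a sketch; the pod and container names are the examples used earlier in this thread, and the exact {containerUID} is not known here, so the node-side check globs by container name):
# on the GPU node
ls -ld /tmp/vgpu/containers/*_ubuntu-container
# inside the vgpu-task container
kubectl exec gpu-pod12 -- ls -ld /tmp/vgpu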
I have the same problem in version 1.8.1.
The graphics card is an RTX 3090 24GB and the container is set to limits volcano.sh/vgpu-memory: '10240'. When executing nvidia-smi in the container, the card displays a memory of 10240MiB, but in reality the process can use more memory, such as 20480MiB.
What I expect is that the process cannot use memory beyond the limit, which is 10240MiB.
@archlitchi, we submitted the pod with kubectl with a pod volume mount of emptyDir: {}, and it works. Our observation in this case is that the pod owner is root.
Do we really need to add the emptyDir: {} volume?
When the same pod is submitted by another non-root user (namespace user), it is not able to access the folder. /tmp/vgpu has 777 root:root permissions at the node level.
Here I have a use case where different namespace users share the same GPU and /tmp/vgpu needs write permission; I cannot set the group to the namespace group.
Could you suggest how to handle this problem?
Is libvgpu.so loaded successfully?
Check the logs of your pod, which may confirm this for you. If it is already loaded, the problem may be with the mounts; check your pod securityContext fsGroup.
Also check the volume mounts on your pod and the permissions on the node's /tmp/vgpu and /tmp/vgpulock folders.
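For reference, the pod-level fields being referred to look roughly like this (a sketch only; the numeric IDs are illustrative assumptions, not a confirmed fix for the /tmp/vgpu permission issue):
spec:
  securityContext:
    runAsUser: 1000   # illustrative non-root UID
    fsGroup: 1000     # illustrative GID; applied to supported pod volumes such as emptyDir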
Okay, I got it. We will try to mount it into the '/usr/local/vgpu' folder inside the container in the next version.
@archlitchi, I have one more question: why can't we allocate more vGPUs than physical GPUs in a single container?