Is it necessary to install and run the gpu-operator for time-slicing to work?
1. Issue or Feature description
Is it necessary to install and run the gpu-operator for time-slicing to work, or is the device-plugin enough? Or is there something else wrong with my setup?
I have configured the nvidia-device-plugin with a config that implements time-slicing on my 4 RTX 3090 Ti cards, and the GPU nodes expose the correct number of time-sliced GPUs (I have 4 hardware GPUs, and with a time-slicing replica count of 4 I see 16 GPU resources on the node).
I can start 10 concurrent pods, each receiving a GPU.
However, when I run TensorFlow analyses in the pods, I can only run analyses in as many concurrent pods as I have hardware GPUs. If I run more, the analyses in the pods crash. Running the analyses sequentially across all 10 pods works, so they all have access to a GPU.
I have not installed the gpu-operator; is that necessary for time-slicing to work? Or is there something else wrong with my setup?
my plugin config:
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
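With this config in place, a pod requests one of the replicated GPUs the same way it would request a full GPU. A minimal sketch of such a test pod (the pod name and image are placeholders, not from my setup):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-slice-test                           # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-test
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
    command: ["nvidia-smi"]                      # just prints the GPU visible to the container
    resources:
      limits:
        nvidia.com/gpu: 1                        # one time-sliced replica, not a dedicated physical GPU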
These are the pods running in our cluster:
kubectl get po -n nvidia-device-plugin
NAME READY STATUS RESTARTS AGE
nvdp-gpu-feature-discovery-jvsr9 2/2 Running 0 2d11h
nvdp-gpu-feature-discovery-x4n4w 2/2 Running 0 2d11h
nvdp-node-feature-discovery-master-6954c9cd9c-76f4v 1/1 Running 0 2d11h
nvdp-node-feature-discovery-worker-6db58 1/1 Running 1 (2d11h ago) 2d11h
nvdp-node-feature-discovery-worker-6xr9b 1/1 Running 1 (2d11h ago) 2d11h
nvdp-node-feature-discovery-worker-97x7n 1/1 Running 0 2d11h
nvdp-node-feature-discovery-worker-nvzpj 1/1 Running 1 (2d11h ago) 2d11h
nvdp-node-feature-discovery-worker-qmrd4 1/1 Running 1 (2d11h ago) 2d11h
nvdp-nvidia-device-plugin-5lpjk 2/2 Running 0 2d11h
nvdp-nvidia-device-plugin-jbjcx 2/2 Running 0 2d11h
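For reference, the 16 replicated GPUs show up in the node's capacity/allocatable, which can be checked with something like (node name is a placeholder):

kubectl describe node <gpu-node-name> | grep nvidia.com/gpu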
I am wondering as well.
It is not required to use the GPU Operator for time-slicing. One thing to note is that time-slicing does not prevent the different processes sharing a GPU from using all of the GPU's memory. Could it be that the first pod scheduled to a GPU is consuming all of its memory?
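If that is the cause, one thing to try (a sketch, assuming the workloads use TensorFlow's default allocator, which reserves nearly all GPU memory at startup) is to make TensorFlow allocate memory on demand, for example by setting TF_FORCE_GPU_ALLOW_GROWTH in the pod spec. The pod name and image below are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: tf-analysis                              # placeholder name
spec:
  containers:
  - name: tf
    image: tensorflow/tensorflow:latest-gpu      # placeholder image
    env:
    - name: TF_FORCE_GPU_ALLOW_GROWTH            # allocate GPU memory as needed instead of reserving it all
      value: "true"
    resources:
      limits:
        nvidia.com/gpu: 1

Note that even with allow-growth, the combined working set of all pods sharing one physical GPU still has to fit in that card's memory; this only avoids the up-front full reservation.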
I've tested time-slicing with RHEL 9.3, MicroShift 4.15.z, a Standard NC4as T4 v3 VM (4 vCPUs, 28 GiB memory), and an NVIDIA TU104GL [Tesla T4] GPU, and it works as expected. I just created PR #702 with a sample manifest.
@elezar does it make sense to have this PR accepted?
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.