k8s-device-plugin

Is it necessary to install and run the gpu-operator for time-slicing to work?

Open andersla opened this issue 2 years ago • 3 comments

1. Issue or Feature description

Is it necessary to install and run the gpu-operator for time-slicing to work, or is the device-plugin on its own enough? Or is there something else wrong with my setup?

I have configured the nvidia-device-plugin with a config that implements time-slicing on my 4 RTX 3090 Ti cards, and the GPU nodes expose the correct number of time-sliced GPUs (I have 4 hardware GPUs, and with a time-slicing replica count of 4 I see 16 GPU resources on the node).

I can start 10 concurrent pods, each receiving a GPU.
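
Each pod just requests one of the replicated GPU resources; a minimal sketch of the kind of pod spec I use (image, names and command are placeholders):

# Illustrative pod requesting one time-sliced GPU replica.
apiVersion: v1
kind: Pod
metadata:
  name: tf-analysis
spec:
  restartPolicy: Never
  containers:
  - name: tf
    image: tensorflow/tensorflow:latest-gpu   # placeholder image
    command: ["python", "/workspace/analysis.py"]   # placeholder command
    resources:
      limits:
        nvidia.com/gpu: 1   # one of the 16 time-sliced replicas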

When I run TensorFlow analyses in the pods, however, I can only run them in as many concurrent pods as I have hardware GPUs. If I run more than that, the analyses in the extra pods crash. Running the analyses sequentially works on all 10 pods, so they all do have access to a GPU.

I have not installed the gpu-operator. Is it necessary for time-slicing to work, or is there something else wrong with my setup?

My plugin config:

version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
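
For reference, the config is wrapped in a ConfigMap and referenced from the device plugin's Helm chart; a minimal sketch of that (names are illustrative, and the exact chart value for pointing at an existing ConfigMap, config.name here, may differ between chart versions):

# Illustrative ConfigMap holding the time-slicing config shown above.
# The plugin's Helm chart is then pointed at it, e.g. via its config.name value.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvdp-plugin-config
  namespace: nvidia-device-plugin
data:
  default: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4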

These are the pods running in our cluster:

kubectl get po -n nvidia-device-plugin 
NAME                                                  READY   STATUS    RESTARTS        AGE
nvdp-gpu-feature-discovery-jvsr9                      2/2     Running   0               2d11h
nvdp-gpu-feature-discovery-x4n4w                      2/2     Running   0               2d11h
nvdp-node-feature-discovery-master-6954c9cd9c-76f4v   1/1     Running   0               2d11h
nvdp-node-feature-discovery-worker-6db58              1/1     Running   1 (2d11h ago)   2d11h
nvdp-node-feature-discovery-worker-6xr9b              1/1     Running   1 (2d11h ago)   2d11h
nvdp-node-feature-discovery-worker-97x7n              1/1     Running   0               2d11h
nvdp-node-feature-discovery-worker-nvzpj              1/1     Running   1 (2d11h ago)   2d11h
nvdp-node-feature-discovery-worker-qmrd4              1/1     Running   1 (2d11h ago)   2d11h
nvdp-nvidia-device-plugin-5lpjk                       2/2     Running   0               2d11h
nvdp-nvidia-device-plugin-jbjcx                       2/2     Running   0               2d11h

andersla · Nov 30 '23 12:11

I am wondering as well.

shahaf600 · Feb 06 '24 16:02

It is not required to use the GPU Operator for time-slicing. One thing to note is that time-slicing does not prevent the different processes sharing a GPU from using all of the GPU's memory. Could it be that the first pod scheduled to a GPU is consuming all of its memory?
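
In case it helps: TensorFlow by default reserves (almost) all memory on the GPU it sees, so with time-slicing the first analysis can starve the others. A first thing to try is TensorFlow's allow-growth switch, set for example as an environment variable on the analysis containers (the snippet below is just the relevant part of a container spec with placeholder names; a hard per-process memory cap would still need to be configured in the TensorFlow code itself):

# Relevant excerpt of a container spec: ask TensorFlow to grow its GPU
# memory allocation on demand instead of reserving it all at startup.
containers:
- name: tf
  image: tensorflow/tensorflow:latest-gpu   # placeholder image
  env:
  - name: TF_FORCE_GPU_ALLOW_GROWTH
    value: "true"
  resources:
    limits:
      nvidia.com/gpu: 1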

elezar · Feb 06 '24 21:02

I've tested time-slicing with RHEL 9.3, MicroShift 4.15.z, a Standard NC4as T4 v3 VM (4 vCPUs, 28 GiB memory) and an NVIDIA TU104GL [Tesla T4] GPU, and it works as expected. I've just created PR #702 with a sample manifest.

@elezar does it make sense to have this PR accepted?

arthur-r-oliveira · May 10 '24 15:05

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] · Feb 11 '25 04:02