
gpu number cannot be used

Open Trainbow opened this issue 2 years ago • 11 comments

Trainbow avatar Jan 18 '23 01:01 Trainbow

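Hello, I am trying out GPU scheduling with volcano gpu-number. After installing it by following the steps in the Volcano tutorial, every node with GPUs correctly reports how many GPUs it has. However, when I create a pod, the container does not have the volcano-gpu-number environment variable, and running nvidia-smi inside it shows all of the node's GPUs. Do I need to change the YAML file?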

Trainbow avatar Jan 18 '23 01:01 Trainbow

Hey, which version are you using?

Thor-wl avatar Jan 19 '23 01:01 Thor-wl

volcano-1.6.0

Trainbow avatar Jan 28 '23 06:01 Trainbow

/cc @wangyang0616 Can you help take a look?

Thor-wl avatar Jan 29 '23 01:01 Thor-wl

ok, let me take a look

wangyang0616 avatar Jan 29 '23 01:01 wangyang0616

@Trainbow Could you post the YAML file used to create the test task? Also, can the pod be scheduled successfully with the default Kubernetes scheduler?

wangyang0616 avatar Jan 29 '23 01:01 wangyang0616

I used the sample YAML from the volcano gpu-number README.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  namespace: model
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/gpu-number: 1 # requesting 1 GPU card
          # nvidia.com/gpu: 1

I also installed NVIDIA's k8s-device-plugin for testing. When the limits field uses nvidia.com/gpu, the pod's container works well and has one GPU device. When I use volcano.sh/gpu-number, the container's environment does not have the variable VOLCANO_GPU_ALLOCATED, and NVIDIA_VISIBLE_DEVICES is all. I also tried GPU sharing with Volcano, following the official tutorial, and in that case I can find the corresponding environment variables in the pod.

Trainbow avatar Jan 29 '23 02:01 Trainbow

The Volcano device plugin's GPU strategy defaults to share mode, which means you can use volcano.sh/gpu-memory. To use volcano.sh/gpu-number, you need to set the GPU strategy to `number`. For details, see: config-the-volcano-device-plugin-binary

Hope the above information is helpful to you.

wangyang0616 avatar Mar 09 '23 08:03 wangyang0616
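
For anyone debugging the same symptom, here is a minimal sketch of checking from inside the container which of the environment variables mentioned in this thread are set. The helper names `gpu_env_report` and `looks_unrestricted` are hypothetical and not part of Volcano, and the script assumes a Python interpreter is available in the image, which may not be true for the CUDA image used above.

```python
import os

# Environment variable names taken from this thread.
GPU_ENV_VARS = ("VOLCANO_GPU_ALLOCATED", "NVIDIA_VISIBLE_DEVICES")


def gpu_env_report(environ=None):
    """Return {name: value or None} for the GPU-related variables (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in GPU_ENV_VARS}


def looks_unrestricted(report):
    """True when the container matches the symptom reported in this thread:
    VOLCANO_GPU_ALLOCATED is unset and NVIDIA_VISIBLE_DEVICES is "all"."""
    return (
        report["VOLCANO_GPU_ALLOCATED"] is None
        and report["NVIDIA_VISIBLE_DEVICES"] == "all"
    )


if __name__ == "__main__":
    report = gpu_env_report()
    for name, value in report.items():
        print(f"{name}={value if value is not None else '<unset>'}")
    if looks_unrestricted(report):
        # Assumption: this pattern suggests the plugin is still in share mode.
        print("All GPUs are visible; check the device plugin's GPU strategy.")
```

If the script reports the unrestricted pattern for a pod that requests volcano.sh/gpu-number, that is consistent with the device plugin still running in its default share mode, as described in the reply above.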