gpu-manager

When using a script to batch-start 20 pods that use vcuda, a few of them always fail

Open Tesladw opened this issue 3 years ago • 3 comments


The YAML template:

apiVersion: v1
kind: Pod
metadata:
  name: coda__NAME__
  annotations:
    tencent.com/vcuda-core-limit: "5"
spec:
  nodeSelector:
    nvidia-device-enable: enable
  restartPolicy: Always
  containers:
  - image: menghe.tencentcloudcr.com/public/tensorflow-gputest:0.2
    name: nvidia__NAME__
    command:
      - /bin/bash
      - -c
      # run the benchmark first; the original "sleep infinity && ..." ordering
      # never reaches the benchmark because sleep infinity never exits
      - "cd /data/tensorflow/alexnet && time python alexnet_benchmark.py && sleep infinity"
    resources:
      requests:
        tencent.com/vcuda-core: "1"
        tencent.com/vcuda-memory: "1"
      limits:
        tencent.com/vcuda-core: "1"
        tencent.com/vcuda-memory: "1"

The launch script:

#!/bin/bash
for i in {1..20}; do
    sed "s/__NAME__/${i}/g" gpu-tmpl.yaml | kubectl apply -f -
    # sleep 1
done

With the sleep 1 uncommented there is no problem; the failures only occur when the pods are started in a rapid batch.
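The symptom (rapid-fire creation fails, a 1-second stagger succeeds) is consistent with a race against an asynchronously refreshed cache. A minimal pure-Python sketch of that pattern, with all names hypothetical (this is not gpu-manager code):

```python
import threading

class LaggyCache:
    """Toy model of a watch-backed cache: a write only becomes
    visible after a propagation delay, like an informer cache that
    hears about a pod some time after the kubelet already sees it."""

    def __init__(self, lag=0.2):
        self._visible = set()
        self._lag = lag

    def add(self, key):
        # The write lands in the cache only after `lag` seconds.
        t = threading.Timer(self._lag, self._visible.add, args=(key,))
        t.start()
        return t

    def lookup(self, key):
        return key in self._visible

cache = LaggyCache(lag=0.2)

# Rapid-fire: look up each pod immediately after creating it,
# so the lookups race ahead of the cache and miss.
timers = [cache.add(f"pod-{i}") for i in range(5)]
rapid_misses = sum(not cache.lookup(f"pod-{i}") for i in range(5))

# Let the cache catch up, analogous to adding `sleep 1` per pod.
for t in timers:
    t.join()
staggered_misses = sum(not cache.lookup(f"pod-{i}") for i in range(5))

print(rapid_misses, staggered_misses)
```

Here the back-to-back lookups all race ahead of the delayed writes and miss, while the "staggered" pass, taken after the delay has elapsed, sees every pod.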

Tesladw avatar May 07 '21 03:05 Tesladw

gpu-manager uses a watch cache to find allocated pods, and the watcher is not notified as soon as the kubelet sees the pod.

mYmNeo avatar May 07 '21 08:05 mYmNeo

> gpu-manager uses a watch cache to find allocated pods, and the watcher is not notified as soon as the kubelet sees the pod.

Are there any workarounds for this problem?

lxm avatar May 07 '21 08:05 lxm

> gpu-manager uses a watch cache to find allocated pods, and the watcher is not notified as soon as the kubelet sees the pod.

> Are there any workarounds for this problem?

Currently no, because the notification is sent by the kube-apiserver.
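Since the lag originates at the kube-apiserver, the only client-side mitigation is to space out or retry creations. A hedged sketch of the retry pattern in pure Python (the `create_pod` callable below is a hypothetical stand-in for a `kubectl apply` invocation):

```python
import time

def with_retry(fn, attempts=5, delay=1.0):
    """Call fn until it succeeds, sleeping `delay` seconds between
    attempts; re-raise the last error if every attempt fails."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# Demo: a creation that fails twice (e.g. the allocator reads a
# stale cache) and then succeeds once the cache has caught up.
calls = {"n": 0}

def create_pod():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("vcuda allocation failed")  # simulated failure
    return "created"

result = with_retry(create_pod, attempts=5, delay=0.01)
print(result, calls["n"])  # -> created 3
```

The same idea applies to the launch script: re-apply any pod that fails to start, with a short backoff, instead of relying on a fixed sleep between creations.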

mYmNeo avatar May 08 '21 02:05 mYmNeo