gpu-manager
When using a script to batch-start 20 pods that use vcuda, some of them always fail.
YAML file:
apiVersion: v1
kind: Pod
metadata:
  name: coda__NAME__
  annotations:
    tencent.com/vcuda-core-limit: "5"
spec:
  nodeSelector:
    nvidia-device-enable: enable
  restartPolicy: Always
  containers:
  - image: menghe.tencentcloudcr.com/public/tensorflow-gputest:0.2
    name: nvidia__NAME__
    command:
    - /bin/bash
    - -c
    - "sleep infinity && cd /data/tensorflow/alexnet && time python alexnet_benchmark.py"
    resources:
      requests:
        tencent.com/vcuda-core: "1"
        tencent.com/vcuda-memory: "1"
      limits:
        tencent.com/vcuda-core: "1"
        tencent.com/vcuda-memory: "1"
Startup script:
#!/bin/bash
for i in {1..20}; do
  sed "s/__NAME__/${i}/g" gpu-tmpl.yaml | kubectl apply -f -
  #sleep 1
done
With sleep 1 added there is no problem; the failures only occur when the pods are started in a rapid batch.
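For reference, a paced variant of the launch script. This is only a sketch: the coda${i} pod name comes from the template above, the timeout value is arbitrary, and the readiness wait is an extra safeguard that is not part of the original script.

#!/bin/bash
# Paced launcher: create one pod at a time and wait until it is Ready
# before creating the next, instead of relying on a fixed sleep.
for i in {1..20}; do
  sed "s/__NAME__/${i}/g" gpu-tmpl.yaml | kubectl apply -f -
  # Give the cluster time to see the pod before the next one is created;
  # adjust the timeout to the environment.
  kubectl wait --for=condition=Ready "pod/coda${i}" --timeout=120s
done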
gpu-manager uses a watch cache to find allocated pods, and the watcher is not notified as soon as kubelet sees the pod.
Are there any workarounds to solve this problem?
Currently no, because the notification is sent by kube-apiserver.
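If pods that missed allocation need to be recovered on the user side, one option is to re-apply them after the batch finishes. A minimal sketch, assuming the affected pods end up in the Failed phase and keep the coda prefix from the template (neither detail is confirmed in this thread):

#!/bin/bash
# Re-create pods that failed to start, up to 3 attempts.
for attempt in {1..3}; do
  failed=$(kubectl get pods --field-selector=status.phase=Failed -o name | grep '^pod/coda' || true)
  [ -z "$failed" ] && break
  for p in $failed; do
    name=${p#pod/}   # e.g. pod/coda7 -> coda7
    i=${name#coda}   # recover the template index
    kubectl delete "$p" --wait=true
    sed "s/__NAME__/${i}/g" gpu-tmpl.yaml | kubectl apply -f -
    sleep 1          # pace the re-creations, as observed to help above
  done
done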