Task maxRetry does not work when using Horovod.
The YAML is here:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: aalllc
spec:
  maxRetry: 3
  minAvailable: 3
  plugins:
    ssh: []
    svc: []
  policies:
  - action: RestartJob
    event: PodEvicted
  queue: default
  schedulerName: volcano
  tasks:
  - maxRetry: 3
    minAvailable: 1
    name: master
    policies:
    - action: CompleteJob
      event: TaskCompleted
    - event: PodFailed
      action: RestartTask
      timeout: 30s
    replicas: 1
    template:
      metadata:
        labels:
          app: train-core
      spec:
        containers:
        - command:
          - /bin/sh
          - -c
          - "WORKER_HOST=`cat /etc/volcano/worker.host | awk 'ORS=\":1,\"'|head -c
            -1`;\nmkdir -p /var/run/sshd; /usr/sbin/sshd;\nCURRENT=$(date +\"%Y%m%d\")\n#CURRENT=20220217\necho
            $CURRENT\nTRAIN_COMMAND=\"horovodrun --log-level INFO -np 2 -H ${WORKER_HOST}
            \"\nTRAIN_COMMAND=$TRAIN_COMMAND\necho ${WORKER_HOST}\neval $TRAIN_COMMAND
            python /work/code/ocpx/cvr/yb_test.py --data_path /work/data/taobaolaunch/taobao_launch
            --featmap_path /work/data/taobaolaunch/taobao_launch/featMap/part-00000
            \ --checkpoint_path /work/model/taobao_launch_open_deepfm_cvr --epochs
            4 --end_delta 60\n \n"
          env:
          - name: CUDA_VISIBLE_DEVICES
            value: "-1"
          image: horovod-cpu:sklearn
          name: tensorflow
          ports:
          - containerPort: 22
            name: job-port
            protocol: TCP
          resources:
            limits:
              cpu: "12"
            requests:
              cpu: "1"
              memory: 1Gi
        nodeSelector:
          type: cpu
        restartPolicy: OnFailure
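The master's restart policy is OnFailure; I have left out the worker definition. Because the master has to wait until SSH on the workers is ready, restartPolicy must be set to OnFailure here (I have not found a better way yet). If the business code has a bug, the master will also restart indefinitely, so I would like maxRetry to take effect. However...
The master keeps restarting, and the job never fails.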
/assign
TaskFailedEvent is triggered when a pod's RestartCount reaches the task's maxRetry, but it seems that this event does not trigger any specific action. You may need to add a policy to make it take effect, e.g.
policies:
- action: TerminateJob
  event: TaskFailed
@Thor-wl I don't know what action should be taken when the task's maxRetry is reached. What is the original setting?
Yes, my final resolution was to do the same thing as you said.
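For anyone landing on this thread, here is a rough sketch of how that resolution might look in the master task of the manifest from the original report. It is only an illustration. It assumes the installed Volcano version raises the TaskFailed event once the pod's restart count reaches the task's maxRetry, as described in this thread, and that terminating the whole job is the desired reaction. Fields unrelated to the retry behaviour (containers, resources, nodeSelector) are omitted.

```yaml
# Sketch only: master task from the manifest in this issue, with a TaskFailed
# policy added. Not a complete Job spec.
tasks:
- name: master
  replicas: 1
  minAvailable: 1
  maxRetry: 3                # restart limit for this task, as in the original manifest
  policies:
  - event: TaskCompleted
    action: CompleteJob
  - event: PodFailed
    action: RestartTask
    timeout: 30s
  - event: TaskFailed        # per this thread: raised when RestartCount reaches maxRetry
    action: TerminateJob     # assumption: ending the job is the wanted outcome
  template:
    spec:
      restartPolicy: OnFailure   # kept so the master can retry while worker SSH comes up
      # containers, nodeSelector, etc. unchanged from the original manifest
```

The only change from the reported manifest is the extra TaskFailed entry. Without it, nothing in the original policies reacts to that event, which matches the endless restarts reported here.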
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗