Task maxRetry does not work when using Horovod.

Open whybeyoung opened this issue 3 years ago • 2 comments

The YAML is here:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: aalllc
spec:
  maxRetry: 3
  minAvailable: 3
  plugins:
    ssh: []
    svc: []
  policies:
  - action: RestartJob
    event: PodEvicted
  queue: default
  schedulerName: volcano
  tasks:
  - maxRetry: 3
    minAvailable: 1
    name: master
    policies:
    - action: CompleteJob
      event: TaskCompleted
    - event: PodFailed
      action: RestartTask
      timeout: 30s
    replicas: 1
    template:
      metadata:
        labels:
          app: train-core
      spec:
        containers:
        - command:
          - /bin/sh
          - -c
          - "WORKER_HOST=`cat /etc/volcano/worker.host | awk 'ORS=\":1,\"'|head -c
            -1`;\nmkdir -p /var/run/sshd; /usr/sbin/sshd;\nCURRENT=$(date +\"%Y%m%d\")\n#CURRENT=20220217\necho
            $CURRENT\nTRAIN_COMMAND=\"horovodrun --log-level INFO  -np 2 -H ${WORKER_HOST}
            \"\nTRAIN_COMMAND=$TRAIN_COMMAND\necho ${WORKER_HOST}\neval $TRAIN_COMMAND
            python /work/code/ocpx/cvr/yb_test.py --data_path /work/data/taobaolaunch/taobao_launch
            --featmap_path /work/data/taobaolaunch/taobao_launch/featMap/part-00000
            \ --checkpoint_path /work/model/taobao_launch_open_deepfm_cvr --epochs
            4 --end_delta 60\n  \n"
          env:
          - name: CUDA_VISIBLE_DEVICES
            value: "-1"
          image: horovod-cpu:sklearn
          name: tensorflow
          ports:
          - containerPort: 22
            name: job-port
            protocol: TCP
          resources:
            limits:
              cpu: "12"
            requests:
              cpu: "1"
              memory: 1Gi
        nodeSelector:
          type: cpu
        restartPolicy: OnFailure

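The master's restartPolicy is OnFailure; I have not pasted the worker definition. Because the master has to wait until SSH on the workers is ready, it must be set to OnFailure here (I have not found a better way yet). If the business code has a bug, the master will also restart indefinitely, so I want maxRetry to take effect. However...

[image] The master keeps restarting, and the job never fails.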

whybeyoung avatar Jul 08 '22 03:07 whybeyoung

/assign

Thor-wl avatar Jul 08 '22 06:07 Thor-wl

TaskFailedEvent is triggered when a pod's RestartCount reaches the task's maxRetry, but it seems that this event does not trigger any specific action on its own. You may need to add a policy for it to take effect, e.g.

policies:
    - action: TerminateJob
      event: TaskFailed

@Thor-wl I don't know what action should be taken when the task's maxRetry is reached. What is the original setting?

HecarimV avatar Jul 28 '22 06:07 HecarimV


Yes, my final resolution was to do the same thing you suggested.
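For readers who land here, a minimal sketch of how the master task's policies from the job above might look with the suggested policy added. The thread does not say whether the policy was placed at task level or job level, so the task-level placement here is an assumption; the pod template is omitted and all other fields are copied from the original manifest.

```yaml
# Sketch only: fragment of spec.tasks from the job above, trimmed to the
# retry-related fields. The pod template is omitted for brevity.
tasks:
- name: master
  replicas: 1
  minAvailable: 1
  maxRetry: 3
  policies:
  - event: TaskCompleted
    action: CompleteJob
  - event: PodFailed
    action: RestartTask
    timeout: 30s
  # Added per the suggestion in this thread. Placing it at task level
  # (rather than under spec.policies) is an assumption.
  - event: TaskFailed
    action: TerminateJob
```

With this in place, the intent is that once the pod's RestartCount reaches maxRetry and TaskFailed fires, the job is terminated instead of restarting indefinitely.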

whybeyoung avatar Aug 12 '22 01:08 whybeyoung

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Nov 12 '22 05:11 stale[bot]

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

stale[bot] avatar Jan 22 '23 08:01 stale[bot]