activeDeadlineSeconds not working in release v0.2.3

Open merryzhou opened this issue 5 years ago • 5 comments

spec.runPolicy.activeDeadlineSeconds was set to 300 seconds, but the MPIJob ran for over 22 minutes and was not terminated.

From reading the code, I found that activeDeadlineSeconds is not processed by mpi-operator. Is this a bug?

# mpijob:
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: mpijob-v1-3
  namespace: default
spec:
    ...
  runPolicy:
    activeDeadlineSeconds: 300
    backoffLimit: 2

# mpijob status
status:
  conditions:
  - lastTransitionTime: "2020-07-06T05:15:20Z"
    lastUpdateTime: "2020-07-06T05:15:20Z"
    message: MPIJob default/mpijob-v1-3 is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2020-07-06T05:59:47Z"
    lastUpdateTime: "2020-07-06T05:59:47Z"
    message: MPIJob default/mpijob-v1-3 is running.
    reason: MPIJobRunning
    status: "True"
    type: Running
  replicaStatuses:
    Launcher:
      active: 1
    Worker:
      active: 1

# running pod
NAME                   READY   STATUS    RESTARTS   AGE
mpijob-v1-3-launcher   1/1     Running   0          22m
mpijob-v1-3-worker-0   1/1     Running   0          22m
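
For reference, the check I expected can be sketched as below. This is only an illustration and not mpi-operator code: the helper names `parse_k8s_time` and `active_deadline_exceeded` are hypothetical, and measuring the deadline from the Running transition time is an assumption (a controller could equally measure from a start time recorded in the status).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_k8s_time(value: str) -> datetime:
    """Parse a UTC timestamp in the form used by the status fields above."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def active_deadline_exceeded(
    start_time: str,
    active_deadline_seconds: Optional[int],
    now: datetime,
) -> bool:
    """Return True once the job has been active for at least the deadline.

    Hypothetical helper: an unset deadline means "no limit".
    """
    if active_deadline_seconds is None:
        return False
    elapsed = now - parse_k8s_time(start_time)
    return elapsed >= timedelta(seconds=active_deadline_seconds)


# Values taken from the status above: Running since 05:59:47, pods aged 22m.
running_since = "2020-07-06T05:59:47Z"
now = parse_k8s_time(running_since) + timedelta(minutes=22)
print(active_deadline_exceeded(running_since, 300, now))  # prints True
```

With a 300 second deadline and 22 minutes elapsed, such a check would report the job as overdue, which is why I expected it to be terminated.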

merryzhou • Jul 06 '20 07:07

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.89

issue-label-bot[bot] • Jul 06 '20 07:07

The v1 controller no longer uses this field (removed in https://github.com/kubeflow/mpi-operator/pull/203). cc @carmark

terrytangyuan • Jul 06 '20 14:07

@merryzhou Yes, this field was removed in v1, since there is no launcher job in the v1 controller. Could you please describe why you need it?

carmark • Jul 07 '20 01:07

Actually, I want a field that indicates "the maximum time for an MPIJob to progress to Running status before it is considered failed", a bit like ProgressDeadlineSeconds in Deployment.Spec.

Currently, unexpected errors such as a pod stuck in ImagePullBackOff leave the MPIJob in Created status indefinitely, so I think it would be better to have a mechanism that marks an MPIJob as failed when it cannot start running because of such conditions.

# pod
NAME                              READY   STATUS             RESTARTS   AGE
mpijob-v1-2-launcher              0/1     Init:0/1           0          11m
mpijob-v1-2-worker-0              0/1     ImagePullBackOff   0          11m

# mpijob status:
  status:
    conditions:
    - lastTransitionTime: "2020-07-07T07:20:57Z"
      lastUpdateTime: "2020-07-07T07:20:57Z"
      message: MPIJob default/mpijob-v1-2 is created.
      reason: MPIJobCreated
      status: "True"
      type: Created
    replicaStatuses:
      Launcher: {}
      Worker: {}
    startTime: "2020-07-07T07:20:57Z"
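
To make the request concrete, here is a minimal sketch of the kind of check I have in mind. It is an assumption-laden illustration, not an existing mpi-operator API: `progress_deadline_exceeded`, `has_condition`, and the `progress_deadline_seconds` parameter are hypothetical names, and the condition types other than Created and Running (Succeeded, Failed) are assumed.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value: str) -> datetime:
    """Parse a UTC timestamp in the form used by the status fields above."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def has_condition(status: dict, cond_type: str) -> bool:
    """True if the status carries a condition of this type set to "True"."""
    return any(
        c.get("type") == cond_type and c.get("status") == "True"
        for c in status.get("conditions", [])
    )


def progress_deadline_exceeded(
    status: dict, progress_deadline_seconds: int, now: datetime
) -> bool:
    """True if the job never reached Running within the deadline.

    Jobs that are already Running or finished are never flagged.
    """
    if any(has_condition(status, t) for t in ("Running", "Succeeded", "Failed")):
        return False
    start = status.get("startTime")
    if start is None:
        return False  # nothing to measure from yet
    return now - parse_k8s_time(start) >= timedelta(seconds=progress_deadline_seconds)


# The stuck job above: only a Created condition, pods aged 11m.
status = {
    "conditions": [{"type": "Created", "status": "True"}],
    "replicaStatuses": {"Launcher": {}, "Worker": {}},
    "startTime": "2020-07-07T07:20:57Z",
}
now = parse_k8s_time("2020-07-07T07:31:57Z")  # 11 minutes after startTime
print(progress_deadline_exceeded(status, 600, now))  # prints True
```

With a hypothetical 600 second progress deadline, the job above would be flagged after 11 minutes in Created status, and the controller could then set it to a failed state.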

merryzhou • Jul 07 '20 07:07

@merryzhou Thanks for the information. I will file a PR to support it.

carmark • Jul 07 '20 07:07

What's the status of this issue?

yzhao-2023 • Dec 10 '23 06:12

Please upgrade to v0.3 or v0.4.

/close

alculquicondor • Dec 11 '23 14:12

@alculquicondor: Closing this issue.

google-oss-prow[bot] • Dec 11 '23 14:12