training-operator training_client.wait_for_job_condition throws runtime error when job has failed

This code is incorrect: sdk/python/kubeflow/training/api/training_client.py

A RuntimeError is always thrown if the job is failed. (This is not described in the doc, and not what the comment implies)

constants.JOB_CONDITION_FAILED is a string and conditions is a set of complex objects. It should be looking at expected_conditions.

In my case, I was waiting for CREATED, which happened before failure, so I was not expecting an exception at all. But I got one because the job failed before the CREATED was polled.

Doc string does not describe this behavior.

`
def wait_for_job_conditions( self, name: str, namespace: str = utils.get_default_target_namespace(), job_kind: str = constants.TFJOB_KIND, expected_conditions: Set = {constants.JOB_CONDITION_SUCCEEDED}, timeout: int = 600, polling_interval: int = 15, callback: Callable = None, apiserver_timeout: int = constants.DEFAULT_TIMEOUT, ): """Wait until Training Job reaches any of the specified conditions. By default it waits for the Succeeded condition.

[....]

# Raise an exception if Job is Failed and Failed is not expected condition. if ( constants.JOB_CONDITION_FAILED not in conditions and utils.has_condition(conditions, constants.JOB_CONDITION_FAILED) ): raise RuntimeError( f"{job_kind} {namespace}/{name} is Failed. " f"{job_kind} conditions: {job.status.conditions}" ) `

May 12 '23 19:05 ntl-ibm

@ntl-ibm Thank you for creating this issue.

Maybe, you can get all conditions if Jobs go to Failed in the following:

try:
  client.wait_for_job_conditions(name, namespace, job_kind, {constants.JOB_CONDITION_CREATED})
except:
  conditions = client.get_job_conditions(name, namespace, job_kind)
  print(conditions)

However, as you say, we shouldn't implicit this specification. We should add this to the docstring. cc: @andreyvelich

May 12 '23 21:05 tenzen-y

yep....I worked around the issue. I'm reporting it so that it can be fixed in the future. I'll let you determine the priority.

Thanks for the quick response!

May 12 '23 21:05 ntl-ibm

yep....I worked around the issue. I'm reporting it so that it can be fixed in the future. I'll let you determine the priority.

Thanks for the quick response!

We always welcome your contribution! Thank you for reporting this 👍

May 12 '23 22:05 tenzen-y

@ntl-ibm It would be great if you can contribute

May 17 '23 11:05 johnugeorge

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Aug 23 '23 20:08 github-actions[bot]

/remove-lifecycle stale

Aug 24 '23 13:08 tenzen-y

/help /good-first-issue

Aug 24 '23 13:08 tenzen-y

@tenzen-y: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to this:

/help /good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Aug 24 '23 13:08 google-oss-prow[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Jan 07 '24 05:01 github-actions[bot]

/lifecycle frozen

Jan 25 '24 17:01 tenzen-y

Hi, is this issue still relevant or been fixed already . What's the meaning of lifecycle/frozen? @tenzen-y

Mar 12 '24 10:03 matrixbot123

Hi, is this issue still relevant or been fixed already . What's the meaning of lifecycle/frozen? @tenzen-y

This issue isn’t yet resolved. The frozen mean that we wait for some contributors.

Mar 12 '24 10:03 tenzen-y

@tenzen-y and @johnugeorge From what I have gathered line 848 in training_client.py should read

constants.JOB_CONDITION_FAILED not in expected_conditions

I am not sure if we really need to change the docstring as lines 788 and 789 describe the default behaviour

        """Wait until Training Job reaches any of the specified conditions.
        By default it waits for the Succeeded condition.

and line 813 describes the expected exception when the job fails.

RuntimeError: Failed to get Job or Job reaches unexpected Failed condition.

Please let me know your feedback and I can then create a PR to make the changes.

Apr 03 '24 02:04 vector-flow

/assign

Apr 03 '24 02:04 vector-flow

@tenzen-y and @johnugeorge From what I have gathered line 848 in training_client.py should read

constants.JOB_CONDITION_FAILED not in expected_conditions

I am not sure if we really need to change the docstring as lines 788 and 789 describe the default behaviour
        """Wait until Training Job reaches any of the specified conditions.
        By default it waits for the Succeeded condition.
and line 813 describes the expected exception when the job fails.

RuntimeError: Failed to get Job or Job reaches unexpected Failed condition.

Please let me know your feedback and I can then create a PR to make the changes.

@tariq-hasan Thank you for investigating this issue. It seems that you have a point. Now, let me close this issue. Thanks again.

Jul 08 '24 16:07 tenzen-y

/close

Jul 08 '24 16:07 tenzen-y

@tenzen-y: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Jul 08 '24 16:07 google-oss-prow[bot]