training-operator icon indicating copy to clipboard operation
training-operator copied to clipboard

training_client.wait_for_job_condition throws runtime error when job has failed

Open ntl-ibm opened this issue 2 years ago • 14 comments

This code is incorrect: sdk/python/kubeflow/training/api/training_client.py

A RuntimeError is always thrown if the job is failed. (This is not described in the doc, and not what the comment implies)

constants.JOB_CONDITION_FAILED is a string and conditions is a set of complex objects. It should be looking at expected_conditions.

In my case, I was waiting for CREATED, which happened before failure, so I was not expecting an exception at all. But I got one because the job failed before the CREATED was polled.

Doc string does not describe this behavior.

`
def wait_for_job_conditions( self, name: str, namespace: str = utils.get_default_target_namespace(), job_kind: str = constants.TFJOB_KIND, expected_conditions: Set = {constants.JOB_CONDITION_SUCCEEDED}, timeout: int = 600, polling_interval: int = 15, callback: Callable = None, apiserver_timeout: int = constants.DEFAULT_TIMEOUT, ): """Wait until Training Job reaches any of the specified conditions. By default it waits for the Succeeded condition.

[....]

# Raise an exception if Job is Failed and Failed is not expected condition. if ( constants.JOB_CONDITION_FAILED not in conditions and utils.has_condition(conditions, constants.JOB_CONDITION_FAILED) ): raise RuntimeError( f"{job_kind} {namespace}/{name} is Failed. " f"{job_kind} conditions: {job.status.conditions}" ) `

ntl-ibm avatar May 12 '23 19:05 ntl-ibm

@ntl-ibm Thank you for creating this issue.

Maybe, you can get all conditions if Jobs go to Failed in the following:

try:
  client.wait_for_job_conditions(name, namespace, job_kind, {constants.JOB_CONDITION_CREATED})
except:
  conditions = client.get_job_conditions(name, namespace, job_kind)
  print(conditions)

However, as you say, we shouldn't implicit this specification. We should add this to the docstring. cc: @andreyvelich

tenzen-y avatar May 12 '23 21:05 tenzen-y

yep....I worked around the issue. I'm reporting it so that it can be fixed in the future. I'll let you determine the priority.

Thanks for the quick response!

ntl-ibm avatar May 12 '23 21:05 ntl-ibm

yep....I worked around the issue. I'm reporting it so that it can be fixed in the future. I'll let you determine the priority.

Thanks for the quick response!

We always welcome your contribution! Thank you for reporting this 👍

tenzen-y avatar May 12 '23 22:05 tenzen-y

@ntl-ibm It would be great if you can contribute

johnugeorge avatar May 17 '23 11:05 johnugeorge

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 23 '23 20:08 github-actions[bot]

/remove-lifecycle stale

tenzen-y avatar Aug 24 '23 13:08 tenzen-y

/help /good-first-issue

tenzen-y avatar Aug 24 '23 13:08 tenzen-y

@tenzen-y: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to this:

/help /good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Aug 24 '23 13:08 google-oss-prow[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jan 07 '24 05:01 github-actions[bot]

/lifecycle frozen

tenzen-y avatar Jan 25 '24 17:01 tenzen-y

Hi, is this issue still relevant or been fixed already . What's the meaning of lifecycle/frozen? @tenzen-y

matrixbot123 avatar Mar 12 '24 10:03 matrixbot123

Hi, is this issue still relevant or been fixed already . What's the meaning of lifecycle/frozen? @tenzen-y

This issue isn’t yet resolved. The frozen mean that we wait for some contributors.

tenzen-y avatar Mar 12 '24 10:03 tenzen-y

@tenzen-y and @johnugeorge From what I have gathered line 848 in training_client.py should read

constants.JOB_CONDITION_FAILED not in expected_conditions

I am not sure if we really need to change the docstring as lines 788 and 789 describe the default behaviour

        """Wait until Training Job reaches any of the specified conditions.
        By default it waits for the Succeeded condition.

and line 813 describes the expected exception when the job fails.

RuntimeError: Failed to get Job or Job reaches unexpected Failed condition.

Please let me know your feedback and I can then create a PR to make the changes.

vector-flow avatar Apr 03 '24 02:04 vector-flow

/assign

vector-flow avatar Apr 03 '24 02:04 vector-flow

@tenzen-y and @johnugeorge From what I have gathered line 848 in training_client.py should read

constants.JOB_CONDITION_FAILED not in expected_conditions

I am not sure if we really need to change the docstring as lines 788 and 789 describe the default behaviour

        """Wait until Training Job reaches any of the specified conditions.
        By default it waits for the Succeeded condition.

and line 813 describes the expected exception when the job fails.

RuntimeError: Failed to get Job or Job reaches unexpected Failed condition.

Please let me know your feedback and I can then create a PR to make the changes.

@tariq-hasan Thank you for investigating this issue. It seems that you have a point. Now, let me close this issue. Thanks again.

tenzen-y avatar Jul 08 '24 16:07 tenzen-y

/close

tenzen-y avatar Jul 08 '24 16:07 tenzen-y

@tenzen-y: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

google-oss-prow[bot] avatar Jul 08 '24 16:07 google-oss-prow[bot]