training_client.wait_for_job_condition throws runtime error when job has failed
This code is incorrect: sdk/python/kubeflow/training/api/training_client.py
A RuntimeError is always thrown if the job is failed. (This is not described in the doc, and not what the comment implies)
constants.JOB_CONDITION_FAILED is a string and conditions is a set of complex objects. It should be looking at expected_conditions.
In my case, I was waiting for CREATED, which happened before failure, so I was not expecting an exception at all. But I got one because the job failed before the CREATED was polled.
Doc string does not describe this behavior.
`
def wait_for_job_conditions(
self,
name: str,
namespace: str = utils.get_default_target_namespace(),
job_kind: str = constants.TFJOB_KIND,
expected_conditions: Set = {constants.JOB_CONDITION_SUCCEEDED},
timeout: int = 600,
polling_interval: int = 15,
callback: Callable = None,
apiserver_timeout: int = constants.DEFAULT_TIMEOUT,
):
"""Wait until Training Job reaches any of the specified conditions.
By default it waits for the Succeeded condition.
[....]
# Raise an exception if Job is Failed and Failed is not expected condition. if ( constants.JOB_CONDITION_FAILED not in conditions and utils.has_condition(conditions, constants.JOB_CONDITION_FAILED) ): raise RuntimeError( f"{job_kind} {namespace}/{name} is Failed. " f"{job_kind} conditions: {job.status.conditions}" ) `
@ntl-ibm Thank you for creating this issue.
Maybe, you can get all conditions if Jobs go to Failed in the following:
try:
client.wait_for_job_conditions(name, namespace, job_kind, {constants.JOB_CONDITION_CREATED})
except:
conditions = client.get_job_conditions(name, namespace, job_kind)
print(conditions)
However, as you say, we shouldn't implicit this specification. We should add this to the docstring. cc: @andreyvelich
yep....I worked around the issue. I'm reporting it so that it can be fixed in the future. I'll let you determine the priority.
Thanks for the quick response!
yep....I worked around the issue. I'm reporting it so that it can be fixed in the future. I'll let you determine the priority.
Thanks for the quick response!
We always welcome your contribution! Thank you for reporting this 👍
@ntl-ibm It would be great if you can contribute
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
/help /good-first-issue
@tenzen-y: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.
In response to this:
/help /good-first-issue
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen
Hi, is this issue still relevant or been fixed already . What's the meaning of lifecycle/frozen? @tenzen-y
Hi, is this issue still relevant or been fixed already . What's the meaning of lifecycle/frozen? @tenzen-y
This issue isn’t yet resolved. The frozen mean that we wait for some contributors.
@tenzen-y and @johnugeorge From what I have gathered line 848 in training_client.py should read
constants.JOB_CONDITION_FAILED not in expected_conditions
I am not sure if we really need to change the docstring as lines 788 and 789 describe the default behaviour
"""Wait until Training Job reaches any of the specified conditions.
By default it waits for the Succeeded condition.
and line 813 describes the expected exception when the job fails.
RuntimeError: Failed to get Job or Job reaches unexpected Failed condition.
Please let me know your feedback and I can then create a PR to make the changes.
/assign
@tenzen-y and @johnugeorge From what I have gathered line 848 in training_client.py should read
constants.JOB_CONDITION_FAILED not in expected_conditionsI am not sure if we really need to change the docstring as lines 788 and 789 describe the default behaviour
"""Wait until Training Job reaches any of the specified conditions. By default it waits for the Succeeded condition.and line 813 describes the expected exception when the job fails.
RuntimeError: Failed to get Job or Job reaches unexpected Failed condition.Please let me know your feedback and I can then create a PR to make the changes.
@tariq-hasan Thank you for investigating this issue. It seems that you have a point. Now, let me close this issue. Thanks again.
/close
@tenzen-y: Closing this issue.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.