azure-sdk-for-python icon indicating copy to clipboard operation
azure-sdk-for-python copied to clipboard

Azure AI ML: Job ignores timeout value

Open davystrong opened this issue 2 years ago • 4 comments

  • Package Name: azure-ai-ml
  • Package Version: 0.1.0b6
  • Operating System: macOS 12.5
  • Python Version: 3.8.12

Describe the bug I have an Azure ML pipeline job defined in YAML. It consists of a single parallel run step which has a timeout of 1800 seconds. When I create a job with the not-even-public-preview CLI v2, it works as expected. However, when I submit it using the Python SDK v2, using either azure.ai.ml.load_job() or manually defining the pipeline in Python code, the timeout is ignored and each call to run() is cancelled after 90 seconds (the default).

To Reproduce Steps to reproduce the behavior:

  1. Create a pipeline with a parallel run step which lasts more than 90 seconds
  2. Submit the pipeline job using ml_client.jobs.create_or_update()
  3. Check the Portal logs to see it fail

Expected behavior The timeout should be set to 1800 seconds.

Additional context If I print the object I pass to ml_client.jobs.create_or_update() it includes a line with timeout: 1800. If I open logs/sys/error/0/process000.txt in the portal logs it includes "run_method_duration": 90.25895500183105, "status": "RUN_TIMEOUT".

davystrong avatar Aug 11 '22 19:08 davystrong

Label prediction was below confidence level 0.6 for Model:ServiceLabels: 'Data Factory:0.49535987,Event Hubs:0.049816728,Cognitive - Form Recognizer:0.022198211'

azure-sdk avatar Aug 11 '22 19:08 azure-sdk

Thank you for your feedback. This has been routed to the support team for assistance.

ghost avatar Aug 11 '22 20:08 ghost

@azureml-github

xiangyan99 avatar Aug 11 '22 20:08 xiangyan99

One further detail: if I submit the run via the CLI, open it in the portal and got to Parameters>Run settings, it says that "Run invocation timeout" is 1800. If I submit it via the SDK it says it's 60, with and identical config. Not sure why it says 60 there and only fails at 90 seconds, but anyway it's wrong.

davystrong avatar Aug 12 '22 15:08 davystrong

@davystrong For azure.ai.ml.load_job(), please use the yaml with this lines: args: >- --run_invocation_timeout 1800

bupt-wenxiaole avatar Aug 15 '22 06:08 bupt-wenxiaole

@davystrong for the "Not sure why it says 60 there and only fails at 90 seconds", it is expected, Parallel Run Step run time has a default system overhead, which is set to 30s, so the run failed at 30 + 60 = 90s.

bupt-wenxiaole avatar Aug 15 '22 07:08 bupt-wenxiaole

@bupt-wenxiaole, thanks for the suggestion! Where exactly should I put that? If I add it to the root of my config it throws a validation exception; if I add it to the parallel job section or add --run_invocation_timeout 1800 to my arguments in the parallel job component (which I create separately using the Python SDK), the "Run invocation timeout" value in the parameters section doesn't change.

davystrong avatar Aug 15 '22 12:08 davystrong

@davystrong , please upgrade the latest azure-ai-ml package and reference this example, and notes that add the --run_invocation_timeout 1800 to the "program_arguments".

bupt-wenxiaole avatar Aug 30 '22 03:08 bupt-wenxiaole

Hi @davystrong. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.

ghost avatar Aug 30 '22 15:08 ghost

Hi @davystrong, since you haven’t asked that we “/unresolve” the issue, we’ll close this out. If you believe further discussion is needed, please add a comment “/unresolve” to reopen the issue.

ghost avatar Sep 06 '22 16:09 ghost