azure-sdk-for-python Azure AI ML: Job ignores timeout value

Package Name: azure-ai-ml
Package Version: 0.1.0b6
Operating System: macOS 12.5
Python Version: 3.8.12

Describe the bug: I have an Azure ML pipeline job defined in YAML. It consists of a single parallel run step which has a timeout of 1800 seconds. When I create a job with the not-even-public-preview CLI v2, it works as expected. However, when I submit it using the Python SDK v2, using either azure.ai.ml.load_job() or manually defining the pipeline in Python code, the timeout is ignored and each call to run() is cancelled after 90 seconds (the default).

To Reproduce (steps to reproduce the behavior):

1. Create a pipeline with a parallel run step which lasts more than 90 seconds
2. Submit the pipeline job using ml_client.jobs.create_or_update()
3. Check the Portal logs to see it fail

Expected behavior: The timeout should be set to 1800 seconds.

Additional context: If I print the object I pass to ml_client.jobs.create_or_update(), it includes a line with timeout: 1800. If I open logs/sys/error/0/process000.txt in the portal logs, it includes "run_method_duration": 90.25895500183105, "status": "RUN_TIMEOUT".

Aug 11 '22 19:08 davystrong

Label prediction was below confidence level 0.6 for Model:ServiceLabels: 'Data Factory:0.49535987,Event Hubs:0.049816728,Cognitive - Form Recognizer:0.022198211'

Aug 11 '22 19:08 azure-sdk

Thank you for your feedback. This has been routed to the support team for assistance.

Aug 11 '22 20:08 ghost

@azureml-github

Aug 11 '22 20:08 xiangyan99

One further detail: if I submit the run via the CLI, open it in the portal and go to Parameters>Run settings, it says that "Run invocation timeout" is 1800. If I submit it via the SDK it says it's 60, with an identical config. Not sure why it says 60 there and only fails at 90 seconds, but either way it's wrong.

Aug 12 '22 15:08 davystrong

@davystrong For azure.ai.ml.load_job(), please use a YAML file containing these lines:

args: >-
  --run_invocation_timeout 1800

Aug 15 '22 06:08 bupt-wenxiaole

@davystrong Regarding "Not sure why it says 60 there and only fails at 90 seconds": this is expected. The Parallel Run Step runtime has a default system overhead of 30 s, so the run failed at 30 + 60 = 90 s.
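The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the 30-second overhead and the 60-second value come from this thread, the name `effective_cutoff` is made up here and is not part of azure-ai-ml, and the assumption that the same overhead is added to other timeout values (such as 1800) is not confirmed anywhere in the thread.

```python
# Hypothetical helper for illustration only; not an azure-ai-ml API.
SYSTEM_OVERHEAD_SECONDS = 30  # default overhead quoted in this thread


def effective_cutoff(run_invocation_timeout: int,
                     overhead: int = SYSTEM_OVERHEAD_SECONDS) -> int:
    """Approximate wall-clock seconds after which a run() call is cancelled.

    Assumes the overhead is simply added to the configured timeout.
    """
    return run_invocation_timeout + overhead


# 60 is the value the portal showed for the SDK submission;
# 1800 is the value that was intended.
print(effective_cutoff(60))
print(effective_cutoff(1800))
```

Under that assumption, a "run_method_duration" of roughly 90 seconds in the logs is consistent with a configured timeout of 60 rather than 1800.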

Aug 15 '22 07:08 bupt-wenxiaole

@bupt-wenxiaole, thanks for the suggestion! Where exactly should I put that? If I add it to the root of my config it throws a validation exception; if I add it to the parallel job section or add --run_invocation_timeout 1800 to my arguments in the parallel job component (which I create separately using the Python SDK), the "Run invocation timeout" value in the parameters section doesn't change.

Aug 15 '22 12:08 davystrong

@davystrong, please upgrade to the latest azure-ai-ml package and refer to this example. Note that --run_invocation_timeout 1800 should be added to the "program_arguments".
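For anyone building the arguments string in Python, here is a minimal sketch of making sure the flag ends up in it exactly once. `with_invocation_timeout` is a hypothetical helper written for illustration, not an azure-ai-ml API, and `--batch_size 10` is a made-up existing argument; only the flag name `--run_invocation_timeout` and the "program_arguments" setting come from the comment above. The sketch assumes the arguments form a single whitespace-separated string with no quoted values, and that the flag and its value are separate tokens.

```python
# Hypothetical helper for illustration only; not an azure-ai-ml API.
FLAG = "--run_invocation_timeout"  # flag name taken from this thread


def with_invocation_timeout(program_arguments: str, seconds: int) -> str:
    """Return the arguments string with the timeout flag set exactly once.

    Any existing occurrence of the flag (and the value after it) is
    dropped, then the flag is appended with the requested value.
    """
    cleaned = []
    skip_next = False
    for token in program_arguments.split():
        if skip_next:
            # This token is the value of a previously seen timeout flag.
            skip_next = False
            continue
        if token == FLAG:
            skip_next = True
            continue
        cleaned.append(token)
    cleaned += [FLAG, str(seconds)]
    return " ".join(cleaned)


# "--batch_size 10" is a hypothetical pre-existing argument.
print(with_invocation_timeout("--batch_size 10", 1800))
```

The resulting string would then be supplied as the "program_arguments" value of the parallel step; how that value is passed depends on the installed SDK version, so check it against the referenced example.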

Aug 30 '22 03:08 bupt-wenxiaole

Hi @davystrong. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.

Aug 30 '22 15:08 ghost

Hi @davystrong, since you haven’t asked that we “/unresolve” the issue, we’ll close this out. If you believe further discussion is needed, please add a comment “/unresolve” to reopen the issue.

Sep 06 '22 16:09 ghost
