azure-sdk-for-python
Azure AI ML: Job ignores timeout value
- Package Name: azure-ai-ml
- Package Version: 0.1.0b6
- Operating System: macOS 12.5
- Python Version: 3.8.12
Describe the bug
I have an Azure ML pipeline job defined in YAML. It consists of a single parallel run step with a timeout of 1800 seconds. When I create the job with the not-even-public-preview CLI v2, it works as expected. However, when I submit it using the Python SDK v2, either via `azure.ai.ml.load_job()` or by defining the pipeline manually in Python code, the timeout is ignored and each call to `run()` is cancelled after 90 seconds (the default).
To Reproduce
Steps to reproduce the behavior:
- Create a pipeline with a parallel run step that runs for more than 90 seconds
- Submit the pipeline job using `ml_client.jobs.create_or_update()` (as in the sketch below)
- Check the portal logs to see it fail
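For reference, a minimal sketch of the SDK submission path described above, assuming placeholder workspace details and a placeholder `pipeline.yml` file name (neither is taken from the issue):

```python
# Minimal sketch: workspace identifiers and pipeline.yml are placeholders.
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, load_job

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Load the YAML pipeline definition that declares the parallel step's timeout.
# Printing the loaded job lets you confirm the serialized timeout value,
# as noted in the original report.
pipeline_job = load_job("pipeline.yml")
print(pipeline_job)

# Submitting through the SDK is where the reported 90-second cutoff appears.
submitted = ml_client.jobs.create_or_update(pipeline_job)
print(submitted.name)
```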
Expected behavior
The timeout should be set to 1800 seconds.
Additional context
If I print the object I pass to `ml_client.jobs.create_or_update()`, it includes a line with `timeout: 1800`. If I open `logs/sys/error/0/process000.txt` in the portal logs, it includes `"run_method_duration": 90.25895500183105, "status": "RUN_TIMEOUT"`.
Thank you for your feedback. This has been routed to the support team for assistance.
@azureml-github
One further detail: if I submit the run via the CLI, open it in the portal, and go to Parameters > Run settings, it says that "Run invocation timeout" is 1800. If I submit it via the SDK, it says it's 60, with an identical config. Not sure why it says 60 there and only fails at 90 seconds, but either way it's wrong.
@davystrong For `azure.ai.ml.load_job()`, please add these lines to the YAML: `args: >- --run_invocation_timeout 1800`
@davystrong Regarding "Not sure why it says 60 there and only fails at 90 seconds": this is expected. Parallel Run Step adds a default system overhead of 30s on top of the run invocation timeout, so the run failed at 30 + 60 = 90s.
@bupt-wenxiaole, thanks for the suggestion! Where exactly should I put that? If I add it to the root of my config, it throws a validation exception; if I add it to the parallel job section, or add `--run_invocation_timeout 1800` to my arguments in the parallel job component (which I create separately using the Python SDK), the "Run invocation timeout" value in the parameters section doesn't change.
@davystrong, please upgrade to the latest azure-ai-ml package and reference this example, and note that you add `--run_invocation_timeout 1800` to the `program_arguments`.
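A hedged sketch of that suggestion, assuming the parallel component is built with `parallel_run_function`; the names, paths, environment, and data types below are illustrative placeholders, not taken from this thread:

```python
# Sketch only: names, paths, environment, and asset types are placeholders.
# The relevant part is passing --run_invocation_timeout via program_arguments.
from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.parallel import parallel_run_function, RunFunction

parallel_step = parallel_run_function(
    name="long_running_parallel_step",
    inputs={"job_data": Input(type=AssetTypes.MLTABLE)},
    outputs={"job_output": Output(type=AssetTypes.URI_FOLDER)},
    input_data="${{inputs.job_data}}",
    instance_count=2,
    max_concurrency_per_instance=1,
    mini_batch_size="1",
    task=RunFunction(
        code="./src",
        entry_script="run.py",
        environment="azureml:my-environment:1",
        # Append the timeout to any existing arguments so each run() call
        # is allowed 1800 seconds instead of the 60s default.
        program_arguments="--job_output ${{outputs.job_output}} "
                          "--run_invocation_timeout 1800",
    ),
)
```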
Hi @davystrong. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.
Hi @davystrong, since you haven’t asked that we “/unresolve” the issue, we’ll close this out. If you believe further discussion is needed, please add a comment “/unresolve” to reopen the issue.