[Bug] Sometimes the `azure` backend hangs for a very long time on instance creation
Steps to reproduce:
- Run an instance using the `azure` backend
Reproduced only intermittently (possibly when there is no capacity, but this is unconfirmed).
Actual behavior:
- The run hangs as `submitted`
- The last server log is:
{
"message": "Requesting Standard_NV6ads_A10_v5 spot instance in westeurope...",
"logger": "dstack._internal.core.backends.azure.compute",
"timestamp": "2024-06-24 13:16:29,022",
"level": "INFO"
}
- Stopping the run doesn't help
Notes:
The impact of this issue is unclear and yet to be confirmed. It certainly blocks the current user; it is not yet known whether other users are also blocked.
`AzureCompute.create_instance()` hangs while waiting for the VM to be created here:
https://github.com/dstackai/dstack/blob/4a7a69127ff17727a15f7c6eff99b5940f9245e2/src/dstack/_internal/core/backends/azure/compute.py#L455
On the Azure side, the VM is stuck in the Creating state, which is why `create_instance()` never returns.
This should be fixed by setting a timeout on `poller.result()`.
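A minimal sketch of what a bounded wait could look like, not the actual fix. It assumes the poller exposes `wait(timeout)`, `done()` and `result()`, as azure-core's `LROPoller` does. As far as I can tell, the timeout only limits how long the call blocks and does not raise by itself when it elapses, so the sketch checks `done()` explicitly; this should be verified against the SDK version in use. `PollerLike`, `ProvisioningTimeoutError` and `wait_for_poller` are hypothetical names introduced for illustration.

```python
from typing import Any, Optional, Protocol


class PollerLike(Protocol):
    """Subset of the long-running-operation poller interface used below (assumed)."""

    def wait(self, timeout: Optional[float] = None) -> None: ...

    def done(self) -> bool: ...

    def result(self, timeout: Optional[float] = None) -> Any: ...


class ProvisioningTimeoutError(Exception):
    """Hypothetical error raised when the operation does not finish in time."""


def wait_for_poller(poller: PollerLike, timeout: float) -> Any:
    """Block for at most `timeout` seconds, then fail instead of hanging."""
    # Bounded wait: returns when the operation finishes or the timeout elapses.
    poller.wait(timeout)
    if not poller.done():
        # Still in progress (e.g. the VM is stuck in Creating): give up.
        raise ProvisioningTimeoutError(
            f"operation did not complete within {timeout} seconds"
        )
    # The operation has finished, so this returns (or raises) immediately.
    return poller.result()
```

With something like this, `create_instance()` could catch the timeout and report that the offer failed to provision, so the run does not stay in `submitted` forever. Whether the half-created VM also needs to be cleaned up at that point is a separate question.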
We need to ensure all requests to clouds have timeouts set.
Also, consider updating job processing tasks so that the server can process more than one job/run in parallel to prevent one stuck job from blocking the processing of other jobs.
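To illustrate the parallel-processing idea, here is a rough sketch using plain `asyncio`, not the server's actual task code. `process_jobs`, `process_job` and `per_job_timeout` are hypothetical names; the real processing function and job model will differ.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def process_jobs(
    job_ids: Iterable[str],
    process_job: Callable[[str], Awaitable[None]],
    per_job_timeout: float,
) -> Dict[str, str]:
    """Process jobs concurrently so that one stuck job cannot block the rest."""

    async def run_one(job_id: str):
        try:
            # Each job gets its own time budget.
            await asyncio.wait_for(process_job(job_id), timeout=per_job_timeout)
            return job_id, "done"
        except asyncio.TimeoutError:
            # The stuck job is cancelled; the other jobs are unaffected.
            return job_id, "timed_out"

    results = await asyncio.gather(*(run_one(job_id) for job_id in job_ids))
    return dict(results)
```

One caveat: `asyncio.wait_for` can only cancel a coroutine at an `await` point. A blocking SDK call running in a worker thread would keep running after the timeout, so the cloud requests themselves still need their own timeouts.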