dstack icon indicating copy to clipboard operation
dstack copied to clipboard

[Bug] Sometimes the `azure` backend hangs for very long on instance creation

Open peterschmidt85 opened this issue 1 year ago • 2 comments

Steps to reproduce:

  1. Run an instance using the azure backend

Reproduced only sometimes (I guess when there is no capacity but that's not sure)

Actual behavior:

  1. The run hangs as submitted
  2. The last server log is
{
    "message": "Requesting Standard_NV6ads_A10_v5 spot instance in westeurope...",
    "logger": "dstack._internal.core.backends.azure.compute",
    "timestamp": "2024-06-24 13:16:29,022",
    "level": "INFO"
}
  1. Stopping the run doesn't help

Notes:

The impact of this issue is unclear and yet to be confirmed. Certainly blocks the current user. Unsure if other users are also blocked.

peterschmidt85 avatar Jun 24 '24 13:06 peterschmidt85

AzureCompute.create_instance() hangs while waiting for the vm to be created here:

https://github.com/dstackai/dstack/blob/4a7a69127ff17727a15f7c6eff99b5940f9245e2/src/dstack/_internal/core/backends/azure/compute.py#L455

On Azure side, the vm stuck in the Creating state – that's why create_instance() never returns.

Should be fixed by setting timeout on poller.result().

We need to ensure all requests to clouds have timeouts set.

r4victor avatar Jun 24 '24 13:06 r4victor

Also, consider updating job processing tasks so that the server can process more than one job/run in parallel to prevent one stuck job from blocking the processing of other jobs.

r4victor avatar Jun 24 '24 13:06 r4victor

This issue is stale because it has been open for 30 days with no activity.

peterschmidt85 avatar Jul 25 '24 01:07 peterschmidt85