
NVIDIA #20210102.1 Pipeline Failure

Open xpillons opened this issue 4 years ago • 4 comments

  • rsync connection refused for ND40rs_v2 and Gen2 image https://azurecat.visualstudio.com/hpccat/_build/results?buildId=10393&view=logs&j=55fc7d3f-746a-55f6-bb80-32903d2b68f6&t=c44d176c-8c9a-532a-3ce7-9ebfde9ffc60&l=388

  • ssh connection timeout for Standard_NV12s_v3 and Gen1 image. The timeout occurs after the kernel update; LIS is not installed. AZHPC should have failed at step 4, but it failed at step 6 https://azurecat.visualstudio.com/hpccat/_build/results?buildId=10393&view=logs&j=40a7dfaa-edcf-57d7-da50-33204f1e0241&t=eef1fa0f-de1b-545a-8af2-256fc8a5c4c1&l=2655

xpillons avatar Jan 04 '21 09:01 xpillons

Manually reran the pipeline. Gen2 passed. Gen1 failed with this error:

Resource : gpumaster - OSProvisioningTimedOut
Message : OS Provisioning for VM 'gpumaster' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. For details on how to check current provisioning state of Windows VMs, refer to https://aka.ms/WindowsVMLifecycle and Linux VMs, refer to https://aka.ms/LinuxVMLifecycle. None

Allocating NV12s_v3 is taking too long.

xpillons avatar Jan 04 '21 17:01 xpillons

@xpillons, got a similar failure today running the nvidia pipeline. https://azurecat.visualstudio.com/hpccat/_build/results?buildId=10563&view=logs&j=40a7dfaa-edcf-57d7-da50-33204f1e0241&t=eef1fa0f-de1b-545a-8af2-256fc8a5c4c1&l=280 Only 2 seconds elapsed between "build install scripts" and the rsync error. The error is a connection refused. I believe we already check that sshd is running before trying to connect, but this does not fix the problem. If there is no quick fix for this (e.g. some additional flag), then maybe it would be worth the time to re-architect this (e.g. replace rsync with something else?). This type of error is occurring too often.

garvct avatar Jan 25 '21 15:01 garvct

@edwardsp can you take a look and check why the prsync is failing? I can see in the code that ssh is tested upfront, but I'm not 100% sure about the sequence. Otherwise, maybe we should add a retry in the rsync Python wrapper function.

xpillons avatar Jan 28 '21 10:01 xpillons
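For reference, the retry idea raised in the comment above could look roughly like the sketch below. It is not taken from the azhpc code base: `run_with_retry`, its parameters, and the injectable `runner` and `sleep` arguments are hypothetical names chosen for illustration. It assumes the wrapper launches rsync as a subprocess and that a non-zero exit code (such as a refused connection) is worth retrying with an increasing delay.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=5, delay=1.0, backoff=2.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with 0, waiting longer after each failure.

    Hypothetical helper: returns the 1-based number of the attempt that
    succeeded, or raises RuntimeError once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Back off before the next try so sshd has time to come up.
            sleep(delay)
            delay *= backoff
    raise RuntimeError(f"command failed after {attempts} attempts: {cmd}")
```

A call such as `run_with_retry(["rsync", "-a", "src/", "user@host:dst/"])` would then tolerate a few early "connection refused" failures; the rsync arguments here are placeholders, not the ones azhpc uses.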

ssh isn't tested before the initial rsync, so I have just opened a PR that adds an ssh test.

edwardsp avatar Jan 28 '21 13:01 edwardsp
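As an illustration of testing ssh before the first rsync, the sketch below polls the ssh port until it accepts a TCP connection. It is not the content of the PR mentioned above: `wait_for_port` and its parameters are hypothetical, and it assumes sshd listens on port 22. An open port only shows that something is listening, not that authentication will succeed, so a real check might still need to run an actual ssh command.

```python
import socket
import time


def wait_for_port(host, port=22, timeout=300.0, interval=2.0):
    """Return True once host:port accepts a TCP connection, False on timeout.

    Hypothetical helper: polls every `interval` seconds until `timeout`
    seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means a listener (presumably sshd) is up.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port(hostname)` before the initial rsync, and aborting the step when it returns False, would make a node that is not reachable yet fail early with a clear message instead of a refused connection.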