coreos-assembler icon indicating copy to clipboard operation
coreos-assembler copied to clipboard

platform/azure/instance.go: make CreateOrUpdate timeout aware

Open saqibali-2k opened this issue 2 years ago • 2 comments

In a recent test run in FCOS CI, we noticed function CreateOrUpdate hung and took 30 minutes to terminate. Let's gate this piece of code with a timeout.

saqibali-2k avatar Jun 07 '22 21:06 saqibali-2k

/retest

dustymabe avatar Jun 08 '22 01:06 dustymabe

Saw something similar today in https://jenkins-fedora-coreos-pipeline.apps.ocp.fedoraproject.org/job/kola-azure/142/.

[2022-07-11T21:30:09.575Z] --- PASS: rpmostree.upgrade-rollback (791.77s)
[2022-07-11T21:30:09.575Z]     --- PASS: rpmostree.upgrade-rollback/upgrade (205.16s)
[2022-07-11T21:30:09.575Z]     --- PASS: rpmostree.upgrade-rollback/rollback (172.31s)
[2022-07-11T21:43:46.179Z] 2022-07-11T21:43:41Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=200 -- Original Error: Long running operation terminated with status 'Failed': Code="OSProvisioningTimedOut" Message="OS Provisioning for VM 'kola-b5d64948-0851c4825e' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. Also, make sure the image has been properly prepared (generalized).\r\n * Instructions for Windows: https://azure.microsoft.com/documentation/articles/virtual-machines-windows-upload-image/ \r\n * Instructions for Linux: https://azure.microsoft.com/documentation/articles/virtual-machines-linux-capture-image/ \r\n * If you are deploying more than 20 Virtual Machines concurrently, consider moving your custom image to shared image gallery. Please refer to https://aka.ms/movetosig for the same."
[2022-07-11T21:50:43.948Z] Sending interrupt signal to process

Let's get together soon to get your Azure set up so we can test this PR and get it in.

dustymabe avatar Jul 12 '22 02:07 dustymabe

Ok for posterity the original error we saw was in kola-azure#57 (link now expired) from April 30th:

harness.go:1346: Cluster failed: creating storage account:
storage.AccountsClient#Create: Failure sending request: StatusCode=504 -- 
Original Error: Long running operation terminated with status 'Failed':
Code="GatewayTimeout" Message="The gateway did not receive a response from Microsoft.Storage' within the specified time period."

Just saw another long running operation today (kola-azure#50) that took a while to time out and caused the whole job to timeout at the jenkins level:

kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: 
Failure sending request: StatusCode=200 --
Original Error: Long running operation terminated with status
'Failed': Code="OSProvisioningTimedOut" Message="OS Provisioning for VM 'kola-48435b97-33c94b1521' did not finish in the allotted time.
The VM may still finish provisioning successfully. Please check provisioning state later.
Also, make sure the image has been properly prepared (generalized).\r\n * Instructions
for Windows: https://azure.microsoft.com/documentation/articles/virtual-machines-windows-upload-image/
\r\n * Instructions for Linux: https://azure.microsoft.com/documentation/articles/virtual-machines-linux-capture-image/
\r\n * If you are deploying more than 20 Virtual Machines concurrently, consider moving your custom
image to shared image gallery. Please refer to https://aka.ms/movetosig for the same."

I'll try to resolve remaining issues with this PR and test it so we can get it merged.

dustymabe avatar Sep 06 '22 23:09 dustymabe

and in kola-azure#52 we saw some similar issues that caused the job to eventually time out at the jenkins level:

[2022-09-06T21:54:03.396Z] ⚠️  Skipping kola test pattern "fcos.internet":
[2022-09-06T21:54:03.396Z]   👉 https://github.com/coreos/coreos-assembler/pull/1478
[2022-09-06T21:54:03.396Z] ⚠️  Skipping kola test pattern "podman.workflow":
[2022-09-06T21:54:03.396Z]   👉 https://github.com/coreos/coreos-assembler/pull/1478
[2022-09-06T21:54:03.396Z] === RUN   coreos.selinux.boolean
[2022-09-06T21:54:03.396Z] === RUN   coreos.selinux.enforce
[2022-09-06T21:54:03.396Z] === RUN   ext.config.gshadow
[2022-09-06T21:54:03.396Z] === RUN   fcos.metadata.azure
[2022-09-06T21:54:03.396Z] === RUN   fcos.filesystem
[2022-09-06T21:54:03.396Z] === RUN   fcos.users.shells
[2022-09-06T21:54:03.396Z] === RUN   basic
[2022-09-06T21:54:03.396Z] === RUN   rootfs.uuid
[2022-09-06T21:54:03.396Z] === RUN   coreos.tls.fetch-urls
[2022-09-06T21:54:03.396Z] === RUN   ostree.remote
[2022-09-06T21:54:03.396Z] === RUN   rhcos.selinux.boolean.persist
[2022-09-06T21:54:03.396Z] === RUN   ext.config.firewall.iptables-legacy
[2022-09-06T21:54:03.396Z] === RUN   ostree.hotfix
[2022-09-06T21:54:03.396Z] === RUN   rpmostree.install-uninstall
[2022-09-06T21:54:03.396Z] === RUN   ext.config.swap.zram-generator
[2022-09-06T21:54:03.396Z] === RUN   fcos.ignition.misc.empty
[2022-09-06T21:54:03.396Z] === RUN   coreos.ignition.groups
[2022-09-06T21:54:03.396Z] === RUN   coreos.ignition.security.tls
[2022-09-06T21:54:03.396Z] === RUN   ext.config.kdump.crash
[2022-09-06T21:54:03.396Z] === RUN   coreos.ignition.resource.local
[2022-09-06T21:54:03.396Z] === RUN   fcos.network.listeners
[2022-09-06T21:54:03.396Z] === RUN   coreos.ignition.v2.users
[2022-09-06T21:54:03.396Z] === RUN   podman.base
[2022-09-06T21:54:03.396Z] === RUN   podman.network-single
[2022-09-06T21:54:03.396Z] === RUN   coreos.ignition.once
[2022-09-06T21:54:03.396Z] === RUN   fcos.ignition.v3.noop
[2022-09-06T21:54:03.396Z] === RUN   ext.config.ignition.resource.remote
[2022-09-06T21:54:03.396Z] === RUN   coreos.auth.verify
[2022-09-06T21:54:03.396Z] === RUN   ext.config.ignition.remote
[2022-09-06T21:54:03.396Z] === RUN   coreos.ignition.instantiated.enable-unit
[2022-09-06T21:54:03.396Z] === RUN   rpmostree.status
[2022-09-06T21:54:03.396Z] === RUN   ostree.unlock
[2022-09-06T21:54:03.396Z] === RUN   non-exclusive-test-bucket-0
[2022-09-06T21:54:03.396Z] === RUN   coreos.ignition.journald-log
[2022-09-06T21:54:03.396Z] === RUN   coreos.ignition.ssh.key
[2022-09-06T21:54:03.396Z] === RUN   ext.config.ignition.kargs
[2022-09-06T21:54:03.396Z] === RUN   rpmostree.upgrade-rollback
[2022-09-06T22:18:10.905Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/8f40c57d-8041-4db8-ae81-dbf5e70fa8e6?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.905Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/77d91752-7c4d-4e9b-8fb9-4ea07b5ee94c?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.905Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/1160eae0-552d-4ac7-b593-033407cf4ee1?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.905Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/a8a59671-41c0-4901-b938-bf8c629766f9?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.906Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/1da6017f-53f7-4b88-8dcf-e31cd77df8da?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.906Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/6d3101f4-5bb5-4468-ba93-123fbe991c9f?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.906Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/174b7d35-799d-4ae4-9908-bbcd79021892?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.906Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/0b9f0901-68ec-4899-b90a-5c9e3a320c8b?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.906Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/5197a407-e905-4479-ace6-e4b59609a17d?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out
[2022-09-06T22:18:10.906Z] 2022-09-06T22:18:01Z kola: retryloop: failed to bring up machines: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Get "https://management.azure.com/subscriptions/8ddde9ba-f5e7-4a4f-a9a1-64181b250697/providers/Microsoft.Compute/locations/eastus/operations/fc2dcf32-0c1b-4f61-9f00-e00bc47d0708?p=90b312a9-a01c-4368-a446-25dd0f6f1f5c&api-version=2016-04-30-preview": read tcp 10.131.1.23:56386->20.40.227.150:443: read: connection timed out

So you can see it took the API calls about 24 minutes to time out.

dustymabe avatar Sep 06 '22 23:09 dustymabe

I verified basic functionality here. I didn't try to artificially reproduce a timeout failure. Hoping we can ship this anyway.

dustymabe avatar Sep 06 '22 23:09 dustymabe

Tested with a super short timeout of 10s and it seems to be working (I think):

=== RUN   non-exclusive-test-bucket-0
2022-09-07T03:00:44Z kola: retryloop: failed to bring up machines: creating instance failed: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest: AfterDelay canceled before full delay
2022-09-07T03:00:57Z kola: retryloop: failed to bring up machines: creating instance failed: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest: AfterDelay canceled before full delay

dustymabe avatar Sep 07 '22 03:09 dustymabe