
Add retries and confirmations to ensure CNCF runners and machines are removed.

Open gyohuangxin opened this issue 3 years ago • 15 comments

Description

Some CNCF runners are not being removed after the tests finish, and their number gradually increases over time. We can delete them manually, but it would be better to make sure they are removed properly. [image]

The same thing happens with the deletion of Equinix servers: [image]

Expected Behavior

We should add retries and confirmations to ensure CNCF runners and machines are removed.

Screenshots/Logs

Environment:

  • Meshery Version:
  • Kubernetes Version:
  • Host OS:
  • Browser:

gyohuangxin avatar Jul 15 '22 02:07 gyohuangxin

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 09 '22 01:09 stale[bot]

This issue is being automatically closed due to inactivity. However, you may choose to reopen this issue.

stale[bot] avatar Sep 21 '22 00:09 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 12 '22 14:11 stale[bot]

This issue is being automatically closed due to inactivity. However, you may choose to reopen this issue.

stale[bot] avatar Nov 22 '22 21:11 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 07 '23 12:01 stale[bot]

Uh-oh. We do need to complete this item.

leecalcote avatar Jan 08 '23 18:01 leecalcote

It's possible to create machines on Equinix Metal in such a way that there's a termination time associated with them. See the "termination_time" field at

https://deploy.equinix.com/developers/api/metal/#tag/Devices/operation/createDevice

in the Equinix Metal API reference.

(That's not a substitute for cleanup, but it could backstop any other efforts if there's a bug somewhere else).
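
As a rough illustration of that backstop, here is a minimal sketch that builds a create-device request body with a "termination_time" a few hours in the future. The hostname, plan, metro, and operating system values are placeholders, the METAL_AUTH_TOKEN and PROJECT_ID variable names are assumptions, and the timestamp helper relies on GNU date; check the endpoint and header against the API reference linked above before using it.

```shell
#!/usr/bin/env bash
# Sketch: build a createDevice payload that carries a termination_time backstop.
# Plan, metro, and OS values are placeholders, not this project's real settings.

# Print an ISO 8601 UTC timestamp N hours from now (GNU date syntax).
termination_time_in() {
  local hours="$1"
  date -u -d "+${hours} hours" +"%Y-%m-%dT%H:%M:%SZ"
}

# Print the JSON request body for a device with a termination time set.
build_device_payload() {
  local hostname="$1" hours="$2"
  printf '{"hostname":"%s","plan":"c3.small.x86","metro":"da","operating_system":"ubuntu_22_04","termination_time":"%s"}\n' \
    "$hostname" "$(termination_time_in "$hours")"
}

# Send the request. Defined but not called here; it needs METAL_AUTH_TOKEN
# and PROJECT_ID (assumed variable names) in the environment.
create_device() {
  curl -sS -X POST \
    -H "X-Auth-Token: ${METAL_AUTH_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$(build_device_payload "$1" "$2")" \
    "https://api.equinix.com/metal/v1/projects/${PROJECT_ID}/devices"
}
```

Calling build_device_payload cil-runner-1 4 prints a body whose termination time is four hours from now, so a machine that slips past the cleanup script would still be reclaimed eventually.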

vielmetti avatar Feb 28 '23 20:02 vielmetti

There was a short-lived API outage yesterday, described at

https://status.equinixmetal.com/incidents/h30n2jlr5d3p

which may have impacted manual deletion of these systems. Please retry if you were affected by this. As of this writing, there are 48 systems deployed.

vielmetti avatar Mar 01 '23 15:03 vielmetti

@vielmetti I'm still having trouble accessing the management UI: [image]

gyohuangxin avatar Mar 01 '23 16:03 gyohuangxin

@gyohuangxin can you open up a ticket with our support team? I'll share your UI issue with the team, but it may be something specific to your account.

vielmetti avatar Mar 01 '23 17:03 vielmetti

@gyohuangxin Can you please task someone else on the project with helping you clean up the idle and stranded resources while we sort out your access problems?

vielmetti avatar Mar 07 '23 12:03 vielmetti

The code that notices that a deprovision failed is here:

https://github.com/layer5io/meshery-smp-action/blob/862c5283953f1b5a3a607c9e1f00461f98a4b4d5/.github/workflows/scripts/stop-cil-runner.sh#L19

It logs an error:

echo "ERROR: Failed to remove CNCF CIL machine: $hostname, device id: $device_id."

and then exits without retrying. If anything fails for a temporary reason, the machines will live on indefinitely until someone removes them manually.
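
The missing retry could be added with a small wrapper along these lines. This is only a sketch, not the script's current behavior: the retry function and its arguments are illustrative, and remove_device in the usage note is a hypothetical stand-in for whatever command the script really uses to delete a machine.

```shell
#!/usr/bin/env bash
# Sketch: retry a command with exponential backoff instead of giving up
# on the first failure.

# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, or 1 after MAX_ATTEMPTS failures.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "ERROR: '$*' still failing after $attempt attempts." >&2
      return 1
    fi
    echo "WARN: attempt $attempt of '$*' failed; retrying in ${delay}s." >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))   # double the wait before the next attempt
  done
}
```

A call such as retry 5 10 remove_device "$device_id" would then only reach the existing "Failed to remove CNCF CIL machine" error after five failed attempts. The same wrapper could also wrap a follow-up check that the device no longer exists, which would cover the "confirmation" half of this issue.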

Where does this error log go? If it's published somewhere we could look for patterns.

vielmetti avatar Mar 22 '23 15:03 vielmetti

@Revolyssup, will you please add this to tomorrow’s CI meeting? @edwvilla’s help here is much appreciated. Let’s ensure that we have a quick review and resolution. // @gyohuangxin

leecalcote avatar Mar 22 '23 15:03 leecalcote

All existing servers were manually deprovisioned today. A fresh batch of servers, provisioned by the scheduled workflow, is running now. Let's see whether those servers are automatically deprovisioned when their tasks complete.

leecalcote avatar Mar 23 '23 02:03 leecalcote

Yes, it seems that the test servers are successfully deprovisioned at the end of the test. 👍

leecalcote avatar Mar 23 '23 03:03 leecalcote