manageiq Destroying a large EMS times out and never completes

Cascade destroy of a large provider can take a significant amount of time. The default queue timeout is 10 minutes, which is far too short even for medium-sized providers.

This results in errors like the following on the orchestrate_destroy queue item:

ERROR -- evm: MIQ(MiqQueue#deliver) Message id: [239], timed out after 600.006065233 seconds. Timeout threshold [600]

The short-term solution is to extend the msg_timeout for the orchestrate_destroy queue item so that it handles the majority of cases. Longer term, we should at a minimum run this in a thread, with the main thread monitoring the operation and heartbeating at shorter intervals, in order to handle the case where the delete can take many hours.
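As a rough illustration of the thread-plus-heartbeat idea, here is a minimal plain-Ruby sketch. It is not ManageIQ code: `run_with_heartbeat`, its `interval:` and `heartbeat:` parameters, and the `sleep` standing in for the destroy are all hypothetical names chosen for the example. It only shows the shape of the approach, where the slow work runs in a worker thread and the calling thread wakes up at a fixed interval to report that it is still alive.

```ruby
# Hypothetical helper (not a ManageIQ API): runs the given block in a worker
# thread while the calling thread invokes `heartbeat` every `interval` seconds.
def run_with_heartbeat(interval:, heartbeat:)
  worker = Thread.new do
    # Errors are surfaced through join/value below, so skip the stderr report.
    Thread.current.report_on_exception = false
    yield
  end

  # Thread#join(limit) returns nil if the thread is still running after
  # `limit` seconds, and the thread itself once it has finished.
  heartbeat.call until worker.join(interval)

  # Returns the block's result (join above re-raises if the worker failed).
  worker.value
end

beats = 0
result = run_with_heartbeat(interval: 0.05, heartbeat: -> { beats += 1 }) do
  sleep 0.2 # stand-in for the slow cascade destroy
  :destroyed
end

puts "finished with #{result.inspect} after #{beats} heartbeat(s)"
```

In a real worker the heartbeat lambda would be whatever the worker already uses to signal liveness, and the overall timeout would then only need to cover a missed heartbeat rather than the whole delete.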

Even longer term, I'm thinking of spinning off a background job for this so it doesn't lock up a generic worker for hours.

Nov 29 '21 19:11 agrare

Do we know what exactly is taking so long in these cases though? Usually we've found it's some association with a dependent => destroy or dependent => nullify, which we can ignore and then have a proper purger.

Nov 30 '21 20:11 Fryguy

In this latest case I don't know. I've seen it with things like hardwares and operating systems that are linked off of containers or hosts.

> Usually we've found it's some association with a dependent => destroy or dependent => nullify, which we can ignore and then have a proper purger.

This might be effective, but also seems like a hack/workaround for how poorly performing rails dependent destroy is. Unless we have a purger for every model that can grow to over a few thousand records we're going to have issues.

Nov 30 '21 20:11 agrare

> This might be effective, but also seems like a hack/workaround for how poorly performing rails dependent destroy is. Unless we have a purger for every model that can grow to over a few thousand records we're going to have issues.

True - we may also be overly aggressive with destroy where a delete would suffice.
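To make the destroy-versus-delete point concrete: in Rails, a dependent destroy loads each associated record and runs its callbacks one row at a time, while a delete goes straight to SQL and can remove many rows per statement. The sketch below is a toy model in plain Ruby, not ActiveRecord or ManageIQ code. `FakeTable` and `purge_in_batches` are hypothetical names, and the in-memory hash only stands in for a table so the statement counts can be compared.

```ruby
# Toy in-memory stand-in for a database table that counts statements issued.
class FakeTable
  attr_reader :statements

  def initialize(ids)
    @rows = ids.to_h { |id| [id, true] }
    @statements = 0
  end

  # Returns a fresh array, so callers can delete while iterating over it.
  def ids
    @rows.keys
  end

  def size
    @rows.size
  end

  # One statement per row, as a per-record destroy would issue.
  def delete_one(id)
    @statements += 1
    @rows.delete(id)
  end

  # One statement for a whole batch, like DELETE ... WHERE id IN (...).
  def delete_many(ids)
    @statements += 1
    ids.each { |id| @rows.delete(id) }
  end
end

# Hypothetical purger: removes every row in fixed-size batches.
def purge_in_batches(table, batch_size: 1_000)
  table.ids.each_slice(batch_size) { |batch| table.delete_many(batch) }
end

per_row = FakeTable.new(1..5_000)
per_row.ids.each { |id| per_row.delete_one(id) }

batched = FakeTable.new(1..5_000)
purge_in_batches(batched, batch_size: 1_000)

puts "per-row statements: #{per_row.statements}, batched statements: #{batched.statements}"
```

The trade-off is that a bulk delete skips callbacks, so it only suffices where nothing depends on them, which is the "overly aggressive with destroy" question above.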

Nov 30 '21 21:11 Fryguy

This issue has been automatically marked as stale because it has not been updated for at least 3 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.

Feb 27 '23 00:02 miq-bot

This issue has been automatically closed because it has not been updated for at least 3 months.

Feel free to reopen this issue if this issue is still valid.

Thank you for all your contributions! More information about the ManageIQ triage process can be found in the triage process documentation.

May 29 '23 00:05 miq-bot
