manageiq
EVM snapshots are removed by the wrong MIQ instance
We are using multiple MIQ environments connected to the same vCenters. The first MIQ instance runs SmartState Analysis on a subset of VMs. The second MIQ instance does not run SmartState Analysis. When the first MIQ instance creates a VM snapshot to perform SmartState Analysis, the second instance deletes it before the SmartState process finishes.
The VMware events screenshot shows the problem.
Is there a way to disable the EVM snapshot removal job, or to limit it to VMs managed by the SmartState Analysis process?
There is a schedule we have that runs regularly to clean up "old" EVM snapshots that are created by us. If we don't see the snapshot GUID in our database, then it gets removed.
@agrare Can you put more details here? I'm not sure we can actually fix this without storing something different.
This is a really interesting one. The way the snapshot cleanup works is that we schedule a Job.check_for_evm_snapshots in the MiqScheduleWorker.
We then, based on the snapshot name, check if the VmScan associated with that snapshot is active [ref] and if not we delete the snapshot.
If you have two different environments pointing to the same vCenter then we have no way of knowing if VmScans on the other environment are active since there are two completely separate databases.
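The conflict described above can be illustrated with a minimal Ruby sketch. It is not the ManageIQ implementation: the Snapshot struct, the description format, JOB_GUID_PATTERN, and snapshots_to_delete are all hypothetical names chosen for illustration. It only shows how an environment with a separate database ends up treating another environment's snapshot as orphaned.

```ruby
require "set"

# Hypothetical snapshot record; the description is assumed to carry the scan job GUID.
Snapshot = Struct.new(:vm, :description)

# Hypothetical pattern for extracting the job GUID from a snapshot description.
JOB_GUID_PATTERN = /scan job: (?<guid>[0-9a-f-]+)/

# Returns the snapshots this environment would delete: those whose job GUID
# is not among the active scan jobs recorded in this environment's own database.
def snapshots_to_delete(snapshots, active_job_guids)
  snapshots.select do |snap|
    match = JOB_GUID_PATTERN.match(snap.description)
    match && !active_job_guids.include?(match[:guid])
  end
end

snap = Snapshot.new("vm-01", "Snapshot for scan job: 1f0c3a52-aaaa-bbbb-cccc-000000000001")

# Environment A started the scan, so its database lists the job as active.
env_a_active = Set["1f0c3a52-aaaa-bbbb-cccc-000000000001"]
# Environment B has a separate database and knows nothing about that job.
env_b_active = Set[]

puts snapshots_to_delete([snap], env_a_active).map(&:vm).inspect
puts snapshots_to_delete([snap], env_b_active).map(&:vm).inspect
```

Environment A keeps the snapshot because the job is active in its database, while environment B selects the same snapshot for deletion.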
I could see a number of possible ways of fixing this:
- Add an option to disable the scheduled snapshot cleanup
- Only run the scheduled snapshot cleanup if the SmartProxy role is enabled (you mentioned you only run SSA scans on the one environment, so this would be a good option)
- Add the MiqRegion GUID to the name of the snapshot so we wouldn't delete snapshots from other regions/environments. This has two downsides, however: if you remove and re-add MIQ we wouldn't clean up snaps from the old environment, and we would still have to handle existing snapshots with the old format.
- You could create two credentials that have non-overlapping inventory scopes so that the two environments don't have overlapping visibility
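As a rough illustration of the region GUID option and its two downsides, here is a small Ruby sketch. The description format, DESCRIPTION_PATTERN, build_description, and cleanup_decision are hypothetical and do not reflect how ManageIQ actually names snapshots.

```ruby
# Hypothetical description format; the real snapshot naming may differ.
DESCRIPTION_PATTERN = /\AEVM snapshot region=(?<region>\S+) job=(?<job>\S+)\z/

def build_description(region_guid, job_guid)
  "EVM snapshot region=#{region_guid} job=#{job_guid}"
end

# Returns one of:
#   :delete - created by this region and the scan job is no longer active
#   :keep   - created by this region with an active job, or created by another region
#   :legacy - old-format description that would need separate handling
def cleanup_decision(description, own_region_guid, active_job_guids)
  match = DESCRIPTION_PATTERN.match(description)
  return :legacy unless match
  return :keep unless match[:region] == own_region_guid

  active_job_guids.include?(match[:job]) ? :keep : :delete
end

desc = build_description("region-a", "job-1")
puts cleanup_decision(desc, "region-b", [])
puts cleanup_decision(desc, "region-a", [])
puts cleanup_decision("EVM Snapshot", "region-a", [])
```

The :keep result for a foreign region is also the first downside: if "region-a" is removed and re-added under a new GUID, nothing will ever delete its old snapshots. The :legacy result corresponds to the second downside.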
I'm thinking #2 would be ideal, but I want to check if that works for you @lamm
#2 has a positive side effect of generally less processing for environments with SmartProxy disabled entirely.
I think that the second solution should be implemented because it is not useful to try to delete snaps if the role is not active and it consumes resources for nothing.
The first solution is useless if the second is implemented. Especially since that would mean you can enable SSA without ever deleting the snapshots.
In our use case, we need to enable SSA on multiple MIQ environments because each MIQ environment manages a separate set of VMs. Implementing solution 2 avoids the problem when a single MIQ environment executes the SSA process but not when SSA is executed by multiple MIQ environments.
We cannot implement solution 4 because the VMs are mixed in different VMWare clusters.
Regarding solution 3, it allows managing multiple SSA processes, but another solution would be to store the UID of the generated snapshot in the VMDB. When the purge process starts, it only processes snapshot UIDs saved in the VMDB. This avoids the drawback of solution 3.
"but another solution would be to store the UID of the generated snapshot in the VMDB. When the purge process starts, it only processes snapshot UIDs saved in the VMDB"
Well, this would prevent snapshots from previous MIQ environments from ever being cleaned up, which is the same downside as storing the region GUID.
In this case, perhaps the cleanup process can be based on the age of the snapshot when it belongs to another MIQ environment? Could this age be configurable in the MIQ advanced settings?
@lamm we already have an evm_snapshot_interval setting, which is used when checking for active jobs by GUID. It defaults to 1 hour, but you could change that to be longer as a temporary workaround.
From my understanding, evm_snapshot_delete_delay_for_job_not_found is the time interval between two executions of the SSA snapshot cleanup job. But I don't really understand what the evm_snapshot_interval parameter does. Could you explain it to me?
Okay we will now check for an active smartstate role before queueing a Job.check_for_evm_snapshots task so if your second region does not have SSA enabled then we will not delete evm snapshots for that environment.
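The role check described here can be sketched roughly as follows. This is a simplified illustration, not the actual scheduler code: the role name "smartstate", the plain array used as a queue, and enqueue_snapshot_cleanup are assumptions made for the example.

```ruby
# Assumed role name; the identifier used by the real server roles may differ.
SMARTSTATE_ROLE = "smartstate"

# Queues the snapshot cleanup task only when the SmartState role is active.
# `queue` is a plain array standing in for the real work queue.
def enqueue_snapshot_cleanup(queue, active_roles)
  return false unless active_roles.include?(SMARTSTATE_ROLE)

  queue << { class_name: "Job", method_name: "check_for_evm_snapshots" }
  true
end

queue = []
enqueue_snapshot_cleanup(queue, ["ems_inventory", "scheduler"])  # role absent, nothing queued
enqueue_snapshot_cleanup(queue, ["scheduler", "smartstate"])     # role present, task queued
puts queue.length
```

An environment without the role never queues the task, so it never touches snapshots created by another environment. Two environments that both run SSA would still conflict.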
In the meantime you can extend the schedule_worker/evm_snapshot_delete_delay_for_job_not_found setting to be significantly longer than the max duration of your SSA scans.
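To make the workaround concrete, here is a small Ruby sketch of how such a delay could be sized and applied. It assumes the setting acts as a minimum age before a snapshot with no matching job is deleted. The helper names and the safety factor of 2 are hypothetical and not part of ManageIQ.

```ruby
# Picks a delete delay comfortably longer than the slowest observed scan.
# The default safety factor is an arbitrary illustration, not a recommendation.
def suggested_delete_delay(max_scan_seconds, safety_factor = 2)
  max_scan_seconds * safety_factor
end

# A snapshot with no matching job is only eligible for deletion once it is
# at least `delete_delay_seconds` old (assumed semantics of the setting).
def old_enough_to_delete?(created_at, now, delete_delay_seconds)
  (now - created_at) >= delete_delay_seconds
end

delay = suggested_delete_delay(90 * 60)  # slowest scan takes 90 minutes
created = Time.at(0)

puts delay
puts old_enough_to_delete?(created, Time.at(60 * 60), delay)
puts old_enough_to_delete?(created, Time.at(4 * 60 * 60), delay)
```

With a delay longer than the slowest scan, a snapshot created by another environment is left alone while its scan can still be running, and is only removed later if it was genuinely abandoned.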
This issue has been automatically marked as stale because it has not been updated for at least 3 months.
If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.