cloudstack icon indicating copy to clipboard operation
cloudstack copied to clipboard

[WIP] Remove allocated snapshots / vm snapshots on start

Open sureshanaparti opened this issue 1 year ago • 13 comments

Description

Some latest active snapshots / vm snapshots are stuck in allocated state when MS is stopped, these are listed / shown in UI as well (not allowed to delete). Remove them on MS start itself.

This PR removes allocated snapshots / vm snapshots on start.

Fixes #8424

Types of changes

  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] Enhancement (improves an existing feature and functionality)
  • [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
  • [ ] build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • [ ] Major
  • [x] Minor

Bug Severity

  • [ ] BLOCKER
  • [ ] Critical
  • [ ] Major
  • [ ] Minor
  • [ ] Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Manually tested: Take VM Snapshot -> Stop MS -> Start MS

VM Snapshot record after MS stopped =>

                 id: 1
               uuid: 07542708-27b7-47c4-959a-f7f92828f43a
               name: i-2-3-VM_VS_20240108075812
       display_name: testvm01-snap
        description: NULL
              vm_id: 3
         account_id: 2
          domain_id: 1
service_offering_id: 1
   vm_snapshot_type: DiskAndMemory
              state: Allocated
             parent: NULL
            current: NULL
       update_count: 0
            updated: NULL
            created: 2024-01-08 07:58:12
            removed: NULL

VM Snapshot record after MS started (Not listed / shown in the UI) =>

                 id: 1
               uuid: 07542708-27b7-47c4-959a-f7f92828f43a
               name: i-2-3-VM_VS_20240108075812
       display_name: testvm01-snap
        description: NULL
              vm_id: 3
         account_id: 2
          domain_id: 1
service_offering_id: 1
   vm_snapshot_type: DiskAndMemory
              state: Allocated
             parent: NULL
            current: NULL
       update_count: 0
            updated: NULL
            created: 2024-01-08 07:58:12
            removed: 2024-01-08 07:59:26

How did you try to break this feature and the system with this change?

sureshanaparti avatar Jan 05 '24 07:01 sureshanaparti

@blueorangutan package

sureshanaparti avatar Jan 05 '24 07:01 sureshanaparti

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Jan 05 '24 07:01 blueorangutan

Codecov Report

:x: Patch coverage is 1.44928% with 68 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 16.17%. Comparing base (253ac03) to head (56f5195). :warning: Report is 13 commits behind head on 4.20.

Files with missing lines Patch % Lines
...stack/framework/jobs/impl/AsyncJobManagerImpl.java 2.70% 36 Missing :warning:
...tack/storage/snapshot/SnapshotDataFactoryImpl.java 0.00% 11 Missing and 1 partial :warning:
.../storage/vmsnapshot/DefaultVMSnapshotStrategy.java 0.00% 8 Missing :warning:
.../storage/vmsnapshot/ScaleIOVMSnapshotStrategy.java 0.00% 8 Missing :warning:
...a/com/cloud/vm/snapshot/VMSnapshotManagerImpl.java 0.00% 4 Missing :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##               4.20    #8452      +/-   ##
============================================
- Coverage     16.17%   16.17%   -0.01%     
- Complexity    13297    13298       +1     
============================================
  Files          5656     5656              
  Lines        498136   498218      +82     
  Branches      60432    60451      +19     
============================================
+ Hits          80584    80585       +1     
- Misses       408584   408659      +75     
- Partials       8968     8974       +6     
Flag Coverage Δ
uitests 4.00% <ø> (+<0.01%) :arrow_up:
unittests 17.02% <1.44%> (-0.01%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar Jan 05 '24 07:01 codecov[bot]

@sureshanaparti would it be better to clean the snapshots by storage garbage collector which run periodically ?

weizhouapache avatar Jan 05 '24 08:01 weizhouapache

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8207

blueorangutan avatar Jan 05 '24 08:01 blueorangutan

@sureshanaparti would it be better to clean the snapshots by storage garbage collector which run periodically ?

@weizhouapache issue is some active snapshots / vm snapshots are left in allocated when MS is stopped, no point in keeping them and wait until storage garbage collector cleans up (what if storage cleanup is disabled?) as these are listed / shown in UI as well (which are not allowed to delete). So, I think, it's better remove them on start itself. Any other thoughts / suggestions?

sureshanaparti avatar Jan 08 '24 05:01 sureshanaparti

@sureshanaparti would it be better to clean the snapshots by storage garbage collector which run periodically ?

@weizhouapache issue is some active snapshots / vm snapshots are left in allocated when MS is stopped, no point in keeping them and wait until storage garbage collector cleans up (what if storage cleanup is disabled?) as these are listed / shown in UI as well (which are not allowed to delete). So, I think, it's better remove them on start itself. Any other thoughts / suggestions?

@sureshanaparti the storage garbage collector cleans up the primary and secondary storage, including the snapshot/volumes/templates

                    // remove snapshots in Error state
                    List<SnapshotVO> snapshots = _snapshotDao.listAllByStatus(Snapshot.State.Error);

We could add the cleanup of Allocated snapshot and Allocated/Error vm snapshots in the process

IMHO, it is not the right place to cleanup during mgmt server start, as mgmt server might be running for several days.

weizhouapache avatar Jan 08 '24 08:01 weizhouapache

@sureshanaparti would it be better to clean the snapshots by storage garbage collector which run periodically ?

@weizhouapache issue is some active snapshots / vm snapshots are left in allocated when MS is stopped, no point in keeping them and wait until storage garbage collector cleans up (what if storage cleanup is disabled?) as these are listed / shown in UI as well (which are not allowed to delete). So, I think, it's better remove them on start itself. Any other thoughts / suggestions?

@sureshanaparti the storage garbage collector cleans up the primary and secondary storage, including the snapshot/volumes/templates

                    // remove snapshots in Error state
                    List<SnapshotVO> snapshots = _snapshotDao.listAllByStatus(Snapshot.State.Error);

We could add the cleanup of Allocated snapshot and Allocated/Error vm snapshots in the process

IMHO, it is not the right place to cleanup during mgmt server start, as mgmt server might be running for several days.

@weizhouapache Agreed for the resources in error state (as there might be things to reset / cleanup) and MS is running. snapshot / vm snapshots are in Allocated, mostly for the very recent active snapshot / vm snapshots before MS is stopped, so no need for any cleanup, better to set them as removed (Any event through state transition would still keep it in Allocated). Otherwise, at least have to not list them on UI, or allow explicit deletion from the UI.

Snapshots in destroying state are deleted here. So, I thought, it's better to remove there on MS start. https://github.com/apache/cloudstack/blob/3f9dd4dc07ff40a4e0216014715e4fe37fc46d28/server/src/main/java/com/cloud/storage/snapshot/SnapshotManagerImpl.java#L1519-L1523

sureshanaparti avatar Jan 08 '24 08:01 sureshanaparti

@weizhouapache issue is some active snapshots / vm snapshots are left in allocated when MS is stopped, no point in keeping them and wait until storage garbage collector cleans up (what if storage cleanup is disabled?) as these are listed / shown in UI as well (which are not allowed to delete). So, I think, it's better remove them on start itself. Any other thoughts / suggestions?

@sureshanaparti the storage garbage collector cleans up the primary and secondary storage, including the snapshot/volumes/templates

                    // remove snapshots in Error state
                    List<SnapshotVO> snapshots = _snapshotDao.listAllByStatus(Snapshot.State.Error);

We could add the cleanup of Allocated snapshot and Allocated/Error vm snapshots in the process IMHO, it is not the right place to cleanup during mgmt server start, as mgmt server might be running for several days.

@weizhouapache Agreed for the resources in error state (as there might be things to reset / cleanup) and MS is running. snapshot / vm snapshots are in Allocated, mostly for the very recent active snapshot / vm snapshots before MS is stopped, so no need for any cleanup, better to set them as removed (Any event through state transition would still keep it in Allocated). Otherwise, at least have to not list them on UI, or allow explicit deletion from the UI.

Snapshots in destroying state are deleted here. So, I thought, it's better to remove there on MS start.

https://github.com/apache/cloudstack/blob/3f9dd4dc07ff40a4e0216014715e4fe37fc46d28/server/src/main/java/com/cloud/storage/snapshot/SnapshotManagerImpl.java#L1519-L1523

@sureshanaparti the mgmt service is a key service for some users, especially on production. the service is running for several days without stopping. IMHO it is not good to cleanup resource during start ...

(OK with cleaning up resource in both start process and garbage collector)

weizhouapache avatar Jan 08 '24 08:01 weizhouapache

@weizhouapache issue is some active snapshots / vm snapshots are left in allocated when MS is stopped, no point in keeping them and wait until storage garbage collector cleans up (what if storage cleanup is disabled?) as these are listed / shown in UI as well (which are not allowed to delete). So, I think, it's better remove them on start itself. Any other thoughts / suggestions?

@sureshanaparti the storage garbage collector cleans up the primary and secondary storage, including the snapshot/volumes/templates

                    // remove snapshots in Error state
                    List<SnapshotVO> snapshots = _snapshotDao.listAllByStatus(Snapshot.State.Error);

We could add the cleanup of Allocated snapshot and Allocated/Error vm snapshots in the process IMHO, it is not the right place to cleanup during mgmt server start, as mgmt server might be running for several days.

@weizhouapache Agreed for the resources in error state (as there might be things to reset / cleanup) and MS is running. snapshot / vm snapshots are in Allocated, mostly for the very recent active snapshot / vm snapshots before MS is stopped, so no need for any cleanup, better to set them as removed (Any event through state transition would still keep it in Allocated). Otherwise, at least have to not list them on UI, or allow explicit deletion from the UI. Snapshots in destroying state are deleted here. So, I thought, it's better to remove there on MS start. https://github.com/apache/cloudstack/blob/3f9dd4dc07ff40a4e0216014715e4fe37fc46d28/server/src/main/java/com/cloud/storage/snapshot/SnapshotManagerImpl.java#L1519-L1523

@sureshanaparti the mgmt service is a key service for some users, especially on production. the service is running for several days without stopping. IMHO it is not good to cleanup resource during start ...

(OK with cleaning up resource in both start process and garbage collector)

ok, will check & update. thanks @weizhouapache

sureshanaparti avatar Jan 08 '24 10:01 sureshanaparti

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

github-actions[bot] avatar Feb 08 '24 13:02 github-actions[bot]

@sureshanaparti , does this need rebasing?

DaanHoogland avatar Jun 18 '24 08:06 DaanHoogland

on another note, is it maybe safe to first mark it as State == Error and then delete? (I think this is the same @weizhouapache suggests)

DaanHoogland avatar Jun 18 '24 08:06 DaanHoogland

@sureshanaparti , any progress on this?

DaanHoogland avatar Feb 03 '25 14:02 DaanHoogland

@blueorangutan package

DaanHoogland avatar Feb 04 '25 14:02 DaanHoogland

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Feb 04 '25 14:02 blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 12329

blueorangutan avatar Feb 04 '25 14:02 blueorangutan

@sureshanaparti are you working on this?

rohityadavcloud avatar Feb 28 '25 04:02 rohityadavcloud

@blueorangutan package

sureshanaparti avatar Aug 26 '25 10:08 sureshanaparti

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Aug 26 '25 10:08 blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✖️ debian ✔️ suse15. SL-JID 14733

blueorangutan avatar Aug 26 '25 10:08 blueorangutan

@blueorangutan package

weizhouapache avatar Sep 08 '25 15:09 weizhouapache

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Sep 08 '25 15:09 blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 14904

blueorangutan avatar Sep 08 '25 16:09 blueorangutan

is it ready for review/testing @sureshanaparti ?

weizhouapache avatar Sep 10 '25 07:09 weizhouapache

is it ready for review/testing @sureshanaparti ?

yes @weizhouapache

sureshanaparti avatar Sep 10 '25 08:09 sureshanaparti

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

github-actions[bot] avatar Sep 11 '25 16:09 github-actions[bot]

@blueorangutan package

sureshanaparti avatar Sep 12 '25 14:09 sureshanaparti

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Sep 12 '25 14:09 blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 14985

blueorangutan avatar Sep 12 '25 15:09 blueorangutan