Block use of internal and external snapshots on KVM

Open JoaoJandre opened this issue 5 months ago • 13 comments

Description

On KVM, there are two types of snapshots: internal and external. On ACS, most snapshot/backup solutions use external snapshots; the exception is disk-and-memory VM snapshots, which use internal snapshots (a KVM limitation, as far as I know).

However, since internal snapshots are stored inside the VM's volume (hence the name), if an internal snapshot is taken after an external snapshot and the external snapshot is restored, the internal snapshot is lost.
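The loss described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `ToyVolume` and its methods are hypothetical names, not QEMU or CloudStack code, and a volume is reduced to the list of internal snapshot names stored inside it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical, for illustration only): the volume "contains"
// its internal snapshots, while an external snapshot lives outside of it.
final class ToyVolume {
    private List<String> internalSnapshots = new ArrayList<>();

    // Internal snapshots are recorded inside the volume itself.
    void takeInternalSnapshot(String name) {
        internalSnapshots.add(name);
    }

    // An external snapshot is modeled as a copy of the volume's state,
    // kept outside the volume, as of the moment it is taken.
    List<String> takeExternalSnapshot() {
        return new ArrayList<>(internalSnapshots);
    }

    // Restoring replaces the volume's state with the saved copy, so anything
    // stored in the volume after the external snapshot is gone.
    void restoreExternalSnapshot(List<String> saved) {
        internalSnapshots = new ArrayList<>(saved);
    }

    List<String> listInternalSnapshots() {
        return new ArrayList<>(internalSnapshots);
    }
}
```

In this model, taking `vm-snap-1` internally, then an external snapshot, then `vm-snap-2` internally, and finally restoring the external snapshot leaves only `vm-snap-1` in the volume.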

Thus, this PR blocks the use of disk-and-memory VM snapshots alongside volume snapshots, NAS backups, and disk-only VM snapshots (at least the ones created using the default volume snapshot implementation).
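A minimal sketch of the kind of mutual-exclusion check described here, assuming the classification given in this description (disk-and-memory VM snapshots are internal, the other three are external). `SnapshotKind` and `SnapshotGuard` are hypothetical names for illustration; they are not the classes changed by this PR.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical classification, taken from the PR description only.
enum SnapshotKind {
    DISK_AND_MEMORY_VM_SNAPSHOT(true),  // internal: stored inside the volume
    DISK_ONLY_VM_SNAPSHOT(false),       // external
    VOLUME_SNAPSHOT(false),             // external
    NAS_BACKUP(false);                  // external

    final boolean internal;

    SnapshotKind(boolean internal) {
        this.internal = internal;
    }
}

final class SnapshotGuard {
    private SnapshotGuard() {
    }

    // Rejects the request if the VM already has a snapshot or backup of the
    // other type (internal vs. external); same-type combinations are allowed.
    static void validate(Set<SnapshotKind> existing, SnapshotKind requested) {
        for (SnapshotKind kind : existing) {
            if (kind.internal != requested.internal) {
                throw new IllegalStateException(
                        "Cannot create " + requested + " while " + kind + " exists");
            }
        }
    }
}
```

For example, `SnapshotGuard.validate(EnumSet.of(SnapshotKind.VOLUME_SNAPSHOT), SnapshotKind.NAS_BACKUP)` returns normally, while requesting `VOLUME_SNAPSHOT` when a `DISK_AND_MEMORY_VM_SNAPSHOT` already exists throws `IllegalStateException`. The check is symmetric, so the reverse order is rejected as well.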

I encourage maintainers of third-party storage providers to test whether their implementations are compatible with disk-and-memory VM snapshots; if they are not, their simultaneous usage should be blocked.

Types of changes

  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] New feature (non-breaking change which adds functionality)
  • [X] Bug fix (non-breaking change which fixes an issue)
  • [ ] Enhancement (improves an existing feature and functionality)
  • [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
  • [ ] build/CI
  • [ ] test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • [ ] Major
  • [ ] Minor

Bug Severity

  • [ ] BLOCKER
  • [ ] Critical
  • [X] Major
  • [ ] Minor
  • [ ] Trivial

Screenshots (if appropriate):

How Has This Been Tested?

I created a VM and took a few disk-and-memory VM snapshots of it. Then I tried to create NAS backups, volume snapshots and disk-only VM snapshots; all of them failed with an error, as expected.

I validated that the opposite order is also blocked for the aforementioned cases, e.g., creating a volume snapshot and then trying to create a disk-and-memory VM snapshot.

I also validated that it was possible to create multiple NAS backups, disk-only VM snapshots and volume snapshots with no issues.

JoaoJandre avatar Jun 16 '25 18:06 JoaoJandre

@slavkap @rp- I think it would be interesting to validate if the implementations done for Storpool and Linstor are compatible with disk-and-memory VM snapshots.

JoaoJandre avatar Jun 16 '25 18:06 JoaoJandre

Codecov Report

:x: Patch coverage is 0% with 27 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 17.56%. Comparing base (6dc259c) to head (3095fb1). :warning: Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
.../storage/vmsnapshot/DefaultVMSnapshotStrategy.java 0.00% 15 Missing :warning:
...rg/apache/cloudstack/backup/NASBackupProvider.java 0.00% 5 Missing and 1 partial :warning:
...a/org/apache/cloudstack/backup/BackupProvider.java 0.00% 3 Missing :warning:
...tack/storage/snapshot/DefaultSnapshotStrategy.java 0.00% 3 Missing :warning:
Additional details and impacted files
@@            Coverage Diff            @@
##               main   #11039   +/-   ##
=========================================
  Coverage     17.55%   17.56%           
- Complexity    15535    15537    +2     
=========================================
  Files          5911     5912    +1     
  Lines        529359   529383   +24     
  Branches      64655    64660    +5     
=========================================
+ Hits          92949    92980   +31     
+ Misses       425952   425942   -10     
- Partials      10458    10461    +3     
Flag Coverage Δ
uitests 3.58% <ø> (ø)
unittests 18.63% <0.00%> (+<0.01%) :arrow_up:

codecov[bot] avatar Jun 16 '25 18:06 codecov[bot]

@blueorangutan package

JoaoJandre avatar Jun 16 '25 18:06 JoaoJandre

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Jun 16 '25 18:06 blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 13798

blueorangutan avatar Jun 16 '25 19:06 blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 13809

blueorangutan avatar Jun 17 '25 07:06 blueorangutan

@JoaoJandre :

08:35:48 [ERROR] /jenkins/workspace/acs-centos8-pkg-builder/dist/rpmbuild/BUILD/cloudstack-4.20.2.0-SNAPSHOT/engine/storage/snapshot/src/test/java/org/apache/cloudstack/storage/vmsnapshot/VMSnapshotStrategyKVMTest.java:32:8: Unused import - org.apache.cloudstack.backup.dao.BackupDao. [UnusedImports]

DaanHoogland avatar Jun 17 '25 08:06 DaanHoogland

@blueorangutan package

JoaoJandre avatar Jun 17 '25 16:06 JoaoJandre

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Jun 17 '25 16:06 blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 13818

blueorangutan avatar Jun 17 '25 17:06 blueorangutan

@blueorangutan package

DaanHoogland avatar Jun 18 '25 07:06 DaanHoogland

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Jun 18 '25 07:06 blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13826

blueorangutan avatar Jun 18 '25 08:06 blueorangutan

Linstor currently does not support memory snapshots (we check and throw an error if one is selected). So I guess we are currently not affected by any of this?

rp- avatar Jul 03 '25 08:07 rp-

@blueorangutan test

DaanHoogland avatar Jul 08 '25 15:07 DaanHoogland

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

blueorangutan avatar Jul 08 '25 15:07 blueorangutan

[SF] Trillian test result (tid-13723)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 53428 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr11039-t13723-kvm-ol8.zip
Smoke tests completed. 141 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

blueorangutan avatar Jul 09 '25 06:07 blueorangutan

@JoaoJandre this seems to be included in PR #10632. Do you still want it in 4.20.2?

weizhouapache avatar Aug 28 '25 13:08 weizhouapache

@weizhouapache PR #10632 blocks the use of the feature it introduces alongside other incompatible features. This PR purposefully ignores #10632 and adds restrictions to avoid other interactions between internal and external snapshots, such as between volume snapshots and disk-and-memory VM snapshots.

They are complementary. When merging this PR forward, care should be taken so that the validations of both PRs do not erase one another (I can make the merge forward if needed).

JoaoJandre avatar Aug 28 '25 13:08 JoaoJandre

ok @JoaoJandre I think the best option might be to re-target this PR to 4.22, which includes #10632, to avoid re-work.

weizhouapache avatar Aug 28 '25 13:08 weizhouapache

aren't we talking about 4.20.2, @weizhouapache?

DaanHoogland avatar Aug 29 '25 07:08 DaanHoogland

sorry, I meant 4.22, not 4.21

if we merge into 4.20.2, the merge forward to 4.22 will be troublesome, as @JoaoJandre mentioned, unless we skip this PR in the merge forward and @JoaoJandre creates another PR against 4.22 (which needs re-review and re-testing)

weizhouapache avatar Aug 29 '25 07:08 weizhouapache

@DaanHoogland @weizhouapache I rebased the changes so now I'm targeting main.

JoaoJandre avatar Sep 01 '25 18:09 JoaoJandre

@JoaoJandre thanks for the update, overall LGTM. left a small comment

weizhouapache avatar Sep 01 '25 18:09 weizhouapache

@JoaoJandre this is ready to ship? (asking because I see scattered test reports and am not sure of completeness)

DaanHoogland avatar Sep 08 '25 10:09 DaanHoogland

@DaanHoogland a new round of tests would be good since I rebased from 4.20 to main.

JoaoJandre avatar Sep 08 '25 11:09 JoaoJandre

@blueorangutan package

JoaoJandre avatar Sep 08 '25 11:09 JoaoJandre

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Sep 08 '25 11:09 blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 14898

blueorangutan avatar Sep 08 '25 12:09 blueorangutan

@blueorangutan package

JoaoJandre avatar Sep 08 '25 14:09 JoaoJandre