File-based disk-only VM snapshot with KVM as hypervisor

Open JoaoJandre opened this issue 9 months ago • 38 comments

Description

This PR implements the spec available at #9524. For more information regarding it, please read the spec.

Furthermore, the following changes, which are not covered by the spec, were added:

  1. The snapshot.merge.timeout agent property was added. It is only considered if libvirt.events.enabled is true;
  2. A new snapshot merge process was created; it affects both regular volume snapshots and this feature. When libvirt.events.enabled is true, ACS registers for Libvirt events and uses them to track the merge, reporting its progress in the logs. When the property is false, the old process is used;
  3. Volumes attached to VMs that have file-based disk-only VM snapshots on KVM can now be resized.
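As a rough illustration of items 1 and 2, the choice between the two merge processes can be sketched as below. This is a minimal Python sketch, not the Java code in this PR: the two property names come from the description above, while `parse_properties`, `choose_merge_process`, the `"event-based"` and `"legacy"` labels, and the assumption that a missing libvirt.events.enabled is treated as false are all hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines, ignoring blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def choose_merge_process(props):
    """Return (process, timeout); the timeout is only read when events are enabled."""
    # Assumption: a missing libvirt.events.enabled is treated as false.
    events_enabled = props.get("libvirt.events.enabled", "false").lower() == "true"
    if not events_enabled:
        # Old process: snapshot.merge.timeout is not considered at all.
        return ("legacy", None)
    raw_timeout = props.get("snapshot.merge.timeout")
    timeout = int(raw_timeout) if raw_timeout is not None else None
    return ("event-based", timeout)


# Hypothetical agent.properties excerpt, used only for this illustration.
example = """
libvirt.events.enabled=true
snapshot.merge.timeout=0
"""
print(choose_merge_process(parse_properties(example)))
```

The point of the sketch is only the dependency between the two properties: the timeout is ignored whenever the event-based process is disabled.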

Types of changes

  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [X] New feature (non-breaking change which adds functionality)
  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] Enhancement (improves an existing feature and functionality)
  • [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
  • [ ] build/CI
  • [ ] test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • [X] Major
  • [ ] Minor

Bug Severity

  • [ ] BLOCKER
  • [ ] Critical
  • [ ] Major
  • [ ] Minor
  • [ ] Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Basic Tests

I created a test VM to carry out the tests below. After each relevant operation, I checked the VM's XML and the storage to confirm that the snapshots existed.

Snapshot Creation

The tests below were also repeated with the VM stopped.

N | Test | Result
1 | Take a snapshot of VM 1 without specifying quiesceVM | Snapshot created
2 | Take a snapshot of VM 2 specifying quiesceVM | Snapshot created

Snapshot Reversion

N | Test | Result
1 | Revert VM in Running state to any snapshot | Error thrown
2 | Revert VM in Stopped state to snapshot 1 and start it | VM reverted and started successfully

Snapshot Removal

N | Test | Result
1 | Create a new snapshot 3 after the second reversion test and delete snapshot 1 | Snapshot no longer listed and database metadata correct; the file still existed because more than one delta depended on it
2 | Delete snapshot 2 | Snapshot deleted; snapshot 1 was merged with snapshot 3, since snapshot 3 was its only remaining dependent
3 | Delete snapshot 3 (current) | Snapshot removed, merged with the VM's volume
4 | Create 3 snapshots and remove the first one | Snapshot removed, merged with the second snapshot
5 | Create two snapshots, revert to the first, and delete the second | Snapshot deleted

Advanced Tests

Deletion Test

All tests were carried out with the VM stopped.

  1. I created 3 snapshots: s1, s2, and s3.
  2. I reverted the VM to snapshot s2.
  3. I created snapshot s4.
  4. I removed snapshot s2.

The snapshot was marked as hidden and was not removed from storage.

  5. I removed snapshot s3.

Snapshot s3 was removed normally. Snapshot s2 was merged with snapshot s4.

  6. I created snapshot s5.
  7. I reverted to snapshot s4.
  8. I removed snapshot s4.

Snapshot s4 was marked as hidden and was not removed from storage.

  9. I removed snapshot s5.

Snapshot s5 was removed normally. Snapshot s4 was merged with the delta of the VM's volume.

  10. I removed the last remaining snapshot (s1). It was removed normally.
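The behaviour observed in this deletion test can be reproduced with a small model of the delta tree. The sketch below is an assumption inferred from the test results, not the strategy implemented in the PR: `SnapshotTree`, its method names, and the three rules (hide a snapshot that has more than one dependent delta, merge it into its only dependent otherwise, and collapse a hidden parent once a single dependent is left) are all hypothetical.

```python
class SnapshotTree:
    """Toy model: each delta points at the delta it is backed by."""

    ACTIVE = "active"  # the delta the VM is currently writing to

    def __init__(self):
        self.parent = {self.ACTIVE: None}
        self.hidden = set()

    def children(self, name):
        return [c for c, p in self.parent.items() if p == name]

    def take(self, name):
        # The active delta becomes the snapshot; a fresh active delta goes on top.
        self.parent[name] = self.parent.pop(self.ACTIVE)
        self.parent[self.ACTIVE] = name

    def revert(self, name):
        # The active delta is discarded and a new one is created on the snapshot.
        self.parent[self.ACTIVE] = name

    def _merge_into_child(self, name, child):
        self.parent[child] = self.parent.pop(name)
        self.hidden.discard(name)

    def delete(self, name):
        kids = self.children(name)
        if len(kids) > 1:
            # More than one delta depends on it: keep the file, hide the snapshot.
            self.hidden.add(name)
            return "hidden"
        if len(kids) == 1:
            self._merge_into_child(name, kids[0])
            return "merged into " + kids[0]
        backing = self.parent.pop(name)
        # A hidden parent left with a single dependent can now be merged into it.
        if backing in self.hidden and len(self.children(backing)) == 1:
            self._merge_into_child(backing, self.children(backing)[0])
        return "removed"


def replay_deletion_test():
    tree = SnapshotTree()
    for name in ("s1", "s2", "s3"):
        tree.take(name)
    tree.revert("s2")
    tree.take("s4")
    results = [tree.delete("s2"), tree.delete("s3")]
    tree.take("s5")
    tree.revert("s4")
    results += [tree.delete("s4"), tree.delete("s5"), tree.delete("s1")]
    return tree, results


tree, results = replay_deletion_test()
print(results)
print(tree.parent)
```

In this model, removing the last snapshot (s1) counts as a merge into the VM's active delta, which matches the "merged with the VM's volume" result reported in the basic removal tests. The model does not cover the reversion test below.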

Reversion Test

  1. I created two snapshots: s1 and s2.
  2. I reverted to snapshot s1.
  3. I removed snapshot s1.

Snapshot s1 was marked as hidden and was not removed from storage.

  4. I reverted to snapshot s2. Snapshot s1 was merged with the base volume.

Concurrent Test

I created 4 VMs and took a VM snapshot of each. Then I requested the removal of all four snapshots at the same time. All snapshots were removed concurrently and successfully.

Test with Multiple Volumes

I created a VM with one datadisk and attached 8 more datadisks (10 volumes in total, counting the root volume), took two VM snapshots, and then removed the snapshots one at a time. The snapshots were removed successfully.

Tests Changing the snapshot.merge.timeout Config

  1. I set snapshot.merge.timeout to 1 and restarted the host;
  2. I created a VM, took a VM snapshot, accessed the VM, and wrote 4GB of data to it;
  3. I tried to remove the snapshot and an error occurred; the logs showed that the merge had timed out;
  4. I manually aborted the blockcommit process;
  5. I set snapshot.merge.timeout to 0 and restarted the host;
  6. I tried to remove the snapshot again, and the removal completed correctly.
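The steps above suggest that a value of 0 lets the merge run without a time limit. A minimal Python sketch of that waiting logic follows; it is an assumption drawn from this test, not the agent's actual code. `wait_for_merge` and `simulate` are hypothetical names, the unit is taken to be seconds only for the sake of the example, and a `threading.Event` stands in for the Libvirt completion event.

```python
import threading


def wait_for_merge(done, timeout):
    """Block until the merge-completed event fires or the timeout expires.

    Assumption: a timeout of 0 means "wait without a limit".
    Returns True if the merge finished, False if it timed out.
    """
    limit = None if timeout == 0 else timeout
    return done.wait(limit)


def simulate(merge_duration, timeout):
    """Simulate a merge that signals completion after merge_duration seconds."""
    done = threading.Event()
    timer = threading.Timer(merge_duration, done.set)
    timer.start()
    try:
        return wait_for_merge(done, timeout)
    finally:
        # Stop the simulated merge if it is still pending after a timeout.
        timer.cancel()


print(simulate(merge_duration=0.3, timeout=0.05))  # merge slower than the limit
print(simulate(merge_duration=0.05, timeout=0))    # no limit configured
```

In the sketch, a timed-out wait only stops waiting; as step 4 shows, the underlying blockcommit job in the real test still had to be aborted manually.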

Tests Related to Volume Resize with Disk-Only VM Snapshots on KVM

Test | Result | Expected?
Create a VM, take a snapshot, resize the volume | Resize performed successfully, both in metadata and when checked with qemu-img info | Y
Stop the VM and revert the snapshot | Revert performed successfully, volume size returned to original, both in metadata and qemu-img info | Y
Remove the snapshot with the VM stopped | The delta of the volume was correctly merged with the snapshot's, and the final size was that of the volume | Y
Start the VM, take a new snapshot, resize the volume, and remove the snapshot | Deltas were correctly merged, and the final size was that of the volume | Y

The last two tests were repeated on a VM with several snapshots, so that a merge between snapshots was performed. The result was the same.

Tests Related to Events:

  1. Create a VM, take a disk-only VM snapshot, increase the root volume size by 1GB, stop the VM, and revert to the snapshot. The cloud.usage_event table showed that the resize event was correctly triggered, and the GUI showed that the account's resource limit was updated.
  2. Repeat the test above on a VM with two volumes, only one of which was resized. The result was the same, and only one resize event was triggered, for the volume that had been resized.
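The rule these two tests exercise, one resize event per volume whose size actually changes on revert, can be sketched as follows. This is an illustrative Python sketch under assumptions: `resize_events_on_revert`, the dictionary layout, the volume names, and the `"VOLUME.RESIZE"` label used in the output are hypothetical stand-ins, not the code or schema used by the PR.

```python
GIB = 1024 ** 3


def resize_events_on_revert(current_sizes, snapshot_sizes):
    """Return one event per volume whose size differs from its size at snapshot time.

    Both arguments map a volume identifier to a size in bytes.
    """
    events = []
    for volume_id, size_now in current_sizes.items():
        # A volume missing from the snapshot is treated as unchanged (assumption).
        size_at_snapshot = snapshot_sizes.get(volume_id, size_now)
        if size_at_snapshot != size_now:
            events.append({
                "type": "VOLUME.RESIZE",  # illustrative label
                "volume": volume_id,
                "size": size_at_snapshot,
            })
    return events


# Second test above: two volumes, only the root volume was grown by 1 GiB.
current = {"root": 9 * GIB, "data": 5 * GIB}
at_snapshot = {"root": 8 * GIB, "data": 5 * GIB}
print(resize_events_on_revert(current, at_snapshot))
```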

JoaoJandre avatar Mar 28 '25 13:03 JoaoJandre

@blueorangutan package

JoaoJandre avatar Mar 28 '25 14:03 JoaoJandre

Codecov Report

Attention: Patch coverage is 9.41402% with 943 lines in your changes missing coverage. Please review.

Project coverage is 16.56%. Comparing base (6fdaf51) to head (44868f6). Report is 140 commits behind head on main.

Files with missing lines | Patch % | Lines
...napshot/KvmFileBasedStorageVmSnapshotStrategy.java | 0.24% | 414 Missing and 1 partial
...LibvirtCreateDiskOnlyVMSnapshotCommandWrapper.java | 1.04% | 95 Missing
...ervisor/kvm/resource/LibvirtComputingResource.java | 43.93% | 71 Missing and 3 partials
.../LibvirtMergeDiskOnlyVMSnapshotCommandWrapper.java | 1.38% | 71 Missing
...LibvirtRevertDiskOnlyVMSnapshotCommandWrapper.java | 1.96% | 50 Missing
...d/hypervisor/kvm/resource/BlockCommitListener.java | 28.12% | 22 Missing and 1 partial
...m/cloud/agent/api/storage/SnapshotMergeTreeTO.java | 0.00% | 21 Missing
...tack/storage/snapshot/DefaultSnapshotStrategy.java | 0.00% | 19 Missing
...java/org/apache/cloudstack/utils/qemu/QemuImg.java | 0.00% | 19 Missing
...nt/api/storage/MergeDiskOnlyVmSnapshotCommand.java | 0.00% | 18 Missing
... and 19 more
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #10632      +/-   ##
============================================
+ Coverage     16.41%   16.56%   +0.15%     
- Complexity    13629    14008     +379     
============================================
  Files          5702     5758      +56     
  Lines        503405   511750    +8345     
  Branches      60976    62242    +1266     
============================================
+ Hits          82626    84774    +2148     
- Misses       411594   417506    +5912     
- Partials       9185     9470     +285     
Flag | Coverage Δ
uitests | 3.91% <ø> (-0.09%)
unittests | 17.46% <9.41%> (+0.18%)

codecov[bot] avatar Mar 28 '25 14:03 codecov[bot]

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

github-actions[bot] avatar Apr 28 '25 08:04 github-actions[bot]

@blueorangutan package

JoaoJandre avatar Apr 28 '25 16:04 JoaoJandre

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Apr 28 '25 16:04 blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13204

blueorangutan avatar Apr 28 '25 17:04 blueorangutan

@rohityadavcloud @sureshanaparti @weizhouapache could we run the CI?

JoaoJandre avatar Apr 28 '25 18:04 JoaoJandre

@blueorangutan test

DaanHoogland avatar Apr 29 '25 07:04 DaanHoogland

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

blueorangutan avatar Apr 29 '25 07:04 blueorangutan

[SF] Trillian test result (tid-13177)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 54050 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10632-t13177-kvm-ol8.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test | Result | Time (s) | Test File
test_02_restore_vm_strict_tags_failure | Failure | 53.35 | test_vm_strict_host_tags.py
test_02_scale_vm_strict_tags_failure | Failure | 54.75 | test_vm_strict_host_tags.py
test_06_deploy_vm_on_any_host_with_strict_tags_failure | Failure | 4.69 | test_vm_strict_host_tags.py

blueorangutan avatar Apr 29 '25 22:04 blueorangutan

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

github-actions[bot] avatar May 02 '25 10:05 github-actions[bot]

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

github-actions[bot] avatar May 12 '25 14:05 github-actions[bot]

@blueorangutan package

JoaoJandre avatar May 12 '25 16:05 JoaoJandre

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar May 12 '25 16:05 blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13350

blueorangutan avatar May 12 '25 17:05 blueorangutan

@DaanHoogland could we rerun the CI here?

JoaoJandre avatar May 13 '25 20:05 JoaoJandre

@blueorangutan test

DaanHoogland avatar May 14 '25 05:05 DaanHoogland

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

blueorangutan avatar May 14 '25 06:05 blueorangutan

[SF] Trillian test result (tid-13301)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 59704 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10632-t13301-kvm-ol8.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test | Result | Time (s) | Test File
test_01_deployVMInSharedNetwork | Failure | 67.34 | test_network.py
ContextSuite context=TestSharedNetworkWithConfigDrive>:teardown | Error | 68.45 | test_network.py

blueorangutan avatar May 14 '25 23:05 blueorangutan

@blueorangutan package

rohityadavcloud avatar Jun 10 '25 08:06 rohityadavcloud

@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Jun 10 '25 08:06 blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13686

blueorangutan avatar Jun 10 '25 10:06 blueorangutan

@blueorangutan package

JoaoJandre avatar Jun 11 '25 12:06 JoaoJandre

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Jun 11 '25 12:06 blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 13710

blueorangutan avatar Jun 11 '25 13:06 blueorangutan

@blueorangutan package

JoaoJandre avatar Jun 11 '25 16:06 JoaoJandre

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Jun 11 '25 16:06 blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 13715

blueorangutan avatar Jun 11 '25 17:06 blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✖️ debian ✔️ suse15. SL-JID 13740

blueorangutan avatar Jun 12 '25 09:06 blueorangutan

@blueorangutan package

JoaoJandre avatar Jun 13 '25 11:06 JoaoJandre