cloudstack
Fix delete parent snapshot
Description
ACS + XenServer works with differential snapshots. ACS takes a full snapshot of the volume, and each subsequent snapshot is referenced as a child of the previous one until the chain reaches the limit defined in the global setting `snapshot.delta.max`; then, a new full snapshot is taken. PR #5297 introduced disk-only snapshots for KVM volumes. Among the changes, the delete process was also refactored. Before those changes, when a snapshot with children was removed, ACS marked it as `Destroyed` and kept the `Image` entry in the table `cloud.snapshot_store_ref` as `Ready`. When ACS rotated the snapshots (the max delta was reached) and all the children were already marked as removed, ACS would start removing the whole hierarchy, completing the differential snapshot cycle. After the changes, snapshots with children stopped being marked as removed and the differential snapshot cycle was no longer completed.
This PR intends to honor the differential snapshot cycle for XenServer again: a snapshot that is deleted while it still has children is marked as removed, and the differential snapshot cycle is followed.
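
The cycle described above can be illustrated with a toy in-memory model. This is only a sketch under assumptions, not the logic in `DefaultSnapshotStrategy`: the `Snapshot` class and the helpers `take_snapshot` and `delete_snapshot` are hypothetical names, `delta_max` stands in for `snapshot.delta.max`, and the chain is assumed to hold at most `delta_max` snapshots including the full one (ACS may count differently).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    """Hypothetical stand-in for a snapshot record; not a CloudStack class."""
    id: int
    parent: Optional["Snapshot"] = None
    destroyed: bool = False  # deletion was requested (state Destroyed)
    removed: bool = False    # data was actually removed from storage


def chain_length(snap):
    """Count the snapshots from the full snapshot down to `snap`, inclusive."""
    length = 0
    while snap is not None:
        length += 1
        snap = snap.parent
    return length


def take_snapshot(previous, next_id, delta_max):
    """Start a new full snapshot when the chain is full, else add a child."""
    if previous is None or chain_length(previous) >= delta_max:
        return Snapshot(id=next_id)
    return Snapshot(id=next_id, parent=previous)


def has_live_children(snap, all_snaps):
    """True while some child of `snap` still has data on storage."""
    return any(s.parent is snap and not s.removed for s in all_snaps)


def delete_snapshot(snap, all_snaps):
    """Mark `snap` as Destroyed; remove it, and any Destroyed ancestors,
    only once nothing depends on them any more."""
    snap.destroyed = True
    current = snap
    while (current is not None and current.destroyed
           and not has_live_children(current, all_snaps)):
        current.removed = True
        current = current.parent


snaps = []
previous = None
for snap_id in (1, 2, 3):
    previous = take_snapshot(previous, snap_id, delta_max=2)
    snaps.append(previous)
s1, s2, s3 = snaps

delete_snapshot(s1, snaps)
print(s1.destroyed, s1.removed)            # True False
delete_snapshot(s2, snaps)
print(s1.removed, s2.removed, s3.removed)  # True True False
```

In this model, deleting the parent first only flags it; the data goes away when its last child is deleted, which is the point at which the hierarchy is cleaned up.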
Also, when one takes a volume snapshot and ACS backs it up to the secondary storage, ACS inserts 2 entries in the table `cloud.snapshot_store_ref` (`Primary` and `Image`). When one deletes a volume snapshot, ACS first tries to remove the snapshot from the secondary storage and mark the `Image` entry as removed; then, it tries to remove the snapshot from the primary storage and mark the `Primary` entry as removed. If ACS cannot remove the snapshot from the primary storage, it keeps the snapshot as `BackedUp`, even though the snapshot no longer exists in the secondary storage, and it does not create the `SNAPSHOT.DELETE` entry in `cloud.usage_event`. In the end, after the garbage collector flow, the snapshot is left marked as `BackedUp`, with a value in the field `removed`, and is still being rated. This PR also corrects this situation.
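
The intended outcome can be sketched as follows. This is a simplified illustration, not the code of this PR: the function `delete_snapshot`, the `remove_from_store` callback, and the dictionary layout are hypothetical. Only the role names `Image` and `Primary`, the state `Destroyed`, and the event `SNAPSHOT.DELETE` come from the description. The sketch assumes that a failed storage removal is left for a later garbage-collector pass while the usage event is still recorded.

```python
def delete_snapshot(snapshot, remove_from_store):
    """Toy deletion flow: try Image (secondary) first, then Primary.

    `snapshot` is a dict with the keys "state", "store_refs" and
    "usage_events"; `remove_from_store(role)` returns True on success.
    Both are hypothetical and exist only for this illustration.
    """
    for role in ("Image", "Primary"):
        if remove_from_store(role):
            snapshot["store_refs"][role] = "removed"
        # On failure the entry is left untouched for a later cleanup pass.

    # Assumption: the snapshot stops being rated once the user deletes it,
    # whether or not every store could be cleaned up immediately.
    snapshot["state"] = "Destroyed"
    snapshot["usage_events"].append("SNAPSHOT.DELETE")
    return snapshot


snapshot = {
    "state": "BackedUp",
    "store_refs": {"Image": "Ready", "Primary": "Ready"},
    "usage_events": [],
}
# Simulate a primary storage that refuses the removal.
result = delete_snapshot(snapshot, lambda role: role == "Image")
print(result["state"], result["store_refs"], result["usage_events"])
```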
Types of changes
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] Enhancement (improves an existing feature and functionality)
- [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
Feature/Enhancement Scale or Bug Severity
Bug Severity
- [ ] BLOCKER
- [ ] Critical
- [ ] Major
- [x] Minor
- [ ] Trivial
How Has This Been Tested?
The situation was observed in XenServer environments; however, due to some internal circumstances, I had to reproduce the situation in a KVM environment (considering 2 as max deltas).
I created a VM and scheduled an hourly snapshot for the `ROOT` volume, retaining 2 snapshots. After ACS took the first two snapshots (and before it took the third one), I manually changed the database to set the ID of the first snapshot as the parent of the second, to simulate the XenServer differential snapshot. After the third snapshot was generated, the first one was marked as `Destroyed`, ACS generated the entry `SNAPSHOT.DELETE` in `cloud.usage_event`, and the entries `Primary` and `Image` ended up as `Destroyed` and `Ready`, respectively. After the fourth snapshot was generated, ACS identified that the second one was the last in the hierarchy and started removing the hierarchy. In the end, the entries for the first and second snapshots were marked as removed and only the last 2 snapshots had entries in the `Ready` state.
I also forced errors in the deletion of the snapshot in the primary and secondary storage. At the end of both tests, ACS inserted the `SNAPSHOT.DELETE` entries in `cloud.usage_event` and the garbage collector removed the entries from `cloud.snapshot_store_ref`.
Codecov Report
Merging #6630 (5e268ce) into main (d3ec27d) will increase coverage by 0.00%. The diff coverage is 100.00%.
@@ Coverage Diff @@
## main #6630 +/- ##
=========================================
Coverage 5.87% 5.88%
- Complexity 3933 3940 +7
=========================================
Files 2454 2454
Lines 242557 242597 +40
Branches 37965 37974 +9
=========================================
+ Hits 14246 14271 +25
- Misses 226735 226750 +15
Partials 1576 1576
Impacted Files | Coverage Δ | |
---|---|---|
...tack/storage/snapshot/DefaultSnapshotStrategy.java | 23.98% <100.00%> (+7.46%) | :arrow_up: |
...ava/com/cloud/upgrade/dao/Upgrade41700to41710.java | 4.76% <0.00%> (-3.24%) | :arrow_down: |
...ervisor/kvm/resource/LibvirtComputingResource.java | 15.97% <0.00%> (-0.02%) | :arrow_down: |
...tastore/driver/StorPoolPrimaryDataStoreDriver.java | 0.00% <0.00%> (ø) | |
...rc/main/java/com/cloud/storage/VolumeDetailVO.java | 21.42% <0.00%> (+21.42%) | :arrow_up: |
@blueorangutan package
@DaanHoogland a Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result: :heavy_check_mark: el7 :heavy_check_mark: el8 :heavy_check_mark: debian :heavy_check_mark: suse15. SL-JID 3970
@blueorangutan test matrix
@DaanHoogland a Trillian-Jenkins matrix job (centos7 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests
Trillian test result (tid-4689)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 41592 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6630-t4689-kvm-centos7.zip
Smoke tests completed. 101 look OK, 0 have errors
Only failed tests results shown below:
Test | Result | Time (s) | Test File |
---|---|---|---|
Trillian test result (tid-4688)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42578 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6630-t4688-xenserver-71.zip
Smoke tests completed. 100 look OK, 1 have errors
Only failed tests results shown below:
Test | Result | Time (s) | Test File |
---|---|---|---|
test_02_cancel_host_maintenace_with_migration_jobs | Error | 1512.23 | test_host_maintenance.py |
Trillian test result (tid-4690)
Environment: vmware-65u2 (x2), Advanced Networking with Mgmt server 7
Total time taken: 44395 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6630-t4690-vmware-65u2.zip
Smoke tests completed. 100 look OK, 1 have errors
Only failed tests results shown below:
Test | Result | Time (s) | Test File |
---|---|---|---|
test_08_upgrade_kubernetes_ha_cluster | Failure | 636.31 | test_kubernetes_clusters.py |
@blueorangutan test centos7 xenserver-71
@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + xenserver-71) has been kicked to run smoke tests
Trillian test result (tid-4704)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 36187 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6630-t4704-xenserver-71.zip
Smoke tests completed. 101 look OK, 0 have errors
Only failed tests results shown below:
Test | Result | Time (s) | Test File |
---|---|---|---|
Thanks @GutoVeronezi, would this need extra marvin tests?
@nvazquez, as tests passed with KVM, Xen and VMware, maybe there is no need for more marvin tests; however, some "monkey testing" would be nice, as @DaanHoogland suggested.
Hello guys, any update about this one?
@GutoVeronezi I'll give it a quick spin. I think we can merge this otherwise. @blueorangutan test keepEnv
@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
Trillian test result (tid-5106)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 37458 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6630-t5106-kvm-centos7.zip
Smoke tests completed. 101 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:
Test | Result | Time (s) | Test File |
---|---|---|---|
Trillian test result (tid-5115)
Environment: xenserver-74 (x2), Advanced Networking with Mgmt server 7
Total time taken: 44100 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6630-t5115-xenserver-74.zip
Smoke tests completed. 100 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:
Test | Result | Time (s) | Test File |
---|---|---|---|
test_13_migrate_volume_and_change_offering | Error | 7.41 | test_volumes.py |
As a matter of formality: tested on xenserver-7.4 with no visible problems from the user side. After six hourly backups there are six entries in the DB, four with physical size 0 and only one marked as primary. Events:
|id |type |account_id|created |zone_id|resource_id|resource_name |offering_id|template_id|size |resource_type|processed|virtual_size|
|-----|--------------------|----------|-----------------------|-------|-----------|---------------------------|-----------|-----------|----------|-------------|---------|------------|
|2,612|VM.START |2 |2022-10-14 08:42:38.000|1 |349 |vm1 |1 |287 | |XenServer |0 | |
|2,613|SNAPSHOT.ON_PRIMARY |2 |2022-10-14 08:54:15.000|1 |7 |vm1_ROOT-349_20221014085410| | |52,428,800| |0 |52,428,800 |
|2,614|SNAPSHOT.CREATE |2 |2022-10-14 08:54:25.000|1 |7 |vm1_ROOT-349_20221014085410| | |52,531,712| |0 |52,428,800 |
|2,615|SNAPSHOT.ON_PRIMARY |2 |2022-10-14 09:54:15.000|1 |8 |vm1_ROOT-349_20221014095410| | |52,428,800| |0 |52,428,800 |
|2,616|SNAPSHOT.OFF_PRIMARY|2 |2022-10-14 09:54:16.000|1 |7 |vm1_ROOT-349_20221014085410| | |0 | |0 |0 |
|2,617|SNAPSHOT.CREATE |2 |2022-10-14 09:54:16.000|1 |8 |vm1_ROOT-349_20221014095410| | |0 | |0 |52,428,800 |
|2,618|SNAPSHOT.ON_PRIMARY |2 |2022-10-14 10:54:15.000|1 |9 |vm1_ROOT-349_20221014105410| | |52,428,800| |0 |52,428,800 |
|2,619|SNAPSHOT.OFF_PRIMARY|2 |2022-10-14 10:54:16.000|1 |8 |vm1_ROOT-349_20221014095410| | |0 | |0 |0 |
|2,620|SNAPSHOT.DELETE |2 |2022-10-14 10:54:16.000|1 |7 |vm1_ROOT-349_20221014085410| | |0 | |0 | |
|2,621|SNAPSHOT.CREATE |2 |2022-10-14 10:54:16.000|1 |9 |vm1_ROOT-349_20221014105410| | |0 | |0 |52,428,800 |
|2,622|SNAPSHOT.ON_PRIMARY |2 |2022-10-14 11:54:16.000|1 |10 |vm1_ROOT-349_20221014115410| | |52,428,800| |0 |52,428,800 |
|2,623|SNAPSHOT.OFF_PRIMARY|2 |2022-10-14 11:54:16.000|1 |9 |vm1_ROOT-349_20221014105410| | |0 | |0 |0 |
|2,624|SNAPSHOT.DELETE |2 |2022-10-14 11:54:16.000|1 |8 |vm1_ROOT-349_20221014095410| | |0 | |0 | |
|2,625|SNAPSHOT.CREATE |2 |2022-10-14 11:54:16.000|1 |10 |vm1_ROOT-349_20221014115410| | |0 | |0 |52,428,800 |
|2,626|SNAPSHOT.ON_PRIMARY |2 |2022-10-14 12:54:16.000|1 |11 |vm1_ROOT-349_20221014125410| | |52,428,800| |0 |52,428,800 |
|2,627|SNAPSHOT.OFF_PRIMARY|2 |2022-10-14 12:54:16.000|1 |10 |vm1_ROOT-349_20221014115410| | |0 | |0 |0 |
|2,628|SNAPSHOT.DELETE |2 |2022-10-14 12:54:16.000|1 |9 |vm1_ROOT-349_20221014105410| | |0 | |0 | |
|2,629|SNAPSHOT.CREATE |2 |2022-10-14 12:54:16.000|1 |11 |vm1_ROOT-349_20221014125410| | |0 | |0 |52,428,800 |
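
One way to read a listing like this is to pair each `SNAPSHOT.CREATE` with a later `SNAPSHOT.DELETE` for the same `resource_id`. Below is a minimal sketch, assuming the events are available as `(type, resource_id)` tuples copied from the rows above. `live_snapshots` is a hypothetical helper and not part of ACS.

```python
def live_snapshots(events):
    """Return the resource ids that were created but never deleted, sorted."""
    live = set()
    for event_type, resource_id in events:
        if event_type == "SNAPSHOT.CREATE":
            live.add(resource_id)
        elif event_type == "SNAPSHOT.DELETE":
            live.discard(resource_id)
    return sorted(live)


# (type, resource_id) pairs from the SNAPSHOT.CREATE and SNAPSHOT.DELETE
# rows of the table above, in event order.
EVENTS = [
    ("SNAPSHOT.CREATE", 7),
    ("SNAPSHOT.CREATE", 8),
    ("SNAPSHOT.DELETE", 7),
    ("SNAPSHOT.CREATE", 9),
    ("SNAPSHOT.DELETE", 8),
    ("SNAPSHOT.CREATE", 10),
    ("SNAPSHOT.DELETE", 9),
    ("SNAPSHOT.CREATE", 11),
]

print(live_snapshots(EVENTS))  # [10, 11]
```

For the rows shown, snapshots 7, 8 and 9 each have a matching `SNAPSHOT.DELETE`, while 10 and 11 do not.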