Fix delete parent snapshot

Open GutoVeronezi opened this issue 2 years ago • 15 comments

Description

ACS + XenServer works with differential snapshots. ACS takes a full snapshot of the volume, and each subsequent snapshot is referenced as a child of the previous one until the chain reaches the limit defined in the global setting snapshot.delta.max; then, a new full snapshot is taken. PR #5297 introduced disk-only snapshots for KVM volumes. Among the changes, the delete process was also refactored. Before the changes, when one removed a snapshot with children, ACS marked it as Destroyed and kept the Image entry on the table cloud.snapshot_store_ref as Ready. When ACS rotated the snapshots (the max delta was reached) and all the children were already marked as removed, ACS would start removing the whole hierarchy, completing the differential snapshot cycle. After the changes, snapshots with children stopped being marked as removed and the differential snapshot cycle was no longer completed.

This PR intends to honor the differential snapshot cycle for XenServer again: snapshots that are deleted while they still have children are marked as removed, and the hierarchy is cleaned up following the differential snapshot cycle.
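
To make the cycle described above concrete, here is a minimal in-memory sketch. It is not the ACS implementation: the class and field names are hypothetical, and it assumes that `delta_max` (standing in for snapshot.delta.max) limits the chain length including the full snapshot. A deleted snapshot that still has children is only flagged as destroyed; the hierarchy is cleaned up once its last child is deleted.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Snapshot:
    id: int
    parent_id: Optional[int] = None  # None means a full snapshot
    destroyed: bool = False          # deleted by the user (state Destroyed)
    removed: bool = False            # actually cleaned up from the stores


class SnapshotChain:
    """Hypothetical model of the differential cycle; not ACS code."""

    def __init__(self, delta_max: int) -> None:
        self.delta_max = delta_max   # assumption: chain length limit
        self.snapshots: Dict[int, Snapshot] = {}
        self._last_id: Optional[int] = None
        self._chain_len = 0

    def take(self) -> Snapshot:
        # Start a new full snapshot when no chain exists or the limit is hit.
        if self._last_id is None or self._chain_len >= self.delta_max:
            parent_id, self._chain_len = None, 1
        else:
            parent_id, self._chain_len = self._last_id, self._chain_len + 1
        snap = Snapshot(id=len(self.snapshots) + 1, parent_id=parent_id)
        self.snapshots[snap.id] = snap
        self._last_id = snap.id
        return snap

    def _has_live_children(self, snap_id: int) -> bool:
        return any(s.parent_id == snap_id and not s.removed
                   for s in self.snapshots.values())

    def delete(self, snap_id: int) -> None:
        snap = self.snapshots[snap_id]
        snap.destroyed = True
        # Walk up the chain, removing every destroyed snapshot that no
        # longer has a child depending on it.
        current: Optional[Snapshot] = snap
        while (current is not None and current.destroyed
               and not self._has_live_children(current.id)):
            current.removed = True
            current = (self.snapshots.get(current.parent_id)
                       if current.parent_id is not None else None)
```

With `delta_max=2` and four snapshots, deleting snapshot 1 only marks it as destroyed because snapshot 2 still depends on it; deleting snapshot 2 afterwards removes both, which mirrors the test scenario described under "How Has This Been Tested?".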

Also, when one takes a volume snapshot and ACS backs it up to the secondary storage, ACS inserts 2 entries on the table cloud.snapshot_store_ref (Primary and Image). When one deletes a volume snapshot, ACS first tries to remove the snapshot from the secondary storage and mark the entry Image as removed; then, it tries to remove the snapshot from the primary storage and mark the entry Primary as removed. If ACS cannot remove the snapshot from the primary storage, it keeps the snapshot as BackedUp, even though the snapshot no longer exists in the secondary storage, and it does not create the entry SNAPSHOT.DELETE on cloud.usage_event. In the end, after the garbage collector flow, the snapshot is left marked as BackedUp, with a value in the field removed, and is still being rated. This PR also corrects this situation.
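
The inconsistent end state described above can be expressed as a simple check. The sketch below is only illustrative: `SnapshotRow` and `find_inconsistent` are hypothetical names, and the rows are plain Python objects rather than real database records. It flags snapshots that are still BackedUp, already have a value in removed, and have no SNAPSHOT.DELETE usage event.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Set


@dataclass
class SnapshotRow:
    # Hypothetical, simplified view of a snapshot record.
    id: int
    state: str                    # e.g. "BackedUp" or "Destroyed"
    removed: Optional[datetime]   # None while the row is not removed


def find_inconsistent(snapshots: Iterable[SnapshotRow],
                      delete_event_ids: Set[int]) -> List[int]:
    """Return IDs of snapshots left in the state this PR corrects.

    delete_event_ids holds the snapshot IDs that already have a
    SNAPSHOT.DELETE usage event.
    """
    return [
        s.id for s in snapshots
        if s.state == "BackedUp"
        and s.removed is not None
        and s.id not in delete_event_ids
    ]
```

A snapshot that is Destroyed with a SNAPSHOT.DELETE event, or BackedUp with removed still empty, is not reported.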

Types of changes

  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] New feature (non-breaking change which adds functionality)
  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] Enhancement (improves an existing feature and functionality)
  • [ ] Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Bug Severity

  • [ ] BLOCKER
  • [ ] Critical
  • [ ] Major
  • [x] Minor
  • [ ] Trivial

How Has This Been Tested?

The situation was observed in XenServer environments; however, due to some internal circumstances, I had to reproduce the situation in a KVM environment (considering 2 as max deltas).

I created a VM and scheduled an hourly snapshot for the ROOT volume, retaining 2 snapshots. After ACS took the first two snapshots (and before it took the third one), I manually changed the database to set the ID of the first snapshot as the parent of the second, to simulate the XenServer differential snapshot. After the third snapshot was generated, the first one was marked as Destroyed, ACS generated the entry SNAPSHOT.DELETE on cloud.usage_event, and the entries Primary and Image ended up as Destroyed and Ready, respectively. After the fourth snapshot was generated, ACS identified that the second one was the last in the hierarchy and started removing the hierarchy. In the end, the entries for the first and second snapshots were marked as removed and only the last 2 snapshots had entries in the Ready state.

I also forced errors in the deletion of the snapshot in the primary and in the secondary storage. At the end of both tests, ACS inserted the entries SNAPSHOT.DELETE on cloud.usage_event and the garbage collector removed the entries from cloud.snapshot_store_ref.

GutoVeronezi avatar Aug 10 '22 21:08 GutoVeronezi

Codecov Report

Merging #6630 (5e268ce) into main (d3ec27d) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##               main    #6630   +/-   ##
=========================================
  Coverage      5.87%    5.88%           
- Complexity     3933     3940    +7     
=========================================
  Files          2454     2454           
  Lines        242557   242597   +40     
  Branches      37965    37974    +9     
=========================================
+ Hits          14246    14271   +25     
- Misses       226735   226750   +15     
  Partials       1576     1576           
|Impacted Files|Coverage Δ|
|--------------|----------|
|...tack/storage/snapshot/DefaultSnapshotStrategy.java|23.98% <100.00%> (+7.46%) :arrow_up:|
|...ava/com/cloud/upgrade/dao/Upgrade41700to41710.java|4.76% <0.00%> (-3.24%) :arrow_down:|
|...ervisor/kvm/resource/LibvirtComputingResource.java|15.97% <0.00%> (-0.02%) :arrow_down:|
|...tastore/driver/StorPoolPrimaryDataStoreDriver.java|0.00% <0.00%> (ø)|
|...rc/main/java/com/cloud/storage/VolumeDetailVO.java|21.42% <0.00%> (+21.42%) :arrow_up:|

codecov[bot] avatar Aug 11 '22 20:08 codecov[bot]

@blueorangutan package

DaanHoogland avatar Aug 12 '22 08:08 DaanHoogland

@DaanHoogland a Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Aug 12 '22 08:08 blueorangutan

Packaging result: :heavy_check_mark: el7 :heavy_check_mark: el8 :heavy_check_mark: debian :heavy_check_mark: suse15. SL-JID 3970

blueorangutan avatar Aug 12 '22 09:08 blueorangutan

@blueorangutan test matrix

DaanHoogland avatar Aug 15 '22 09:08 DaanHoogland

@DaanHoogland a Trillian-Jenkins matrix job (centos7 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

blueorangutan avatar Aug 15 '22 09:08 blueorangutan

Trillian test result (tid-4689)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 41592 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6630-t4689-kvm-centos7.zip
Smoke tests completed. 101 look OK, 0 have errors
Only failed tests results shown below:

|Test|Result|Time (s)|Test File|
|----|------|--------|---------|

blueorangutan avatar Aug 15 '22 21:08 blueorangutan

Trillian test result (tid-4688)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 42578 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6630-t4688-xenserver-71.zip
Smoke tests completed. 100 look OK, 1 have errors
Only failed tests results shown below:

|Test|Result|Time (s)|Test File|
|----|------|--------|---------|
|test_02_cancel_host_maintenace_with_migration_jobs|Error|1512.23|test_host_maintenance.py|

blueorangutan avatar Aug 15 '22 21:08 blueorangutan

Trillian test result (tid-4690)
Environment: vmware-65u2 (x2), Advanced Networking with Mgmt server 7
Total time taken: 44395 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6630-t4690-vmware-65u2.zip
Smoke tests completed. 100 look OK, 1 have errors
Only failed tests results shown below:

|Test|Result|Time (s)|Test File|
|----|------|--------|---------|
|test_08_upgrade_kubernetes_ha_cluster|Failure|636.31|test_kubernetes_clusters.py|

blueorangutan avatar Aug 15 '22 21:08 blueorangutan

@blueorangutan test centos7 xenserver-71

DaanHoogland avatar Aug 16 '22 08:08 DaanHoogland

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + xenserver-71) has been kicked to run smoke tests

blueorangutan avatar Aug 16 '22 08:08 blueorangutan

Trillian test result (tid-4704)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 36187 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6630-t4704-xenserver-71.zip
Smoke tests completed. 101 look OK, 0 have errors
Only failed tests results shown below:

|Test|Result|Time (s)|Test File|
|----|------|--------|---------|

blueorangutan avatar Aug 16 '22 18:08 blueorangutan

Thanks @GutoVeronezi, would this need extra marvin tests?

nvazquez avatar Aug 30 '22 16:08 nvazquez

@nvazquez, as tests passed with KVM, Xen and VMware, maybe there is no need for more marvin tests; however, some "monkey testing" would be nice, as @DaanHoogland suggested.

GutoVeronezi avatar Sep 08 '22 14:09 GutoVeronezi

Hello guys, any update about this one?

GutoVeronezi avatar Oct 11 '22 14:10 GutoVeronezi

@GutoVeronezi I'll give it a quick spin. I think we can merge this otherwise. @blueorangutan test keepEnv

DaanHoogland avatar Oct 12 '22 08:10 DaanHoogland

@DaanHoogland a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan avatar Oct 12 '22 08:10 blueorangutan

Trillian test result (tid-5106)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 37458 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6630-t5106-kvm-centos7.zip
Smoke tests completed. 101 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

|Test|Result|Time (s)|Test File|
|----|------|--------|---------|

blueorangutan avatar Oct 13 '22 06:10 blueorangutan

Trillian test result (tid-5115)
Environment: xenserver-74 (x2), Advanced Networking with Mgmt server 7
Total time taken: 44100 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr6630-t5115-xenserver-74.zip
Smoke tests completed. 100 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

|Test|Result|Time (s)|Test File|
|----|------|--------|---------|
|test_13_migrate_volume_and_change_offering|Error|7.41|test_volumes.py|

blueorangutan avatar Oct 13 '22 19:10 blueorangutan

As a matter of formality: tested on xenserver-7.4 with no visible problems from the user side. After six hourly backups there are six entries in the DB, four with physical size 0 and only one marked as primary. Events:

|id   |type                |account_id|created                |zone_id|resource_id|resource_name              |offering_id|template_id|size      |resource_type|processed|virtual_size|
|-----|--------------------|----------|-----------------------|-------|-----------|---------------------------|-----------|-----------|----------|-------------|---------|------------|
|2,612|VM.START            |2         |2022-10-14 08:42:38.000|1      |349        |vm1                        |1          |287        |          |XenServer    |0        |            |
|2,613|SNAPSHOT.ON_PRIMARY |2         |2022-10-14 08:54:15.000|1      |7          |vm1_ROOT-349_20221014085410|           |           |52,428,800|             |0        |52,428,800  |
|2,614|SNAPSHOT.CREATE     |2         |2022-10-14 08:54:25.000|1      |7          |vm1_ROOT-349_20221014085410|           |           |52,531,712|             |0        |52,428,800  |
|2,615|SNAPSHOT.ON_PRIMARY |2         |2022-10-14 09:54:15.000|1      |8          |vm1_ROOT-349_20221014095410|           |           |52,428,800|             |0        |52,428,800  |
|2,616|SNAPSHOT.OFF_PRIMARY|2         |2022-10-14 09:54:16.000|1      |7          |vm1_ROOT-349_20221014085410|           |           |0         |             |0        |0           |
|2,617|SNAPSHOT.CREATE     |2         |2022-10-14 09:54:16.000|1      |8          |vm1_ROOT-349_20221014095410|           |           |0         |             |0        |52,428,800  |
|2,618|SNAPSHOT.ON_PRIMARY |2         |2022-10-14 10:54:15.000|1      |9          |vm1_ROOT-349_20221014105410|           |           |52,428,800|             |0        |52,428,800  |
|2,619|SNAPSHOT.OFF_PRIMARY|2         |2022-10-14 10:54:16.000|1      |8          |vm1_ROOT-349_20221014095410|           |           |0         |             |0        |0           |
|2,620|SNAPSHOT.DELETE     |2         |2022-10-14 10:54:16.000|1      |7          |vm1_ROOT-349_20221014085410|           |           |0         |             |0        |            |
|2,621|SNAPSHOT.CREATE     |2         |2022-10-14 10:54:16.000|1      |9          |vm1_ROOT-349_20221014105410|           |           |0         |             |0        |52,428,800  |
|2,622|SNAPSHOT.ON_PRIMARY |2         |2022-10-14 11:54:16.000|1      |10         |vm1_ROOT-349_20221014115410|           |           |52,428,800|             |0        |52,428,800  |
|2,623|SNAPSHOT.OFF_PRIMARY|2         |2022-10-14 11:54:16.000|1      |9          |vm1_ROOT-349_20221014105410|           |           |0         |             |0        |0           |
|2,624|SNAPSHOT.DELETE     |2         |2022-10-14 11:54:16.000|1      |8          |vm1_ROOT-349_20221014095410|           |           |0         |             |0        |            |
|2,625|SNAPSHOT.CREATE     |2         |2022-10-14 11:54:16.000|1      |10         |vm1_ROOT-349_20221014115410|           |           |0         |             |0        |52,428,800  |
|2,626|SNAPSHOT.ON_PRIMARY |2         |2022-10-14 12:54:16.000|1      |11         |vm1_ROOT-349_20221014125410|           |           |52,428,800|             |0        |52,428,800  |
|2,627|SNAPSHOT.OFF_PRIMARY|2         |2022-10-14 12:54:16.000|1      |10         |vm1_ROOT-349_20221014115410|           |           |0         |             |0        |0           |
|2,628|SNAPSHOT.DELETE     |2         |2022-10-14 12:54:16.000|1      |9          |vm1_ROOT-349_20221014105410|           |           |0         |             |0        |            |
|2,629|SNAPSHOT.CREATE     |2         |2022-10-14 12:54:16.000|1      |11         |vm1_ROOT-349_20221014125410|           |           |0         |             |0        |52,428,800  |
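
One way to read the events listed above is to pair each SNAPSHOT.CREATE with its SNAPSHOT.DELETE, and each SNAPSHOT.ON_PRIMARY with its SNAPSHOT.OFF_PRIMARY, per resource_id. The sketch below does that for the snapshot rows of the table (the VM.START row is left out); the helper name `unmatched` is hypothetical and the event list is transcribed by hand.

```python
from typing import Iterable, List, Tuple

# (type, resource_id) pairs transcribed from the snapshot events above.
EVENTS: List[Tuple[str, int]] = [
    ("SNAPSHOT.ON_PRIMARY", 7), ("SNAPSHOT.CREATE", 7),
    ("SNAPSHOT.ON_PRIMARY", 8), ("SNAPSHOT.OFF_PRIMARY", 7),
    ("SNAPSHOT.CREATE", 8),
    ("SNAPSHOT.ON_PRIMARY", 9), ("SNAPSHOT.OFF_PRIMARY", 8),
    ("SNAPSHOT.DELETE", 7), ("SNAPSHOT.CREATE", 9),
    ("SNAPSHOT.ON_PRIMARY", 10), ("SNAPSHOT.OFF_PRIMARY", 9),
    ("SNAPSHOT.DELETE", 8), ("SNAPSHOT.CREATE", 10),
    ("SNAPSHOT.ON_PRIMARY", 11), ("SNAPSHOT.OFF_PRIMARY", 10),
    ("SNAPSHOT.DELETE", 9), ("SNAPSHOT.CREATE", 11),
]


def unmatched(events: Iterable[Tuple[str, int]],
              open_type: str, close_type: str) -> List[int]:
    """Return resource IDs that have an open_type event but no close_type."""
    opened: List[int] = []
    closed = set()
    for event_type, resource_id in events:
        if event_type == open_type:
            opened.append(resource_id)
        elif event_type == close_type:
            closed.add(resource_id)
    return [rid for rid in opened if rid not in closed]


# Snapshots created and not yet deleted.
print(unmatched(EVENTS, "SNAPSHOT.CREATE", "SNAPSHOT.DELETE"))          # [10, 11]
# Snapshots still on primary storage.
print(unmatched(EVENTS, "SNAPSHOT.ON_PRIMARY", "SNAPSHOT.OFF_PRIMARY"))  # [11]
```

For these events, snapshots 10 and 11 have no SNAPSHOT.DELETE yet, and only snapshot 11 is still on primary storage.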

DaanHoogland avatar Oct 14 '22 13:10 DaanHoogland