kvm: ref-count storage pool usage
Description
If a storage pool is used by e.g. two concurrent snapshot->template actions, the first action to finish removed the netfs mount point out from under the other one. Storage pools are now usage ref-counted and are only deleted once there are no more users.
Fixes: #8899
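A minimal sketch of the ref-counting idea, with hypothetical names (`PoolUsageTracker`, `unmountNetfs`) rather than the actual LibvirtStorageAdaptor changes: acquiring a pool bumps a per-pool counter, and the netfs mount point is only removed when the last user releases it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of per-pool usage ref-counting (not the actual patch).
final class PoolUsageTracker {
    private final Map<String, Integer> usage = new ConcurrentHashMap<>();

    // An action (e.g. snapshot->template) starts using the pool.
    void acquire(String poolUuid) {
        usage.merge(poolUuid, 1, Integer::sum);
    }

    // An action is done with the pool; the netfs mount point is only
    // removed once no other concurrent action still uses it.
    void release(String poolUuid, Runnable unmountNetfs) {
        usage.compute(poolUuid, (uuid, count) -> {
            if (count == null || count <= 1) {
                unmountNetfs.run(); // last user: safe to remove the mount point
                return null;        // drop the entry
            }
            return count - 1;       // other users remain: keep the mount
        });
    }
}
```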
Types of changes
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] Enhancement (improves an existing feature and functionality)
- [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
- [ ] build/CI
- [ ] test (unit or integration test code)
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
- [ ] Major
- [ ] Minor
Bug Severity
- [ ] BLOCKER
- [ ] Critical
- [x] Major
- [ ] Minor
- [ ] Trivial
Screenshots (if appropriate):
How Has This Been Tested?
Run several snapshot-to-template actions that are executed on the same host.
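For example, a rough driver along these lines (the `createTemplateFromSnapshot` helper and the snapshot IDs are placeholders for whatever API client triggers the action), which fires the actions concurrently so they overlap on the same host:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical reproduction driver: run several snapshot->template actions in parallel.
public class ParallelSnapshotToTemplate {
    public static void main(String[] args) throws Exception {
        List<String> snapshotIds = List.of("snap-1", "snap-2", "snap-3"); // placeholder IDs
        ExecutorService executor = Executors.newFixedThreadPool(snapshotIds.size());
        List<Future<?>> results = new ArrayList<>();
        for (String snapshotId : snapshotIds) {
            Runnable action = () -> createTemplateFromSnapshot(snapshotId);
            results.add(executor.submit(action));
        }
        for (Future<?> result : results) {
            result.get(); // before the fix, one finishing action could unmount the pool still used by another
        }
        executor.shutdown();
    }

    private static void createTemplateFromSnapshot(String snapshotId) {
        // Placeholder: call the CloudStack createTemplate API with snapshotid=<snapshotId>.
    }
}
```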
How did you try to break this feature and the system with this change?
Codecov Report
Attention: Patch coverage is 6.06061% with 31 lines in your changes missing coverage. Please review.
Project coverage is 15.10%. Comparing base (03bdf11) to head (20735be). Report is 2 commits behind head on 4.19.
Additional details and impacted files
Coverage Diff:
| | 4.19 (base) | #9498 (head) | +/- |
|---|---|---|---|
| Coverage | 15.10% | 15.10% | |
| Complexity | 11220 | 11225 | +5 |
| Files | 5404 | 5404 | |
| Lines | 473460 | 473486 | +26 |
| Branches | 57728 | 59047 | +1319 |
| Hits | 71525 | 71541 | +16 |
| Misses | 393941 | 393948 | +7 |
| Partials | 7994 | 7997 | +3 |
| Flag | Coverage Δ | |
|---|---|---|
| uitests | 4.30% <ø> (ø) | |
| unittests | 15.82% <6.06%> (+<0.01%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
clgtm. Do you have a good test scenario for this, @rp-? Or is it only intermittent (i.e. not automatable)?
I'm not sure it is easy to automate that reproducibly, as it is a timing/parallelism issue. I haven't yet checked whether an NFS primary storage uses the same code paths, but I might do that this week to see if it would also be affected somehow.
But we have two customers who haven't reported any issues with this yet.
@blueorangutan package
@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10620
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-11065). Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8. Total time taken: 47025 seconds. Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9498-t11065-kvm-ol8.zip
Smoke tests completed: 127 look OK, 6 have errors, 0 did not run. Only failed and skipped test results are shown below:
| Test | Result | Time (s) | Test File |
|---|---|---|---|
| test_01_add_primary_storage_disabled_host | Error | 0.33 | test_primary_storage.py |
| test_01_primary_storage_nfs | Error | 0.37 | test_primary_storage.py |
| ContextSuite context=TestStorageTags>:setup | Error | 0.63 | test_primary_storage.py |
| test_01_primary_storage_scope_change | Error | 0.21 | test_primary_storage_scope.py |
| ContextSuite context=TestCpuCapServiceOfferings>:setup | Error | 0.00 | test_service_offerings.py |
| test_02_list_snapshots_with_removed_data_store | Error | 8.74 | test_snapshots.py |
| test_02_list_snapshots_with_removed_data_store | Error | 8.75 | test_snapshots.py |
| test_01_deploy_vm_on_specific_host | Error | 0.11 | test_vm_deployment_planner.py |
| test_04_deploy_vm_on_host_override_pod_and_cluster | Error | 0.14 | test_vm_deployment_planner.py |
| test_01_migrate_VM_and_root_volume | Error | 83.40 | test_vm_life_cycle.py |
| test_02_migrate_VM_with_two_data_disks | Error | 50.91 | test_vm_life_cycle.py |
| test_01_secure_vm_migration | Error | 134.41 | test_vm_life_cycle.py |
| test_01_secure_vm_migration | Error | 134.42 | test_vm_life_cycle.py |
| test_08_migrate_vm | Error | 0.06 | test_vm_life_cycle.py |
@blueorangutan package
@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 10927
Packaging result [SF]: ✖️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 10950
I guess the failed packaging is not related to this PR?
11:02:25 [ERROR] Failures:
11:02:25 [ERROR] VMSchedulerImplTest.testScheduleNextJobScheduleCurrentSchedule:262 expected:<Wed Sep 04 09:02:00 UTC 2024> but was:<Wed Sep 04 09:03:00 UTC 2024>
Looks like a test was too slow, so it might have to do with a too-busy container. Retrying, @rp-.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10988
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-11364). Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8. Total time taken: 56881 seconds. Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9498-t11364-kvm-ol8.zip
Smoke tests completed: 125 look OK, 8 have errors, 0 did not run. Only failed and skipped test results are shown below:
| Test | Result | Time (s) | Test File |
|---|---|---|---|
| test_01_add_primary_storage_disabled_host | Error | 0.66 | test_primary_storage.py |
| test_01_primary_storage_nfs | Error | 0.33 | test_primary_storage.py |
| ContextSuite context=TestStorageTags>:setup | Error | 0.62 | test_primary_storage.py |
| test_01_primary_storage_scope_change | Error | 0.22 | test_primary_storage_scope.py |
| ContextSuite context=TestCpuCapServiceOfferings>:setup | Error | 0.00 | test_service_offerings.py |
| test_02_list_snapshots_with_removed_data_store | Error | 9.77 | test_snapshots.py |
| test_02_list_snapshots_with_removed_data_store | Error | 9.77 | test_snapshots.py |
| test_01_volume_usage | Failure | 848.98 | test_usage.py |
| test_01_deploy_vm_on_specific_host | Error | 0.10 | test_vm_deployment_planner.py |
| test_04_deploy_vm_on_host_override_pod_and_cluster | Error | 0.13 | test_vm_deployment_planner.py |
| test_01_migrate_VM_and_root_volume | Error | 87.68 | test_vm_life_cycle.py |
| test_02_migrate_VM_with_two_data_disks | Error | 52.01 | test_vm_life_cycle.py |
| test_01_secure_vm_migration | Error | 316.92 | test_vm_life_cycle.py |
| test_02_unsecure_vm_migration | Error | 459.21 | test_vm_life_cycle.py |
| test_08_migrate_vm | Error | 0.09 | test_vm_life_cycle.py |
| test_06_download_detached_volume | Error | 310.28 | test_volumes.py |
@blueorangutan package
@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11033
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-11408). Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8. Total time taken: 44373 seconds. Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9498-t11408-kvm-ol8.zip
Smoke tests completed: 127 look OK, 6 have errors, 0 did not run. Only failed and skipped test results are shown below:
| Test | Result | Time (s) | Test File |
|---|---|---|---|
| test_01_add_primary_storage_disabled_host | Error | 0.31 | test_primary_storage.py |
| test_01_primary_storage_nfs | Error | 0.30 | test_primary_storage.py |
| ContextSuite context=TestStorageTags>:setup | Error | 0.60 | test_primary_storage.py |
| test_01_primary_storage_scope_change | Error | 0.21 | test_primary_storage_scope.py |
| ContextSuite context=TestCpuCapServiceOfferings>:setup | Error | 0.00 | test_service_offerings.py |
| test_02_list_snapshots_with_removed_data_store | Error | 8.63 | test_snapshots.py |
| test_02_list_snapshots_with_removed_data_store | Error | 8.63 | test_snapshots.py |
| test_01_deploy_vm_on_specific_host | Error | 0.09 | test_vm_deployment_planner.py |
| test_04_deploy_vm_on_host_override_pod_and_cluster | Error | 0.14 | test_vm_deployment_planner.py |
| test_01_migrate_VM_and_root_volume | Error | 82.29 | test_vm_life_cycle.py |
| test_02_migrate_VM_with_two_data_disks | Error | 50.76 | test_vm_life_cycle.py |
| test_01_secure_vm_migration | Error | 134.37 | test_vm_life_cycle.py |
| test_01_secure_vm_migration | Error | 134.37 | test_vm_life_cycle.py |
| test_08_migrate_vm | Error | 0.08 | test_vm_life_cycle.py |
Are there more logs for this "needs access to storage pool" message? Or is this another problem?
Sorry @rp-, I'm missing context here. If you are referring to the smoke tests, the download does contain the management server logs.
I see exceptions like this:
2024-09-06 16:25:57,620 DEBUG [c.c.a.t.Request] (AgentManager-Handler-13:null) (logid:) Seq 1-8334192585425814054: Processing: { Ans: , MgmtId: 32986405799053, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":"false","details":"com.cloud.utils.exception.CloudRuntimeException: libvirt failed to mount storage pool 97fc931d-601a-3ec4-b2bd-5634380ea92b at /mnt/97fc931d-601a-3ec4-b2bd-5634380ea92b
at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.checkNetfsStoragePoolMounted(LibvirtStorageAdaptor.java:284)
at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.createStoragePool(LibvirtStorageAdaptor.java:787)
at com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.createStoragePool(KVMStoragePoolManager.java:364)
at com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.createStoragePool(KVMStoragePoolManager.java:358)
at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtModifyStoragePoolCommandWrapper.execute(LibvirtModifyStoragePoolCommandWrapper.java:42)
at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtModifyStoragePoolCommandWrapper.execute(LibvirtModifyStoragePoolCommandWrapper.java:35)
at com.cloud.hypervisor.kvm.resource.wrapper.LibvirtRequestWrapper.execute(LibvirtRequestWrapper.java:78)
at com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1929)
at com.cloud.agent.Agent.processRequest(Agent.java:683)
at com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:1106)
at com.cloud.utils.nio.Task.call(Task.java:83)
at com.cloud.utils.nio.Task.call(Task.java:29)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
","wait":"0","bypassHostMaintenance":"false"}}] }
But here the agent logs would be interesting.
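For context, a rough sketch of the kind of mount check that ends up failing here (assumed logic, not the actual `checkNetfsStoragePoolMounted` code): if a concurrent action has just removed the mount point, the path no longer appears in `/proc/mounts` and an exception like the one above is thrown.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Assumed illustration of a netfs mount check; names and logic are a sketch only.
final class NetfsMountCheck {
    static boolean isMounted(String mountPoint) throws IOException {
        // A pool is considered mounted only if its target path shows up in /proc/mounts.
        return Files.readAllLines(Path.of("/proc/mounts")).stream()
                .anyMatch(line -> line.contains(" " + mountPoint + " "));
    }

    public static void main(String[] args) throws IOException {
        String mountPoint = "/mnt/97fc931d-601a-3ec4-b2bd-5634380ea92b"; // path from the log above
        if (!isMounted(mountPoint)) {
            // Without ref-counting, a concurrent action may already have unmounted this path.
            throw new IllegalStateException("storage pool not mounted at " + mountPoint);
        }
    }
}
```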
Ah, no, I don't have those. I'll run again without the teardown step and have a look.
@blueorangutan test keepEnv
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests