cloudstack
cloudstack copied to clipboard
Do not retrieve VM's stats on normal VM listing
Description
For both the listVirtualMachines
and listVirtualMachineMetrics
APIs, by default, ACS queries all VM details (the details
parameter is default all
) and consequently queries the cloud.vm_stats
table for each VM, creating a summary of statistics. Therefore, every process carried out in the UI that uses the listVirtualMachines
API also loads the VM statistics, which can cause slowdowns depending on the number of records.
A new configuration return.vm.stats.on.vm.list
(with default true), was added, when it is false, the listVirtualMachines
API will not list the VM's metrics by default.
Types of changes
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [X] Enhancement (improves an existing feature and functionality)
- [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
- [ ] build/CI
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
- [X] Major
- [ ] Minor
Bug Severity
- [ ] BLOCKER
- [ ] Critical
- [ ] Major
- [ ] Minor
- [ ] Trivial
How Has This Been Tested?
Called listVirtualMachines API, without any parameters:
- before this PR, the all the VM's details were returned;
- with this PR and
return.vm.stats.on.vm.list
as true (default), all the details were returned as well; - with this PR and
return.vm.stats.on.vm.list
as false, all the details except the metrics were returned.
Codecov Report
Attention: Patch coverage is 47.82609%
with 12 lines
in your changes are missing coverage. Please review.
Project coverage is 30.93%. Comparing base (
a7ec873
) to head (c711a39
). Report is 7 commits behind head on 4.19.
Additional details and impacted files
@@ Coverage Diff @@
## 4.19 #8782 +/- ##
=========================================
Coverage 30.93% 30.93%
- Complexity 34263 34276 +13
=========================================
Files 5353 5354 +1
Lines 376055 376100 +45
Branches 54691 54697 +6
=========================================
+ Hits 116317 116346 +29
- Misses 244443 244453 +10
- Partials 15295 15301 +6
Flag | Coverage Δ | |
---|---|---|
simulator-marvin-tests | 24.76% <52.38%> (+<0.01%) |
:arrow_up: |
uitests | 4.39% <0.00%> (-0.01%) |
:arrow_down: |
unit-tests | 16.58% <0.00%> (-0.01%) |
:arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@JoaoJandre I suggest to check/fix the API calls from UI to list vms or vm metrics.
It is not good to change the default behavior if it is not a bug, IMHO.
For both the listVirtualMachines and listVirtualMachineMetrics APIs, by default, ACS queries all VM details (the details parameter is default all) and consequently queries the cloud.vm_stats table for each VM,
This is because the listVirtualMachinesMetrics API (and all Metrics API) are superset of the non-metrics APIs. All the Metrics APIs add something related to metrics in their API response in addition to what a non-metrics API would return. I think we shouldn't change the default behaviour of the non-metrics API - and if you really have a use-case for this see if there's any other way to do what you're trying to accomplish or worst-case add a global setting that allows the behaviour you want but keep the setting's default in such a way that it continue old behaviour for other users.
@JoaoJandre I suggest to check/fix the API calls from UI to list vms or vm metrics.
It is not good to change the default behavior if it is not a bug, IMHO.
@weizhouapache, @sureshanaparti and @rohityadavcloud, we could change only the UI behavior; however, why do we have this default behavior?
We do not use these metrics that are returned on this API, when the metrics are used, users, operators and the UI use the listVirtualMachineMetrics
API. That is, we already have an API that has this exact purpose.
Not long ago we changed the default behavior of the deleteTemplate
API on #7731; As with this situation, the one on #7731 wasn't a bug, it was an inconvenient behavior. A user/operator that only wants to list the VMs intuitively will use the listVirtualMachines
API; however, depending on the environment's size, he will deal with slowness because of the default behavior. As with #7731, we could only a change the UI, but changing the default behavior makes more sense.
Codecov Report
Attention: Patch coverage is 3.22581%
with 30 lines
in your changes are missing coverage. Please review.
Project coverage is 14.96%. Comparing base (
a7ec873
) to head (e0f7462
). Report is 109 commits behind head on 4.19.
Additional details and impacted files
@@ Coverage Diff @@
## 4.19 #8782 +/- ##
=============================================
- Coverage 30.93% 14.96% -15.97%
+ Complexity 34263 10989 -23274
=============================================
Files 5353 5373 +20
Lines 376055 469034 +92979
Branches 54691 57597 +2906
=============================================
- Hits 116317 70177 -46140
- Misses 244443 391087 +146644
+ Partials 15295 7770 -7525
Flag | Coverage Δ | |
---|---|---|
simulator-marvin-tests | ? |
|
uitests | 4.31% <ø> (-0.08%) |
:arrow_down: |
unit-tests | ? |
|
unittests | 15.66% <3.22%> (?) |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@blueorangutan package
@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9549
@rohityadavcloud , @weizhouapache I added a configuration to control the behavior of the listVirtualMachines
API and updated the description. Now the default behavior is maintained. I hope we can advance the discussion on #8970 so we can create a proper mechanism to change the default behavior in the future.
@blueorangutan test rocky8 kvm-rocky8
@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests
@rohityadavcloud , @weizhouapache I added a configuration to control the behavior of the
listVirtualMachines
API and updated the description. Now the default behavior is maintained. I hope we can advance the discussion on #8970 so we can create a proper mechanism to change the default behavior in the future.
thanks @JoaoJandre for the update
[SF] Trillian test result (tid-10170) Environment: kvm-rocky8 (x2), Advanced Networking with Mgmt server r8 Total time taken: 49325 seconds Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8782-t10170-kvm-rocky8.zip Smoke tests completed. 129 look OK, 2 have errors, 0 did not run Only failed and skipped tests results shown below:
Test | Result | Time (s) | Test File |
---|---|---|---|
test_01_events_resource | Error |
418.29 | test_events_resource.py |
test_01_restore_vm | Error |
0.24 | test_restore_vm.py |
test_02_restore_vm_allocated_root | Error |
0.18 | test_restore_vm.py |
ContextSuite context=TestRestoreVM>:teardown | Error |
1.26 | test_restore_vm.py |
I like the idea of splitting which API is called for metrics vs non-metrics list view (I might or others should steal the pattern for all metrics API usage across the UI). I'm not fully satisfied with the PR @JoaoJandre yet, and I would rather encourage you can pick some ideas from #8985 - that said, I'm also inclined to make progress in a cordial and mature manner. It would be easier to get this merge and optimise the general solution further as required (by me or others).
I'm working on a much wider PR that's in research and progress, that's around wider scalability issues of CloudStack (surprisingly I'm near the root cause, and it maybe possible to even get stats without much penalty). All that said - I wouldn't remember everything I write on Github on each and every PR, take my review with a pinch of salt and I may change my views on things as I'm dealing with a wider scalability problem. I'll leave some comments, but think let's go ahead.
@sureshanaparti ok by you like this?
@blueorangutan package
@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✖️ debian ✔️ suse15. SL-JID 9631
ping @sureshanaparti
@blueorangutan package
@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9663
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
[SF] Trillian test result (tid-10254) Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7 Total time taken: 43718 seconds Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8782-t10254-kvm-centos7.zip Smoke tests completed. 130 look OK, 1 have errors, 0 did not run Only failed and skipped tests results shown below:
Test | Result | Time (s) | Test File |
---|---|---|---|
test_01_events_resource | Error |
416.93 | test_events_resource.py |