cloudstack icon indicating copy to clipboard operation
cloudstack copied to clipboard

Do not retrieve VM's stats on normal VM listing

Open JoaoJandre opened this issue 11 months ago • 14 comments

Description

For both the listVirtualMachines and listVirtualMachineMetrics APIs, by default, ACS queries all VM details (the details parameter is default all) and consequently queries the cloud.vm_stats table for each VM, creating a summary of statistics. Therefore, every process carried out in the UI that uses the listVirtualMachines API also loads the VM statistics, which can cause slowdowns depending on the number of records.

A new configuration return.vm.stats.on.vm.list (with default true), was added, when it is false, the listVirtualMachines API will not list the VM's metrics by default.

Types of changes

  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [X] Enhancement (improves an existing feature and functionality)
  • [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
  • [ ] build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • [X] Major
  • [ ] Minor

Bug Severity

  • [ ] BLOCKER
  • [ ] Critical
  • [ ] Major
  • [ ] Minor
  • [ ] Trivial

How Has This Been Tested?

Called listVirtualMachines API, without any parameters:

  • before this PR, the all the VM's details were returned;
  • with this PR and return.vm.stats.on.vm.list as true (default), all the details were returned as well;
  • with this PR and return.vm.stats.on.vm.list as false, all the details except the metrics were returned.

JoaoJandre avatar Mar 13 '24 18:03 JoaoJandre

Codecov Report

Attention: Patch coverage is 47.82609% with 12 lines in your changes are missing coverage. Please review.

Project coverage is 30.93%. Comparing base (a7ec873) to head (c711a39). Report is 7 commits behind head on 4.19.

Files Patch % Lines
...che/cloudstack/api/command/user/vm/ListVMsCmd.java 41.66% 6 Missing and 1 partial :warning:
...a/org/apache/cloudstack/api/ListVMsMetricsCmd.java 33.33% 1 Missing and 1 partial :warning:
...apache/cloudstack/api/response/UserVmResponse.java 80.00% 1 Missing :warning:
ui/src/config/section/compute.js 0.00% 1 Missing :warning:
ui/src/views/AutogenView.vue 0.00% 0 Missing and 1 partial :warning:
Additional details and impacted files
@@            Coverage Diff            @@
##               4.19    #8782   +/-   ##
=========================================
  Coverage     30.93%   30.93%           
- Complexity    34263    34276   +13     
=========================================
  Files          5353     5354    +1     
  Lines        376055   376100   +45     
  Branches      54691    54697    +6     
=========================================
+ Hits         116317   116346   +29     
- Misses       244443   244453   +10     
- Partials      15295    15301    +6     
Flag Coverage Δ
simulator-marvin-tests 24.76% <52.38%> (+<0.01%) :arrow_up:
uitests 4.39% <0.00%> (-0.01%) :arrow_down:
unit-tests 16.58% <0.00%> (-0.01%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Mar 13 '24 18:03 codecov[bot]

@JoaoJandre I suggest to check/fix the API calls from UI to list vms or vm metrics.

It is not good to change the default behavior if it is not a bug, IMHO.

weizhouapache avatar Mar 13 '24 18:03 weizhouapache

For both the listVirtualMachines and listVirtualMachineMetrics APIs, by default, ACS queries all VM details (the details parameter is default all) and consequently queries the cloud.vm_stats table for each VM,

This is because the listVirtualMachinesMetrics API (and all Metrics API) are superset of the non-metrics APIs. All the Metrics APIs add something related to metrics in their API response in addition to what a non-metrics API would return. I think we shouldn't change the default behaviour of the non-metrics API - and if you really have a use-case for this see if there's any other way to do what you're trying to accomplish or worst-case add a global setting that allows the behaviour you want but keep the setting's default in such a way that it continue old behaviour for other users.

rohityadavcloud avatar Mar 14 '24 12:03 rohityadavcloud

@JoaoJandre I suggest to check/fix the API calls from UI to list vms or vm metrics.

It is not good to change the default behavior if it is not a bug, IMHO.

@weizhouapache, @sureshanaparti and @rohityadavcloud, we could change only the UI behavior; however, why do we have this default behavior? We do not use these metrics that are returned on this API, when the metrics are used, users, operators and the UI use the listVirtualMachineMetrics API. That is, we already have an API that has this exact purpose.

Not long ago we changed the default behavior of the deleteTemplate API on #7731; As with this situation, the one on #7731 wasn't a bug, it was an inconvenient behavior. A user/operator that only wants to list the VMs intuitively will use the listVirtualMachines API; however, depending on the environment's size, he will deal with slowness because of the default behavior. As with #7731, we could only a change the UI, but changing the default behavior makes more sense.

JoaoJandre avatar Mar 14 '24 13:03 JoaoJandre

Codecov Report

Attention: Patch coverage is 3.22581% with 30 lines in your changes are missing coverage. Please review.

Project coverage is 14.96%. Comparing base (a7ec873) to head (e0f7462). Report is 109 commits behind head on 4.19.

Files Patch % Lines
...che/cloudstack/api/command/user/vm/ListVMsCmd.java 0.00% 17 Missing :warning:
...apache/cloudstack/api/response/UserVmResponse.java 0.00% 8 Missing :warning:
...a/org/apache/cloudstack/api/ListVMsMetricsCmd.java 0.00% 4 Missing :warning:
...n/java/com/cloud/api/query/ViewResponseHelper.java 0.00% 1 Missing :warning:
Additional details and impacted files
@@              Coverage Diff              @@
##               4.19    #8782       +/-   ##
=============================================
- Coverage     30.93%   14.96%   -15.97%     
+ Complexity    34263    10989    -23274     
=============================================
  Files          5353     5373       +20     
  Lines        376055   469034    +92979     
  Branches      54691    57597     +2906     
=============================================
- Hits         116317    70177    -46140     
- Misses       244443   391087   +146644     
+ Partials      15295     7770     -7525     
Flag Coverage Δ
simulator-marvin-tests ?
uitests 4.31% <ø> (-0.08%) :arrow_down:
unit-tests ?
unittests 15.66% <3.22%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar May 03 '24 15:05 codecov-commenter

@blueorangutan package

JoaoJandre avatar May 06 '24 17:05 JoaoJandre

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar May 06 '24 17:05 blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9549

blueorangutan avatar May 06 '24 18:05 blueorangutan

@rohityadavcloud , @weizhouapache I added a configuration to control the behavior of the listVirtualMachines API and updated the description. Now the default behavior is maintained. I hope we can advance the discussion on #8970 so we can create a proper mechanism to change the default behavior in the future.

JoaoJandre avatar May 06 '24 18:05 JoaoJandre

@blueorangutan test rocky8 kvm-rocky8

weizhouapache avatar May 06 '24 18:05 weizhouapache

@weizhouapache a [SL] Trillian-Jenkins test job (rocky8 mgmt + kvm-rocky8) has been kicked to run smoke tests

blueorangutan avatar May 06 '24 18:05 blueorangutan

@rohityadavcloud , @weizhouapache I added a configuration to control the behavior of the listVirtualMachines API and updated the description. Now the default behavior is maintained. I hope we can advance the discussion on #8970 so we can create a proper mechanism to change the default behavior in the future.

thanks @JoaoJandre for the update

weizhouapache avatar May 06 '24 18:05 weizhouapache

[SF] Trillian test result (tid-10170) Environment: kvm-rocky8 (x2), Advanced Networking with Mgmt server r8 Total time taken: 49325 seconds Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8782-t10170-kvm-rocky8.zip Smoke tests completed. 129 look OK, 2 have errors, 0 did not run Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_events_resource Error 418.29 test_events_resource.py
test_01_restore_vm Error 0.24 test_restore_vm.py
test_02_restore_vm_allocated_root Error 0.18 test_restore_vm.py
ContextSuite context=TestRestoreVM>:teardown Error 1.26 test_restore_vm.py

blueorangutan avatar May 07 '24 09:05 blueorangutan

I like the idea of splitting which API is called for metrics vs non-metrics list view (I might or others should steal the pattern for all metrics API usage across the UI). I'm not fully satisfied with the PR @JoaoJandre yet, and I would rather encourage you can pick some ideas from #8985 - that said, I'm also inclined to make progress in a cordial and mature manner. It would be easier to get this merge and optimise the general solution further as required (by me or others).

I'm working on a much wider PR that's in research and progress, that's around wider scalability issues of CloudStack (surprisingly I'm near the root cause, and it maybe possible to even get stats without much penalty). All that said - I wouldn't remember everything I write on Github on each and every PR, take my review with a pinch of salt and I may change my views on things as I'm dealing with a wider scalability problem. I'll leave some comments, but think let's go ahead.

rohityadavcloud avatar May 08 '24 15:05 rohityadavcloud

@sureshanaparti ok by you like this?

DaanHoogland avatar May 18 '24 19:05 DaanHoogland

@blueorangutan package

DaanHoogland avatar May 18 '24 19:05 DaanHoogland

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar May 18 '24 19:05 blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✖️ debian ✔️ suse15. SL-JID 9631

blueorangutan avatar May 18 '24 20:05 blueorangutan

ping @sureshanaparti

DaanHoogland avatar May 22 '24 08:05 DaanHoogland

@blueorangutan package

DaanHoogland avatar May 22 '24 08:05 DaanHoogland

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar May 22 '24 08:05 blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9663

blueorangutan avatar May 22 '24 09:05 blueorangutan

@blueorangutan test

DaanHoogland avatar May 22 '24 12:05 DaanHoogland

@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan avatar May 22 '24 12:05 blueorangutan

[SF] Trillian test result (tid-10254) Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7 Total time taken: 43718 seconds Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8782-t10254-kvm-centos7.zip Smoke tests completed. 130 look OK, 1 have errors, 0 did not run Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_events_resource Error 416.93 test_events_resource.py

blueorangutan avatar May 23 '24 06:05 blueorangutan