Changing the CPU overprovisioning factor breaks Prometheus and listHosts usage metrics
ISSUE TYPE
- Bug Report
COMPONENT NAME
Prometheus Exporter
API
CLOUDSTACK VERSION
4.17.0
4.18.0
CONFIGURATION
N/A
OS / ENVIRONMENT
N/A
SUMMARY
When the CPU overcommit factor is changed, the Prometheus exporter metric "cloudstack_host_cpu_usage_mhz_total", as well as the cpuused field in the listHosts API response, appears to be multiplied by the new overcommit factor.
The actual "used" metrics should not be affected by the overcommit factor. The overcommit factor should only virtually increase the capacity of the node, not the reported usage.
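To illustrate the expected relationship, here is a minimal sketch (the core counts, clock speeds, and factor values are made-up example numbers, not values from CloudStack code):

```python
# Hypothetical illustration of expected behavior; the numbers are made up and
# this is not CloudStack's actual capacity code.
host_cpu_mhz = 2 * 2400   # physical host capacity: 2 cores x 2400 MHz
vm_cpu_mhz = 2 * 500      # one running VM: 2 cores x 500 MHz

def capacity_total_mhz(overprovisioning_factor):
    # Overprovisioning should only inflate the advertised (virtual) capacity.
    return host_cpu_mhz * overprovisioning_factor

def cpu_used_mhz():
    # Real usage should depend only on what the running VMs consume,
    # regardless of the overprovisioning factor.
    return vm_cpu_mhz

for factor in (1, 2, 4):
    print(f"factor={factor} capacity={capacity_total_mhz(factor)} used={cpu_used_mhz()}")
# The capacity grows with the factor; the used value stays at 1000 MHz.
```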
STEPS TO REPRODUCE
1. Empty out a hypervisor of VMs, VRs, system VMs, etc., so there are no virtual machines running on it.
2. Pick a virtual machine to start on that hypervisor. Before starting, note the number of CPU cores and the CPU MHz it has, e.g. two cores at 500 MHz each.
3. After you have started the test virtual machine on the test hypervisor, check the Prometheus metric cloudstack_host_cpu_usage_mhz_total{hostname=<your test hypervisor>}. It should show the CPU MHz used on that hypervisor: cpu_number * cpu_mhz, e.g. 1000. This is the correct value.
4. Now change the cluster setting cpu.overprovisioning.factor to a new value, e.g. 4.
5. cloudstack_host_cpu_usage_mhz_total{hostname=<your test hypervisor>} now shows a different value, presumably calculated by the formula: cpu_number * cpu_mhz * (new_overprovisioning_factor - old_overprovisioning_factor)
6. If you stop and start the test VM, cloudstack_host_cpu_usage_mhz_total goes back to normal.
The same reproduction steps apply to the listHosts API response, field cpuused.
If you start a VM and then change the overprovisioning factor, the field will contain an incorrect value (especially noticeable with a ridiculously high overprovisioning factor, such as 1000). A sketch for observing this is below.
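A minimal sketch for checking the Prometheus side of this (the exporter URL, port and hostname below are placeholders for your environment, not values from this report):

```python
# Minimal sketch: scrape the CloudStack Prometheus exporter and print the
# cloudstack_host_cpu_usage_mhz_total sample for one host. Run it before and
# after changing cpu.overprovisioning.factor and compare the two values.
# EXPORTER_URL and HOSTNAME are placeholders for your environment.
import re
import requests

EXPORTER_URL = "http://management-server:9595/metrics"  # adjust to your exporter address
HOSTNAME = "your-test-hypervisor"

def cpu_usage_mhz(hostname):
    body = requests.get(EXPORTER_URL, timeout=10).text
    pattern = re.compile(
        r'^cloudstack_host_cpu_usage_mhz_total\{[^}]*hostname="'
        + re.escape(hostname) + r'"[^}]*\}\s+(\S+)',
        re.MULTILINE,
    )
    match = pattern.search(body)
    return float(match.group(1)) if match else None

print(cpu_usage_mhz(HOSTNAME))
# Expected: cores * MHz of the running test VM (e.g. 1000), unchanged by the
# factor change. Observed: the value jumps after the factor is changed.
```

The cpuused field of listHosts can be checked the same way before and after the factor change to see the corresponding distortion in the API response.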
EXPECTED RESULTS
The Prometheus metric cloudstack_host_cpu_usage_mhz_total and the listHosts API response (field cpuused) should not include the overprovisioning factor in their calculation, as usage metrics should report real usage.
ACTUAL RESULTS
The metric is reported without the overprovisioning factor in its calculation when a VM starts, then gets distorted when you change the overprovisioning factor.
@phsm The docs do say that you need to stop and start the VMs:
http://docs.cloudstack.apache.org/en/latest/adminguide/hosts.html?highlight=over-provisioning#setting-over-provisioning-factors
Only VMs deployed after the change are affected by the new setting. If you want VMs deployed before the change to adopt the new over-provisioning factor, you must stop and restart the VMs. When this is done, CloudStack recalculates or scales the used and reserved capacities based on the new over-provisioning factors, to ensure that CloudStack is correctly tracking the amount of free capacity.
@NuxRo Thanks for reviewing this bug report.
I think you've misunderstood what the actual bug here is.
The bug is that changing the overprovisioning factor affects the cloudstack_host_cpu_usage_mhz_total metric, which it shouldn't.
- Let's say you have some overprovisioning factor set (e.g. 2 or 10).
- When you start a VM with 2 cores at 2000 MHz on an empty hypervisor, the cloudstack_host_cpu_usage_mhz_total metric for that hypervisor shows a value of 4000. This is the correct behavior: the metric shows real usage, not multiplied by the overprovisioning factor.
- When you change the overprovisioning factor while the VM is running, the metric value appears to get multiplied by (new_overprovision_factor - old_overprovision_factor).
So the behavior is inconsistent: the usage metric is not affected by the overprovisioning factor when you start a VM, but it does become affected once you change the factor. A worked example is below.
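A worked example of the suspected distortion, using the numbers above (the formula is only a guess inferred from the observed values, not taken from the CloudStack code):

```python
# A guess at what the distorted value looks like, using the numbers above.
# The (new - old) formula is inferred from observations, not from the code.
vm_usage_mhz = 2 * 2000   # real usage right after the VM starts: 4000 MHz

old_factor = 2
new_factor = 4

correct_value = vm_usage_mhz                                 # should stay 4000
distorted_value = vm_usage_mhz * (new_factor - old_factor)   # what seems to be reported after the change

print(correct_value, distorted_value)
```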
@phsm Yes, it is a problem. This behaviour is unfortunate.
Re-opened; it seems #7629 fixes a different issue than the one mentioned here.
Yes @sureshanaparti, it is a different issue.
@soreana, are you still looking at this?