
Problem with operating systems that use cgroup v2 related to cpu speed.

Open correajl opened this issue 3 years ago • 4 comments

ISSUE TYPE
  • Improvement Request
COMPONENT NAME
CloudStack management server and CloudStack agent.
CLOUDSTACK VERSION
4.17.0.1
CONFIGURATION

2 management servers, 2 databases, advanced network, everything working fine.

OS / ENVIRONMENT

Ubuntu Server 22.04 LTS, KVM, libvirt 8.0.0-1ubuntu7.1, cgroup2.

SUMMARY

The value of "CPU (in MHz)" used in some compute offering definitions is mapped to the 'shares' element in the 'cputune' section of the domain definition file (XML). According to the libvirt documentation, the value should be in the range [2, 262144]. However, for operating systems using cgroup v2 the maximum value is 10000. I know that the Ubuntu 22.04 I'm using here is not supported yet, but this will become an issue as other OSs adopt cgroup v2 too, so I think this parameter deserves attention. If the value of (N. CPU) * CPU (in MHz) is greater than 10000, the hypervisor fails with "Value specified in CPUWeight is out of range".
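A minimal Python sketch of the mapping described above (the function and constant names are mine, not CloudStack's):

```python
# CloudStack computes the libvirt 'shares' value as vCPUs * CPU speed (MHz)
# taken from the compute offering.
def shares(vcpus, speed_mhz):
    return vcpus * speed_mhz

CGROUP_V1_MAX = 262144   # upper bound documented by libvirt for 'shares'
CGROUP_V2_MAX = 10000    # cgroup v2 / systemd CPUWeight upper bound

s = shares(12, 1000)          # 12 vCPUs at 1000 MHz
print(s)                      # 12000
print(s <= CGROUP_V1_MAX)     # True: accepted under cgroup v1
print(s <= CGROUP_V2_MAX)     # False: rejected under cgroup v2
```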

As a workaround I configured the service offering with 1 MHz. This implies that VMs with more CPUs have a much higher chance of getting the CPU than VMs with fewer CPUs, because besides having more vCPUs, they also get a proportionally larger share of the host's CPUs.

VM1: 10 CPU * 1 MHz -> shares = 10
VM2: 80 CPU * 1 MHz -> shares = 80

If we look at the 1st CPU of VM2, it will have 8 times more chance of getting the host's CPU than the 1st CPU of VM1.

Shouldn't all virtual CPUs have the same chance of using one host CPU?

STEPS TO REPRODUCE

Using Ubuntu 22.04 or another OS that uses cgroup v2, create a compute offering with a CPU (in MHz) value of 1000. Try to launch an instance with more than 10 CPUs. A configuration file for the new instance will be generated with shares > 10000 inside the cputune section, and the hypervisor can't launch the instance, failing with "Value specified in CPUWeight is out of range".

EXPECTED RESULTS

Create a way for all virtual CPUs to have an equal chance of using a host CPU. Check the cgroup version and avoid generating values bigger than 10000.

ACTUAL RESULTS

As it works today, when CPU (in MHz) is mapped to the 'shares' element, instances with fewer CPUs are always penalized, and sometimes hosts can't launch instances at all.

correajl avatar Sep 15 '22 20:09 correajl

Thanks for the report - I think you're right that we need to do something here about the values potentially going out of range.

To the second point, a VM with 10 shares and a VM with 80 shares: I think the problem is that we have to keep the values intact on the service offerings in order to make the allocation math work. In your scenario the allocator would only think it has allocated 90 MHz of a host that probably has 100 GHz or more to allocate.

In your scenario I think the weighting itself still works right in cgroups at the host. Just to make the numbers easier: if VM1 had 20 vCPUs and 20 shares and VM2 had 80 vCPUs and 80 shares, then when the scheduler breaks the CPU scheduling down into runtime periods (assuming no other workloads are involved), VM1 gets 20% of each runtime period and VM2 gets 80% of each runtime period. On a (fictional) 100 core hypervisor host, this would mean VM1 gets ~20 cores' worth of the system's CPU time and VM2 gets ~80 cores' worth (not exactly, and not necessarily implying pinning to real cores; just in terms of the scheduler's view of CPU time per period considering all cores).

The bigger problem is really that this 100 core host has maybe 200 GHz worth of CPU to allocate, and with 1 MHz CPU offerings cloudstack calculates that you have only scheduled 100 MHz to it! The allocators will quickly overload the system with more VMs. And the whole concept of CPUs having differing speeds is thrown out if we just map 1 CPU to 1 MHz.

My initial thought to fix this is to simply scale down the shares number that is applied at libvirt. Not so much that we can't offer different levels of performance, though.

Simple example, scale down by factor of 100:

2 vCPU x 2000 MHz offering = 4000 MHz = 40 shares
4 vCPU x 500 MHz offering = 2000 MHz = 20 shares
...
128 core x 2000 MHz offering = 2560 shares

This seems to give us a reasonable enough resolution to maintain the share weighting and also handle differing MHz speeds in the CPU offerings, which would be important for service offerings that enforce these shares as a CPU cap (via CFS quota). That is, a 1vCPU 500MHz offering with CPU cap enabled should get 1/4 of the runtime per period that a 1vCPU 2000MHz offering gets, and that doesn't work if we just map 1 CPU to 1 share.
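A sketch of the scale-down idea, assuming a fixed divisor of 100 and libvirt's documented minimum of 2 (these names are illustrative, not actual CloudStack code):

```python
# Scale the computed shares down by a fixed factor before handing them
# to libvirt, clamping to libvirt's documented minimum of 2.
FACTOR = 100

def scaled_shares(vcpus, speed_mhz, factor=FACTOR):
    return max(2, (vcpus * speed_mhz) // factor)

print(scaled_shares(2, 2000))    # 40   (4000 MHz / 100)
print(scaled_shares(4, 500))     # 20   (2000 MHz / 100)
print(scaled_shares(128, 2000))  # 2560 (256000 MHz / 100)
```

Note that with a divisor of 100, any offering totaling less than 200 MHz collapses to the minimum share value of 2, so such offerings become indistinguishable from each other.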

mlsorensen avatar Sep 16 '22 15:09 mlsorensen

> The bigger problem is really that this 100 core host has maybe 200 GHz worth of CPU to allocate, and with 1 MHz CPU offerings cloudstack calculates that you have only scheduled 100 MHz to it! The allocators will quickly overload the system with more VMs. And the whole concept of CPUs having differing speeds is thrown out if we just map 1 CPU to 1 MHz.

I currently can't see how this might happen, as the shares aren't the only factor (if they are really used at all) in CloudStack's allocation decisions. Each of the deployed instances using the compute offering will at least block 1 core. So after deploying a theoretical 100 machines using the 1 core / 1 MHz offering, the host is "full". Therefore one might check the global setting cpu.overprovisioning.factor. The mapping of CPU (in MHz) works quite well as long as n.Cores * CPU (in MHz) does not exceed 262144, which means the VM you would like to deploy shouldn't have more than 262144 MHz allocated; I guess that could be sufficient for most use cases ;-) I can see why 10000 as a reference value won't be sufficient here, as @correajl wrote. In my deployment I have a base CPU frequency of 2650 MHz, so I could only deploy a 3-core VM with 2650 MHz. That isn't much...

Regarding the use of the value 'shares' for libvirt: I don't think that the approach will work out, as the overall value for shares can in theory be 262144 for each domain, which would mean each domain has the same CPU time. 262144 doesn't represent the actual available CPU time; it's used to generate a proportional weighted share. It weights not the actual available resources but the share between the different domains. A short example based on the libvirt docs:

Domain A - shares = 1024
Domain B - shares = 2048
Domain C - shares = 4096

Domain B shall have 2x the CPU time of Domain A. Domain C shall have 4x the CPU time of Domain A and 2x the CPU time of Domain B.

To allocate the CPU time, one sums the factors to get the number of "slots" and then calculates the actual CPU time. For the example: 1 (Domain A) + 2 (Domain B) + 4 (Domain C) = 7 slots

And then: available CPU time / 7 slots * factor for each domain = CPU time for each domain (using '100s' to represent the overall available CPU time for simplification)

100s / 7 = 14.28s

Domain A: 14.28 * 1 = 14.28s
Domain B: 14.28 * 2 = 28.56s
Domain C: 14.28 * 4 = 57.12s
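The proportional-share math above can be sketched as follows (the tiny differences from the figures above come from truncating versus rounding 100/7):

```python
# Each domain's CPU time is its fraction of the summed weights,
# not an absolute quantity.
def cpu_time(shares_by_domain, total_time_s=100.0):
    total = sum(shares_by_domain.values())
    return {d: total_time_s * s / total for d, s in shares_by_domain.items()}

times = cpu_time({"A": 1024, "B": 2048, "C": 4096})
print(round(times["A"], 2))  # 14.29
print(round(times["B"], 2))  # 28.57
print(round(times["C"], 2))  # 57.14
```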

This works quite well with CPU (in MHz) mapped to the value of shares in the domain XML, as long as (as written above) n.Cores * CPU (in MHz) does not exceed 262144. But keep in mind that this is only relevant when you start overprovisioning. At least the CloudStack docs state that without overprovisioning the values for 'shares' aren't of importance:

> This value is also passed to the hypervisor as a share value to give VMs relative priority when a hypervisor host is over-provisioned.

Hudratronium avatar Sep 17 '22 10:09 Hudratronium

> I currently can't see how this might happen, as the shares aren't the only factor (if they are really used at all) in CloudStack's allocation decisions. Each of the deployed instances using the compute offering will at least block 1 core.

Thanks for your reply @Hudratronium. Shares themselves aren't used for allocation decisions as they're a KVM implementation detail. However, setting 1MHz as your CPU speed in the offering as an effort to manipulate the resulting shares means that an 8 core VM is only going to subtract 8MHz of capacity from your multi-Ghz host.

Unless something has changed, CloudStack will not block 1 physical core on the host per vCPU. The cloudstack allocators just sum the MHz available on the host (host cores * speed) and sum the MHz required to run the offering, and subtract each instance's required MHz from the available MHz on the hypervisor host. I know that at some point in the last decade someone added the concept of "CPU cores" as a distinct resource to CloudStack, visible on the dashboard, but you can easily exceed that number.

There is a condition that ensures the number of vCPUs is <= the number of physical cores on the host, but you could deploy any number of 8vCPU, 1Mhz VMs on an 8 physical core host. As a side note, to me it seems kind of broken actually (or should be configurable) to require the host to have at least as many physical cores as the VM has vCPUs, but I digress :-)

See this example. I have created an 8vCPU 1MHz service offering, and I have an 8 core KVM host:

> list hosts id=617cb1f7-7729-49d8-96ec-f412aca539a4 filter=cpuallocatedvalue,cpunumber,cpusockets,cpuspeed,cpuwithoverprovisioning
{
  "count": 1,
  "host": [
    {
      "cpuallocatedvalue": 16,
      "cpunumber": 8,
      "cpusockets": 1,
      "cpuspeed": 2800,
      "cpuwithoverprovisioning": "22400"
    }
  ]
}

See 8 CPUs * 2.8 GHz = 22400 total MHz to allocate. It has 16 MHz allocated right now. If I look at which VMs are allocated, I have two of these 8 vCPU 1 MHz instances: 16 vCPUs (thus 16 MHz allocated) on an 8 CPU host.

> list virtualmachines hostid=617cb1f7-7729-49d8-96ec-f412aca539a4 filter=cpunumber,cpuspeed,instancename
{
  "count": 2,
  "virtualmachine": [
    {
      "cpunumber": 8,
      "cpuspeed": 1,
      "instancename": "i-2-131-VM"
    },
    {
      "cpunumber": 8,
      "cpuspeed": 1,
      "instancename": "i-2-130-VM"
    }
  ]
}

I could launch potentially thousands of these 1MHz VMs with the 22400MHz the host has (it would run out of memory first).
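The allocator arithmetic above, as a sketch (numbers taken from the `list hosts` output; capacity and usage are both plain MHz sums):

```python
# With 1 MHz offerings, each VM barely dents a multi-GHz host's capacity.
host_mhz = 8 * 2800          # cpunumber * cpuspeed = 22400
vm_mhz = 8 * 1               # 8 vCPU x 1 MHz offering

deployed = 2
allocated = deployed * vm_mhz
print(host_mhz)              # 22400
print(allocated)             # 16
print(host_mhz // vm_mhz)    # 2800 such VMs would "fit" by MHz alone
```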

> I don't think that the approach will work out, as the overall value for shares can in theory be 262144 for each domain, which would mean each domain has the same CPU time. 262144 doesn't represent the actual available CPU time; it's used to generate a proportional weighted share, weighting not the actual available resources but the share between the different domains. A short example based on the libvirt docs:

Yes, this is why a scale down should work fine, as the proportion is what is important, not the raw value. Just scaling down by a factor of 100 when defining the shares in libvirt should work. The main issue is lack of granularity, as anything lower than a 100 MHz offering would still get 1 share. Probably not a real issue :-)

> But keep in mind that this is only relevant when you start overprovisioning. At least the CloudStack docs state that without overprovisioning the values for 'shares' aren't of importance

It's also relevant (or should be if it is working properly) with CPU cap enabled on service offerings. This limits the VM CPU to a certain amount of runtime quota per period (using the cgroup CFS settings) regardless of whether there is contention on the system. This can be useful to provide consistency in experience - for example you can limit VM performance to 1/2 of a physical core and in theory would always be the same level of performance so long as you don't over provision greater than 2x, rather than bursting to consume whole physical cores when idle.
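As an illustration only (this is not CloudStack's actual implementation), a CFS-style cap could compute a per-period quota proportional to vCPUs * MHz, which is what makes the 500 MHz vs 2000 MHz proportion above hold; `cfs_quota_us` and the 2000 MHz host speed are hypothetical:

```python
# Quota per CFS period, proportional to the offering's vCPUs * MHz.
def cfs_quota_us(vcpus, speed_mhz, host_speed_mhz, period_us=100_000):
    # fraction of one host core's runtime this offering is entitled to
    return int(period_us * vcpus * speed_mhz / host_speed_mhz)

host = 2000  # hypothetical host core speed in MHz
print(cfs_quota_us(1, 500, host))   # 25000: 1/4 of a core per period
print(cfs_quota_us(1, 2000, host))  # 100000: a full core per period
```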

mlsorensen avatar Sep 19 '22 17:09 mlsorensen

> Unless something has changed, CloudStack will not block 1 physical core on the host per vCPU. The cloudstack allocators just sum the MHz available on the host (host cores * speed) and sum the MHz required to run the offering, and subtract each instance's required MHz from the available MHz on the hypervisor host. I know that at some point in the last decade someone added the concept of "CPU cores" as a distinct resource to CloudStack, visible on the dashboard, but you can easily exceed that number.

You are completely right - looking at a trace log I have to admit that I missed the step where the "definition" of CPU switched from cores & speed to the product of cores * frequency for allocation purposes. Reading the documentation, I would have expected a check between the allocated cores and the provided physical resources, especially since CPU OEMs introduced hyperthreading and one might have to deal with "vCPUs passed to hypervisors".

> Yes, this is why a scale down should work fine, as the proportion is what is important, not the raw value. Just scaling down by a factor of 100 when defining the shares in libvirt should work. The main issue is lack of granularity, as anything lower than a 100 MHz offering would still get 1 share. Probably not a real issue :-)

Not that I think this will ever happen, but you would need to use at least 200 MHz; otherwise the value would be too low to be accepted as a 'shares' value :-D

Hudratronium avatar Sep 19 '22 21:09 Hudratronium

@Hudratronium @mlsorensen @correajl Guys, most of this discussion has gone over my head, but it sounds like a good thing to solve before 4.18. Is there any plan by anybody to implement this?

DaanHoogland avatar Sep 27 '22 09:09 DaanHoogland

Sadly I am lacking the programming skills to work on this... Nevertheless, since CloudStack now officially supports Ubuntu 22.04, this could become important.

Hudratronium avatar Sep 28 '22 20:09 Hudratronium

assigned to myself.

Currently, the CPU shares of a VM = cpu cores * cpu speed (the speed can be regarded as the CPU weight factor).

As @correajl mentioned, according to the libvirt documentation the value should be in the range [2, 262144]. The VM speed cannot be larger than the physical CPU speed of the KVM host (normally between 2000 and 4000 MHz), so if the number of CPU cores is larger than 131 (speed=2000) or 65 (speed=4000), the CPU shares will exceed 262144.

my idea is,

  • Add a global setting, e.g. `kvm.cpu.shares.factor` (1 by default; it could be a cluster setting).
  • CPU shares will be calculated as cpu shares = cpu cores * cpu speed / kvm.cpu.shares.factor.
  • When the global setting is updated, it will be propagated to the KVM agents, which update the CPU shares of all VMs.
  • When a KVM agent connects, it gets the factor from the management server and updates the CPU shares of all VMs on the host.
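A sketch of the proposed calculation (the clamp to 10000 is my assumption, based on the cgroup v2 limit discussed in this thread, and `vm_shares` is a hypothetical helper, not CloudStack code):

```python
# Proposed: divide the raw shares by a configurable factor, then clamp
# to the valid range (2 is libvirt's minimum, 10000 the cgroup v2 maximum).
CGROUP_V2_MAX = 10000

def vm_shares(cores, speed_mhz, shares_factor=1):
    shares = cores * speed_mhz // shares_factor
    return min(max(shares, 2), CGROUP_V2_MAX)

print(vm_shares(66, 4000))                     # 10000 (clamped from 264000)
print(vm_shares(66, 4000, shares_factor=100))  # 2640
```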

Any suggestions? @mlsorensen @correajl @Hudratronium @DaanHoogland

weizhouapache avatar Jan 13 '23 19:01 weizhouapache

It is even worse: as @correajl pointed out, the max value for OSs using cgroup v2 is 10000.

IMHO that should work, but it will need some checks to not exceed the value of 10000, and you will have to take care when rounding the resulting values - which could lead to some kind of round-off/-up error compared to the really available speeds for a machine (meaning you could always get a little bit "less" performance than specified in a service offering).

It would work as a kind of workaround at the moment, but depending on the scale-down factor, the impact for "smallish" CPU offerings could be quite big - and it seems we can't avoid this as long as Ubuntu shall be supported as a host OS.

I can't see a really nice solution, as at some point the concepts of 'representing physical capabilities' and using 'proportional weighted shares' will lead to some kind of trade-off - even if one does the math and logic on a 'per host / number of VMs / CPU capabilities' basis, there will still be trade-offs...

Hudratronium avatar Jan 13 '23 20:01 Hudratronium