
runtime-rs: CPU hotplug tests failing with qemu-runtime-rs

Open wainersm opened this issue 1 year ago • 10 comments

On https://github.com/kata-containers/kata-containers/pull/9833 we started running k8s tests for qemu-runtime-rs (the qemu implementation for runtime-rs) on AKS. At that time the following tests were disabled due to the lack of CPU hotplug support in the qemu implementation: k8s-cpu-ns.bats, k8s-number-cpus.bats and k8s-sandbox-vcpus-allocation.bats.

Soon after, https://github.com/kata-containers/kata-containers/pull/9772 introduced basic CPU hotplug support in qemu-runtime-rs. So it's time to enable those tests on CI; that is, if they were working, which isn't the case.

I ran k8s-cpu-ns.bats and k8s-number-cpus.bats locally but both failed. I didn't try k8s-sandbox-vcpus-allocation.bats, but it should still fail for a reason unrelated to hotplugging: a runtime-rs algorithm implementation that differs from the golang runtime's. Let's defer the k8s-sandbox-vcpus-allocation.bats discussion to another occasion.

Allow me to narrow the analysis down to the k8s-cpu-ns.bats failure first:

[run_kubernetes_tests.sh:114] INFO: Executing k8s-cpu-ns.bats
k8s-cpu-ns.bats
 ✗ Check CPU constraints
   (in test file k8s-cpu-ns.bats, line 78)
     `[ "$total_cpus_container" -eq "$total_cpus" ]' failed
   [bats-exec-test:67] INFO: k8s configured to use runtimeclass
   pod/constraints-cpu-test created
   pod/constraints-cpu-test condition met
   Check the total of cpus is equal to 2
   total_cpus_container=1
   total_cpus_container=1
   total_cpus_container=1
   total_cpus_container=1
   total_cpus_container=1
   total_cpus_container=1
   total_cpus_container=1
   total_cpus_container=1
   total_cpus_container=1
   total_cpus_container=1
   Name:                constraints-cpu-test
   Namespace:           kata-containers-k8s-tests
   Priority:            0
   Runtime Class Name:  kata
   Service Account:     default
   Node:                aks-nodepool1-38648056-vmss000000/10.224.0.4
   Start Time:          Tue, 18 Jun 2024 16:19:37 -0300
   Labels:              <none>
   Annotations:         <none>
   Status:              Running
   IP:                  10.244.0.17
   IPs:
     IP:  10.244.0.17
   Containers:
     first-cpu-container:
       Container ID:  containerd://dea5df761ace18864d5f2ebab80ae7d0a9046d3c8397178a1c8649ffc23a1a22
       Image:         quay.io/prometheus/busybox:latest
       Image ID:      quay.io/prometheus/busybox@sha256:dfa54ef35e438b9e71ac5549159074576b6382f95ce1a434088e05fd6b730bc4
       Port:          <none>
       Host Port:     <none>
       Command:
         sleep
         30
       State:          Running
         Started:      Tue, 18 Jun 2024 16:19:39 -0300
       Ready:          True
       Restart Count:  0
       Limits:
         cpu:  1
       Requests:
         cpu:        500m
       Environment:  <none>
       Mounts:
         /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jf46p (ro)
   Conditions:
     Type              Status
     Initialized       True
     Ready             True
     ContainersReady   True
     PodScheduled      True
   Volumes:
     kube-api-access-jf46p:
       Type:                    Projected (a volume that contains injected data from multiple sources)
       TokenExpirationSeconds:  3607
       ConfigMapName:           kube-root-ca.crt
       ConfigMapOptional:       <nil>
       DownwardAPI:             true
   QoS Class:                   Burstable
   Node-Selectors:              katacontainers.io/kata-runtime=true
   Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
   Events:
     Type    Reason     Age               From               Message
     ----    ------     ----              ----               -------
     Normal  Scheduled  33s               default-scheduler  Successfully assigned kata-containers-k8s-tests/constraints-cpu-test to aks-nodepool1-38648056-vmss000000
     Normal  Pulled     31s               kubelet            Successfully pulled image "quay.io/prometheus/busybox:latest" in 131ms (131ms including waiting)
     Normal  Pulling    1s (x2 over 31s)  kubelet            Pulling image "quay.io/prometheus/busybox:latest"
     Normal  Created    0s (x2 over 31s)  kubelet            Created container first-cpu-container
     Normal  Started    0s (x2 over 31s)  kubelet            Started container first-cpu-container
     Normal  Pulled     0s                kubelet            Successfully pulled image "quay.io/prometheus/busybox:latest" in 111ms (111ms including waiting)
   pod "constraints-cpu-test" deleted

1 test, 1 failure

As you can see above, the test checks the total number of CPUs allocated to the container 10 times (by exec'ing into the container and simply parsing /proc/cpuinfo). The container's CPU count is expected to reach 2 but it's always 1, so the test fails.

Here is the pod-cpu.yaml deployment:

apiVersion: v1
kind: Pod
metadata:
  name: constraints-cpu-test
spec:
  terminationGracePeriodSeconds: 0
  runtimeClassName: kata
  containers:
  - name: first-cpu-container
    image: quay.io/prometheus/busybox:latest
    command:
        - sleep
        - "30"
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "500m"

And the runtimeclass overhead definitions:

Name:         kata-qemu-runtime-rs
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  node.k8s.io/v1
Handler:      kata-qemu-runtime-rs
Kind:         RuntimeClass
Metadata:
  Creation Timestamp:  2024-06-18T18:16:39Z
  Resource Version:    2103
  UID:                 edb4072b-0689-4fef-af4a-64f05fddff07
Overhead:
  Pod Fixed:
    Cpu:     250m
    Memory:  160Mi
Scheduling:
  Node Selector:
    katacontainers.io/kata-runtime:  true
Events:                              <none>

I'm using the quay.io/kata-containers/kata-deploy-ci@sha256:87a19392513d869ab393f59dfebf9aa9ee1c8410c2a3d4fac07852b9cce3597d image, which was built 4 hours ago and should contain the CPU hotplug bits.

wainersm avatar Jun 18 '24 20:06 wainersm

The journal messages (journalctl -t containerd | grep kata-runtime): logs.txt

wainersm avatar Jun 18 '24 22:06 wainersm

Hey @wainersm , thanks for the comprehensive report!

Looking at the journal it's apparent that qemu-rs's vCPU hotplugging capability wasn't even called. The sandbox starts with a single coldplugged vCPU and a max of 2. No request for qemu-rs to hotplug additional vCPU(s) is ever made afterwards.

Looking at the k8s side of this, it appears that the container requests half a vCPU and limits itself to a single vCPU (is that correct? I'm not quite familiar with this ATM).

If that's the case I actually wouldn't expect this setup to hotplug vCPUs. runtime-rs counts cold-plugged vCPUs towards the desired vCPU count and indeed, when I test with a single-container pod where the container requests say 4 vCPUs, runtime-rs asks the hypervisor to hotplug only 3 vCPUs since it has one already.

So in the present test case, if the container requests 500m of vCPU and limits itself to one, runtime-rs won't see a need to hotplug anything since it has one (cold-plugged) vCPU already.

If that's correct then this test should fail with other runtime-rs hypervisors (dragonball, CH) as well. I'm also wondering if the go shim behaviour in this scenario is different - is it possible that it doesn't count cold-plugged vCPU(s) toward container requirements?

pmores avatar Jun 19 '24 09:06 pmores

Hi @pmores !

Hey @wainersm , thanks for the comprehensive report!

ywc!

Looking at the journal it's apparent that qemu-rs's vCPU hotplugging capability wasn't even called. The sandbox starts with a single coldplugged vCPU and a max of 2. No request for qemu-rs to hotplug additional vCPU(s) is ever made afterwards.

I noticed that too.

Looking at the k8s side of this, it appears that the container requests half a vCPU and limits itself to a single vCPU (is that correct? I'm not quite familiar with this ATM).

That's how I understood the test too.

If that's the case I actually wouldn't expect this setup to hotplug vCPUs. runtime-rs counts cold-plugged vCPUs towards the desired vCPU count and indeed, when I test with a single-container pod where the container requests say 4 vCPUs, runtime-rs asks the hypervisor to hotplug only 3 vCPUs since it has one already.

So in the present test case, if the container requests 500m of vCPU and limits itself to one, runtime-rs won't see a need to hotplug anything since it has one (cold-plugged) vCPU already.

But isn't this one (cold-plugged) vCPU counted just as overhead?

If that's correct then this test should fail with other runtime-rs hypervisors (dragonball, CH) as well. I'm also wondering if the go shim behaviour in this scenario is different - is it possible that it doesn't count cold-plugged vCPU(s) toward container requirements?

It might indeed be a difference in behavior between the go and rust shims, resulting in the test failing on the latter.

The k8s-cpu-ns.bats test is also disabled on both cloud-hypervisor and dragonball. For cloud-hypervisor, there is https://github.com/kata-containers/kata-containers/issues/9039, which unfortunately doesn't provide any insights; as for dragonball, it seems that the test was disabled while waiting for the vCPU resize implementation for dragonball, and apparently re-enabling it was forgotten (https://github.com/kata-containers/tests/pull/5392).

I will give it a try with dragonball and keep you posted.

wainersm avatar Jun 19 '24 19:06 wainersm

Same error with dragonball: the container's vCPU count is constantly 1:

[run_kubernetes_tests.sh:114] INFO: Executing k8s-cpu-ns.bats
k8s-cpu-ns.bats
 ✗ Check CPU constraints
   (in test file k8s-cpu-ns.bats, line 77)
     `[ "$total_cpus_container" -eq "$total_cpus" ]' failed
   [bats-exec-test:67] INFO: k8s configured to use runtimeclass
   pod/constraints-cpu-test created
   pod/constraints-cpu-test condition met
   Check the total of cpus is equal to 2
   total_cpus_container=1
   total_cpus_container=1
   total_cpus_container=1
   total_cpus_container=1
   total_cpus_container=1
   total_cpus_container=1
   total_cpus_container=1
   total_cpus_container=1
   total_cpus_container=1
   total_cpus_container=1

wainersm avatar Jun 19 '24 19:06 wainersm

Thanks for the additional info Wainer!

As for other hypervisors in runtime-rs, CH doesn't seem to have a vCPU resizing implementation at the moment so it makes sense that the test is disabled. Dragonball does seem to have it but it's subject to the same general runtime-rs vCPU counting logic as qemu so the test is expected to fail as you confirmed it does.

As for runtime-go's logic, I'm unfortunately not very familiar with that code but here's what I can tell (please double-check me as I might well be wrong :-)): the required vCPU count is obtained in Sandbox::UpdateResources() by summing up individual container requirements and adding the number of cold-plugged vCPUs right afterwards. This number is then passed to the hypervisor's ResizeVCPUs().

Now qemu's ResizeVCPUs() implementation starts off by computing the current vCPU count as a sum of the number of cold-plugged vCPUs and currently hotplugged ones. Both vCPU counts (the required count supplied by the caller and current count just computed) are then compared and subtracted to get the number of vCPUs to hot(un)plug.

Since the number of cold-plugged vCPUs is part of both the required and the current count as shown above, it cancels out during the subtraction and is thus effectively ignored by the hotplugging logic of runtime-go. runtime-rs's behaviour could be replicated in runtime-go if Sandbox::UpdateResources() didn't include the number of cold-plugged vCPUs in the required vCPU count.
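
The arithmetic can be modeled like this (illustrative sketch only, simplified from the description above; names are made up, not the actual kata source):

```go
package main

import "fmt"

// Simplified model of the two accounting schemes described above.

// runtime-go: Sandbox::UpdateResources adds the cold-plugged vCPUs to the
// sum of container requirements, and ResizeVCPUs counts them again on the
// current side, so they cancel out.
func hotplugCountGo(containerVCPUs, coldPlugged, hotPlugged int) int {
	required := containerVCPUs + coldPlugged
	current := coldPlugged + hotPlugged
	return required - current // == containerVCPUs - hotPlugged
}

// runtime-rs: cold-plugged vCPUs count toward the container requirement,
// so one cold-plugged vCPU satisfies a 1-vCPU container limit outright.
func hotplugCountRs(containerVCPUs, coldPlugged, hotPlugged int) int {
	required := containerVCPUs
	current := coldPlugged + hotPlugged
	return required - current
}

func main() {
	// The failing test case: container limit of 1 vCPU, default_vcpus=1,
	// nothing hotplugged yet:
	fmt.Println(hotplugCountGo(1, 1, 0)) // 1 -> the go shim hotplugs a vCPU
	fmt.Println(hotplugCountRs(1, 1, 0)) // 0 -> the rust shim does nothing
}
```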

pmores avatar Jun 20 '24 07:06 pmores

Now the question is how to deal with this difference. In general, it seems to me that either one of the approaches is overall better, in which case both runtimes should implement that one, or they are different but equal, in which case CI needs to reflect that by having slightly different tests for runtime-go & runtime-rs. Unfortunately I'm not sure which one it is. :-)

pmores avatar Jun 20 '24 07:06 pmores

Thanks for the additional info Wainer!

As for other hypervisors in runtime-rs, CH doesn't seem to have a vCPU resizing implementation at the moment so it makes sense that the test is disabled. Dragonball does seem to have it but it's subject to the same general runtime-rs vCPU counting logic as qemu so the test is expected to fail as you confirmed it does.

As for runtime-go's logic, I'm unfortunately not very familiar with that code but here's what I can tell (please double-check me as I might well be wrong :-)): the required vCPU count is obtained in Sandbox::UpdateResources() by summing up individual container requirements and adding the number of cold-plugged vCPUs right afterwards. This number is then passed to the hypervisor's ResizeVCPUs().

I'm not familiar with the runtime (go or rust) code nor the vcpu calculation algorithm, so here I'm wildly guessing.

One question that I have is whether the default_vcpus=1 is reserved by Kata (agent...etc) or shared with container's demands.

The situation that we have here is:

  • 1 vcpu is cold-plugged
  • 250m vcpu is the pod overhead (IIUC this is the value reserved for kata's processes in the guest)
  • 500m vcpu is requested by the container
  • 1 vcpu is the limit imposed on the container

For the initial request of 500m + 250m vcpus, the 1 cold-plugged vcpu is enough. But the container is limited to 1 vcpu, so it should then allocate (hot-plug) one more vcpu in case the container reaches the 1 vcpu limit (+ 250m of overhead). Did I get it wrong?

Might the difference be the RoundUpNumVCPUs call in oldCPUs, newCPUs, err := s.hypervisor.ResizeVCPUs(ctx, RoundUpNumVCPUs(sandboxVCPUs)) at https://github.com/kata-containers/kata-containers/blob/0f2a4d202e90b39b50074725b2cfe9c3088a4e20/src/runtime/virtcontainers/sandbox.go#L2286 (go runtime)?
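
If RoundUpNumVCPUs does what its name suggests, it's a ceiling conversion from fractional CPU quantities to whole vCPUs, something like this (hypothetical sketch; I haven't checked the actual signature or implementation):

```go
package main

import "fmt"

// roundUpVCPUs rounds a millicpu quantity up to whole vCPUs -- a
// hypothetical sketch of what a RoundUpNumVCPUs-style helper might do,
// not the actual kata implementation.
func roundUpVCPUs(milliCPUs int64) uint32 {
	return uint32((milliCPUs + 999) / 1000)
}

func main() {
	fmt.Println(roundUpVCPUs(750))  // 500m request + 250m overhead -> 1 vCPU
	fmt.Println(roundUpVCPUs(1250)) // 1 vCPU limit + 250m overhead -> 2 vCPUs
}
```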

@Apokleos @studychao could you please enlighten us with your knowledge?

wainersm avatar Jun 21 '24 20:06 wainersm

One question that I have is whether the default_vcpus=1 is reserved by Kata (agent...etc) or shared with container's demands.

To sum up, say I have default_vcpus=1 in my configuration.toml and launch a two-container pod with cpu limits of 4 and 2 vCPUs, respectively.

With runtime-rs I'll get a kata VM with 6 vCPUs. This is supported by observation and source code reading as well (linked above).

With runtime-go the VM will most likely get 7 vCPUs. This is supported by source code reading (also linked above) and also the results of the test being discussed.
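
The two-container example works out like this (simple arithmetic modeling the two behaviours described above; illustrative names, not actual kata code):

```go
package main

import "fmt"

// Total VM vCPUs for a pod whose containers' CPU limits sum to
// containerLimits, with default_vcpus=coldPlugged cold-plugged at boot.

// runtime-rs counts cold-plugged vCPUs toward the container requirement.
func totalVCPUsRs(containerLimits, coldPlugged int) int {
	return containerLimits // the cold-plugged vCPU is one of these
}

// runtime-go adds the cold-plugged vCPUs on top of the requirement.
func totalVCPUsGo(containerLimits, coldPlugged int) int {
	return containerLimits + coldPlugged
}

func main() {
	// Two containers with limits of 4 and 2 vCPUs, default_vcpus=1:
	fmt.Println(totalVCPUsRs(4+2, 1)) // 6
	fmt.Println(totalVCPUsGo(4+2, 1)) // 7
}
```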

For the initial request of 500m + 250m vcpus, the 1 cold-plugged vcpu is enough. But the container is limited to 1 vcpu, so it should then allocate (hot-plug) one more vcpu in case the container reaches the 1 vcpu limit (+ 250m of overhead). Did I get it wrong?

Frankly, I'm not sure which behaviour is (more) correct. I couldn't find a formal definition of the k8s container resource specification, but from the description of how it's used it seems that request is what the container is guaranteed to get while limit is what it might be given if resources are available.

If that's correct, then in the example you give I think no rules of k8s resource management would be broken if the resulting VM had just one vCPU, since that comfortably accommodates the 250m + 500m requirement. Granted, it would effectively lower the container's limit to 750m, but the container was never guaranteed to get anything over its request in the first place. (The same thing seems to happen when the k8s scheduler puts a pod on a node that can barely accommodate the pod's requests.)

That's my understanding anyway.

pmores avatar Jun 24 '24 08:06 pmores

One question that I have is whether the default_vcpus=1 is reserved by Kata (agent...etc) or shared with container's demands.

To sum up, say I have default_vcpus=1 in my configuration.toml and launch a two-container pod with cpu limits of 4 and 2 vCPUs, respectively.

With runtime-rs I'll get a kata VM with 6 vCPUs. This is supported by observation and source code reading as well (linked above).

With runtime-go the VM will most likely get 7 vCPUs. This is supported by source code reading (also linked above) and also the results of the test being discussed.

A few days later I got this genius idea: "should I look at the project's docs?"

  • https://github.com/kata-containers/kata-containers/blob/main/docs/design/vcpu-handling-runtime-go.md
  • https://github.com/kata-containers/kata-containers/blob/main/docs/design/vcpu-handling-runtime-rs.md

ps: I liked the runtime-go documentation better.

I also realized that the overhead (agent, etc.) is not used to compute the vcpu size.

For the initial request of 500m + 250m vcpus, the 1 cold-plugged vcpu is enough. But the container is limited to 1 vcpu, so it should then allocate (hot-plug) one more vcpu in case the container reaches the 1 vcpu limit (+ 250m of overhead). Did I get it wrong?

Frankly, I'm not sure which behaviour is (more) correct. I couldn't find a formal definition of the k8s container resource specification, but from the description of how it's used it seems that request is what the container is guaranteed to get while limit is what it might be given if resources are available.

If that's correct, then in the example you give I think no rules of k8s resource management would be broken if the resulting VM had just one vCPU, since that comfortably accommodates the 250m + 500m requirement. Granted, it would effectively lower the container's limit to 750m, but the container was never guaranteed to get anything over its request in the first place. (The same thing seems to happen when the k8s scheduler puts a pod on a node that can barely accommodate the pod's requests.)

That's my understanding anyway.

wainersm avatar Jun 26 '24 20:06 wainersm

Ok, in the face of the new discoveries... I think we can do either:

  • keep skipping the tests, only adjusting the skip message to state the right reason the tests are disabled
  • when runtime-rs takes over, re-write the tests and enable them
  • adapt the tests to handle both the rust and go runtime implementations

Any preference?

And close this issue as "NOT A BUG".

wainersm avatar Jun 26 '24 20:06 wainersm

Closing as not a bug.

wainersm avatar Jul 15 '24 18:07 wainersm