runtime-rs: CPU hotplug tests failing with qemu-runtime-rs
In https://github.com/kata-containers/kata-containers/pull/9833 we started running k8s tests for qemu-runtime-rs (the qemu implementation for runtime-rs) on AKS. At that time the following tests were disabled due to the lack of CPU hotplug support in the qemu implementation: k8s-cpu-ns.bats, k8s-number-cpus.bats and k8s-sandbox-vcpus-allocation.bats.
Soon after, https://github.com/kata-containers/kata-containers/pull/9772 introduced basic CPU hotplug support in qemu-runtime-rs. So it would be time to enable those tests on CI, if only they were passing, which isn't the case.
I ran k8s-cpu-ns.bats and k8s-number-cpus.bats locally but both failed. I didn't try k8s-sandbox-vcpus-allocation.bats, but it's expected to still fail for a reason not related to hotplugging but rather to a runtime-rs algorithm implementation that differs from the golang runtime's. Let's defer the k8s-sandbox-vcpus-allocation.bats discussion to another occasion.
Allow me to narrow down the analysis to the k8s-cpu-ns.bats failure first:
[run_kubernetes_tests.sh:114] INFO: Executing k8s-cpu-ns.bats
k8s-cpu-ns.bats
✗ Check CPU constraints
(in test file k8s-cpu-ns.bats, line 78)
`[ "$total_cpus_container" -eq "$total_cpus" ]' failed
[bats-exec-test:67] INFO: k8s configured to use runtimeclass
pod/constraints-cpu-test created
pod/constraints-cpu-test condition met
Check the total of cpus is equal to 2
total_cpus_container=1
total_cpus_container=1
total_cpus_container=1
total_cpus_container=1
total_cpus_container=1
total_cpus_container=1
total_cpus_container=1
total_cpus_container=1
total_cpus_container=1
total_cpus_container=1
Name: constraints-cpu-test
Namespace: kata-containers-k8s-tests
Priority: 0
Runtime Class Name: kata
Service Account: default
Node: aks-nodepool1-38648056-vmss000000/10.224.0.4
Start Time: Tue, 18 Jun 2024 16:19:37 -0300
Labels: <none>
Annotations: <none>
Status: Running
IP: 10.244.0.17
IPs:
IP: 10.244.0.17
Containers:
first-cpu-container:
Container ID: containerd://dea5df761ace18864d5f2ebab80ae7d0a9046d3c8397178a1c8649ffc23a1a22
Image: quay.io/prometheus/busybox:latest
Image ID: quay.io/prometheus/busybox@sha256:dfa54ef35e438b9e71ac5549159074576b6382f95ce1a434088e05fd6b730bc4
Port: <none>
Host Port: <none>
Command:
sleep
30
State: Running
Started: Tue, 18 Jun 2024 16:19:39 -0300
Ready: True
Restart Count: 0
Limits:
cpu: 1
Requests:
cpu: 500m
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jf46p (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-jf46p:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: katacontainers.io/kata-runtime=true
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 33s default-scheduler Successfully assigned kata-containers-k8s-tests/constraints-cpu-test to aks-nodepool1-38648056-vmss000000
Normal Pulled 31s kubelet Successfully pulled image "quay.io/prometheus/busybox:latest" in 131ms (131ms including waiting)
Normal Pulling 1s (x2 over 31s) kubelet Pulling image "quay.io/prometheus/busybox:latest"
Normal Created 0s (x2 over 31s) kubelet Created container first-cpu-container
Normal Started 0s (x2 over 31s) kubelet Started container first-cpu-container
Normal Pulled 0s kubelet Successfully pulled image "quay.io/prometheus/busybox:latest" in 111ms (111ms including waiting)
pod "constraints-cpu-test" deleted
1 test, 1 failure
As you can see above, the test checks the total of CPUs allocated to the container 10 times (by exec'ing into the container and simply parsing /proc/cpuinfo). The total of container CPUs is expected to reach "2", but it's always "1", so the test fails.
Here is the pod-cpu.yaml deployment:
apiVersion: v1
kind: Pod
metadata:
name: constraints-cpu-test
spec:
terminationGracePeriodSeconds: 0
runtimeClassName: kata
containers:
- name: first-cpu-container
image: quay.io/prometheus/busybox:latest
command:
- sleep
- "30"
resources:
limits:
cpu: "1"
requests:
cpu: "500m"
And the runtimeclass overhead definitions:
Name: kata-qemu-runtime-rs
Namespace:
Labels: <none>
Annotations: <none>
API Version: node.k8s.io/v1
Handler: kata-qemu-runtime-rs
Kind: RuntimeClass
Metadata:
Creation Timestamp: 2024-06-18T18:16:39Z
Resource Version: 2103
UID: edb4072b-0689-4fef-af4a-64f05fddff07
Overhead:
Pod Fixed:
Cpu: 250m
Memory: 160Mi
Scheduling:
Node Selector:
katacontainers.io/kata-runtime: true
Events: <none>
Using the quay.io/kata-containers/kata-deploy-ci@sha256:87a19392513d869ab393f59dfebf9aa9ee1c8410c2a3d4fac07852b9cce3597d image that was built 4 hours ago and should contain the cpu hotplug bits.
The journal messages (journalctl -t containerd | grep kata-runtime): logs.txt
Hey @wainersm , thanks for the comprehensive report!
Looking at the journal it's apparent that qemu-rs's vCPU hotplugging capability wasn't even called. The sandbox starts with a single coldplugged vCPU and a max of 2. No request for qemu-rs to hotplug additional vCPU(s) is ever made afterwards.
Looking at the k8s side of this, it appears that the container requests half a vCPU and limits itself to a single vCPU (is that correct? I'm not quite familiar with this ATM).
If that's the case I actually wouldn't expect this setup to hotplug vCPUs. runtime-rs counts cold-plugged vCPUs towards the desired vCPU count and indeed, when I test with a single-container pod where the container requests say 4 vCPUs, runtime-rs asks the hypervisor to hotplug only 3 vCPUs since it has one already.
So in the present test case, if the container requests 500m of vCPU and limits itself to one, runtime-rs won't see a need to hotplug anything since it has one (cold-plugged) vCPU already.
If that's correct then this test should fail with other runtime-rs hypervisors (dragonball, CH) as well. I'm also wondering if the go shim behaviour in this scenario is different - is it possible that it doesn't count cold-plugged vCPU(s) toward container requirements?
Hi @pmores !
Hey @wainersm , thanks for the comprehensive report!
ywc!
Looking at the journal it's apparent that qemu-rs's vCPU hotplugging capability wasn't even called. The sandbox starts with a single coldplugged vCPU and a max of 2. No request for qemu-rs to hotplug additional vCPU(s) is ever made afterwards.
I noticed that too.
Looking at the k8s side of this, it appears that the container requests half a vCPU and limits itself to a single vCPU (is that correct? I'm not quite familiar with this ATM).
That's how I understood the test too.
If that's the case I actually wouldn't expect this setup to hotplug vCPUs. runtime-rs counts cold-plugged vCPUs towards the desired vCPU count and indeed, when I test with a single-container pod where the container requests say 4 vCPUs, runtime-rs asks the hypervisor to hotplug only 3 vCPUs since it has one already.
So in the present test case, if the container requests 500m of vCPU and limits itself to one, runtime-rs won't see a need to hotplug anything since it has one (cold-plugged) vCPU already.
But isn't this one (cold-plugged) counted as just an overhead?
If that's correct then this test should fail with other runtime-rs hypervisors (dragonball, CH) as well. I'm also wondering if the go shim behaviour in this scenario is different - is it possible that it doesn't count cold-plugged vCPU(s) toward container requirements?
It might indeed be a difference in behavior between the go and rust shims, resulting in the test failing on the latter.
The k8s-cpu-ns.bats test is also disabled on both cloud-hypervisor and dragonball. For cloud-hypervisor, there is this https://github.com/kata-containers/kata-containers/issues/9039 that unfortunately doesn't provide any insights; as for dragonball, it seems that the test was disabled while waiting for the vCPU resize implementation for dragonball, and apparently re-enabling it was forgotten (https://github.com/kata-containers/tests/pull/5392).
I will give it a try with dragonball and keep you posted.
Same error with dragonball: the container vcpu number is constantly one:
[run_kubernetes_tests.sh:114] INFO: Executing k8s-cpu-ns.bats
k8s-cpu-ns.bats
✗ Check CPU constraints
(in test file k8s-cpu-ns.bats, line 77)
`[ "$total_cpus_container" -eq "$total_cpus" ]' failed
[bats-exec-test:67] INFO: k8s configured to use runtimeclass
pod/constraints-cpu-test created
pod/constraints-cpu-test condition met
Check the total of cpus is equal to 2
total_cpus_container=1
total_cpus_container=1
total_cpus_container=1
total_cpus_container=1
total_cpus_container=1
total_cpus_container=1
total_cpus_container=1
total_cpus_container=1
total_cpus_container=1
total_cpus_container=1
Thanks for the additional info Wainer!
As for other hypervisors in runtime-rs, CH doesn't seem to have a vCPU resizing implementation at the moment so it makes sense that the test is disabled. Dragonball does seem to have it but it's subject to the same general runtime-rs vCPU counting logic as qemu so the test is expected to fail as you confirmed it does.
As for runtime-go's logic, I'm unfortunately not very familiar with that code but here's what I can tell (please double-check me as I might well be wrong :-)): the required vCPU count is obtained in Sandbox::UpdateResources() by summing up individual container requirements and adding the number of cold-plugged vCPUs right afterwards. This number is then passed to the hypervisor's ResizeVCPUs().
Now qemu's ResizeVCPUs() implementation starts off by computing the current vCPU count as a sum of the number of cold-plugged vCPUs and currently hotplugged ones. Both vCPU counts (the required count supplied by the caller and current count just computed) are then compared and subtracted to get the number of vCPUs to hot(un)plug.
Since the number of cold-plugged vCPUs is part of both the required and current count as shown above, it cancels out during the subtraction and is thus effectively ignored by the hotplugging logic of runtime-go. runtime-rs's behaviour could be replicated in runtime-go if Sandbox::UpdateResources() didn't include the number of cold-plugged vCPUs in the required vCPU count.
Now the question is how to deal with this difference. In general, it seems to me that either one of the approaches is overall better, in which case both runtimes should implement that one, or they are different but equal, in which case CI needs to reflect that by having slightly different tests for runtime-go & runtime-rs. Unfortunately I'm not sure which one it is. :-)
Thanks for the additional info Wainer!
As for other hypervisors in runtime-rs, CH doesn't seem to have a vCPU resizing implementation at the moment so it makes sense that the test is disabled. Dragonball does seem to have it but it's subject to the same general runtime-rs vCPU counting logic as qemu so the test is expected to fail as you confirmed it does.
As for runtime-go's logic, I'm unfortunately not very familiar with that code but here's what I can tell (please double-check me as I might well be wrong :-)): the required vCPU count is obtained in Sandbox::UpdateResources() by summing up individual container requirements and adding the number of cold-plugged vCPUs right afterwards. This number is then passed to the hypervisor's ResizeVCPUs().
I'm not familiar with the runtime (go or rust) code nor with the vCPU calculation algorithm, so here I'm wildly guessing.
One question that I have is whether the default_vcpus=1 is reserved by Kata (agent...etc) or shared with container's demands.
The situation that we have here is:
- 1 vcpu is cold-plugged
- 250m vcpu is the pod overhead (IIUC this is the value reserved for kata's processes in guest)
- 500m vcpu is request by the container
- 1 vcpu is the limit imposed to the container
For the initial request of 500m + 250m vCPUs, the 1 cold-plugged vCPU is enough. But the container is limited to 1 vCPU, so it should then allocate (hot-plug) one more vCPU in case the container reaches the 1 vCPU limit (+ 250m of overhead). Did I get it wrong?
Could the difference be the RoundUpNumVCPUs call in oldCPUs, newCPUs, err := s.hypervisor.ResizeVCPUs(ctx, RoundUpNumVCPUs(sandboxVCPUs)) in https://github.com/kata-containers/kata-containers/blob/0f2a4d202e90b39b50074725b2cfe9c3088a4e20/src/runtime/virtcontainers/sandbox.go#L2286 (go runtime)?
@Apokleos @studychao could you please enlighten us with your knowledge?
One question that I have is whether the default_vcpus=1 is reserved by Kata (agent...etc) or shared with container's demands.
To sum up, say I have default_vcpus=1 in my configuration.toml and launch a two-container pod with cpu limits of 4 and 2 vCPUs, respectively.
With runtime-rs I'll get a kata VM with 6 vCPUs. This is supported by observation and source code reading as well (linked above).
With runtime-go the VM will most likely get 7 vCPUs. This is supported by source code reading (also linked above) and also the results of the test being discussed.
For the initial request of 500m + 250m vCPUs, the 1 cold-plugged vCPU is enough. But the container is limited to 1 vCPU, so it should then allocate (hot-plug) one more vCPU in case the container reaches the 1 vCPU limit (+ 250m of overhead). Did I get it wrong?
Frankly I'm not sure which behaviour is (more) correct. I couldn't find a formal definition of the k8s container resource specification but from the description of how it's used it seems that request is what the container is guaranteed to get while limit is what it might be given if resources are available.
If that's correct then in the example you give I think no rules of k8s resource management would be broken if the resulting VM had just one vCPU since that comfortably accommodates the 250m + 500m requirement. Granted, it would effectively lower the container's limit to 750m but the container was never guaranteed to get anything over its request in the first place. (The same thing seems to happen when the k8s scheduler puts a pod on a node that can barely accommodate the pod's requests.)
That's my understanding anyway.
One question that I have is whether the default_vcpus=1 is reserved by Kata (agent...etc) or shared with container's demands.
To sum up, say I have default_vcpus=1 in my configuration.toml and launch a two-container pod with cpu limits of 4 and 2 vCPUs, respectively.
With runtime-rs I'll get a kata VM with 6 vCPUs. This is supported by observation and source code reading as well (linked above).
With runtime-go the VM will most likely get 7 vCPUs. This is supported by source code reading (also linked above) and also the results of the test being discussed.
A few days later I got this genius idea: "should I look at the project's docs?"
- https://github.com/kata-containers/kata-containers/blob/main/docs/design/vcpu-handling-runtime-go.md
- https://github.com/kata-containers/kata-containers/blob/main/docs/design/vcpu-handling-runtime-rs.md
ps: I liked the runtime-go documentation better.
I also realized that the overhead (agent, etc.) is not used to compute the vCPU size.
For the initial request of 500m + 250m vCPUs, the 1 cold-plugged vCPU is enough. But the container is limited to 1 vCPU, so it should then allocate (hot-plug) one more vCPU in case the container reaches the 1 vCPU limit (+ 250m of overhead). Did I get it wrong?
Frankly I'm not sure which behaviour is (more) correct. I couldn't find a formal definition of the k8s container resource specification but from the description of how it's used it seems that request is what the container is guaranteed to get while limit is what it might be given if resources are available.
If that's correct then in the example you give I think no rules of k8s resource management would be broken if the resulting VM had just one vCPU since that comfortably accommodates the 250m + 500m requirement. Granted, it would effectively lower the container's limit to 750m but the container was never guaranteed to get anything over its request in the first place. (The same thing seems to happen when the k8s scheduler puts a pod on a node that can barely accommodate the pod's requests.)
That's my understanding anyway.
Ok, in the face of the new discoveries... I think we can do one of the following:
- keep skipping the tests, only adjusting the skip message to the actual reason the tests are disabled
- when runtime-rs takes over, re-write the tests and enable them
- adapt the tests to handle rust and go runtime implementations
Any preference?
And close this issue as "NOT A BUG".
Closing as not a bug.