nfd-topology-updater misses some CPUs for Guaranteed pods
What happened:
nfd-topology-updater reports the node's NUMA topology and the CPUs used by Guaranteed pods. However, it misses the exclusive CPUs of a Guaranteed pod when any of the pod's containers has no exclusive CPUs.
- Node
yj-kubevirtwork-001's kubelet is configured with the single-numa-node topology-manager policy and the static cpu-manager policy:
--cpu-manager-policy=static --topology-manager-policy=single-numa-node
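For reference, the same policies can also be set through the kubelet configuration file instead of command-line flags. A minimal sketch showing only the two relevant fields:

```yaml
# KubeletConfiguration fragment equivalent to the flags above
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
topologyManagerPolicy: single-numa-node
```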
I have two pods on the node:
root@yj-kubevirtmaster-01 ~ # kubectl get po -owide nginx-deployment-68589dd8dc-kzvzh virt-launcher-kubevirttesttest1-p545b
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-deployment-68589dd8dc-kzvzh 1/1 Running 0 2d16h 10.106.66.199 yj-kubevirtwork-001 <none> <none>
virt-launcher-kubevirttesttest1-p545b 3/3 Running 0 4h20m 10.106.66.177 yj-kubevirtwork-001 <none> 1/1
The CPUs allocated by the cpu-manager on the node:
cat /var/lib/kubelet/cpu_manager_state | jq
{
"policyName": "static",
"defaultCpuSet": "1,3,24-25,48-51,53,55,76-77,100-103",
"entries": {
"609526ad-0de0-4543-8188-48f872576fc9": {
"compute": "26-34,78-86"
},
"64dc405a-1a9c-4029-ba5f-778fe5948a10": {
"compute": "0,2,4-9,52,54,56-61"
},
"8310086e-bc8b-4ab3-ac6e-ccc1129fecec": {
"compute": "35-47,87-99"
},
"98f43841-5488-49d7-9f79-29fa68791316": {
"nginx": "14-15,66-67"
},
"b4fd6f12-9cf5-459e-8fdb-3058a2ae22a3": {
"compute": "10-13,16-23,62-65,68-75"
}
},
"checksum": 2485504922
}
But in the noderesourcetopologies CR, only nginx's CPU set is reported; the kubevirt virt-launcher's CPUs are not:
kubectl get noderesourcetopologies yj-kubevirtwork-001 -oyaml
apiVersion: topology.node.k8s.io/v1alpha2
attributes:
- name: topologyManagerPolicy
value: single-numa-node
- name: topologyManagerScope
value: container
- name: nodeTopologyPodsFingerprint
value: pfp0v001828baea44880d152
kind: NodeResourceTopology
metadata:
creationTimestamp: "2024-12-05T13:13:40Z"
generation: 66528
name: yj-kubevirtwork-001
ownerReferences:
- apiVersion: v1
kind: Namespace
name: node-feature-discovery
uid: 4072a5ce-0634-447e-ab31-dd3f4bc7abac
resourceVersion: "27862374"
uid: dd26aaf4-3f38-4708-bb86-89305475a715
topologyPolicies:
- SingleNUMANodeContainerLevel
zones:
- costs:
- name: node-0
value: 21
- name: node-1
value: 10
name: node-1
resources:
- allocatable: "52"
available: "52"
capacity: "52"
name: cpu
- allocatable: "201847795712"
available: "157778243583"
capacity: "201847795712"
name: memory
type: Node
- costs:
- name: node-0
value: 10
- name: node-1
value: 21
name: node-0
resources:
- allocatable: "50"
available: "46"
capacity: "52"
name: cpu
- allocatable: "183845359616"
available: "130691427327"
capacity: "201130086400"
name: memory
type: Node
In the nfd-topology-updater logs:
I1213 06:06:41.791096 1 nfd-topology-updater.go:235] "received updated pod resources" podResources=<
- Containers:
- Name: nginx
Resources:
- Data:
- "66"
- "67"
- "14"
- "15"
Name: cpu
NumaNodeIds: null
- Data:
- "8589934592"
Name: memory
NumaNodeIds:
- 0
Name: nginx-deployment-68589dd8dc-kzvzh
Namespace: default
- Containers:
- Name: compute
Resources:
- Data:
- vhost-net172
Name: devices.kubevirt.io/vhost-net
NumaNodeIds: null
- Data:
- kvm760
Name: devices.kubevirt.io/kvm
NumaNodeIds: null
- Data:
- tun96
Name: devices.kubevirt.io/tun
NumaNodeIds: null
- Data:
- "9141485568"
Name: memory
NumaNodeIds:
- 0
- Name: volumecontainerdisk
Resources:
- Data:
- "40000000"
Name: memory
NumaNodeIds:
- 0
- Name: guest-console-log
Resources:
- Data:
- "60000000"
Name: memory
NumaNodeIds:
- 0
Name: virt-launcher-kubevirttesttest2-8dnfj
Namespace: default
>
nfd-topology-updater did not get the CPUs of the virt-launcher pod's compute container.
What you expected to happen:
nfd-topology-updater should report the exclusive CPUs of all pod containers.
How to reproduce it (as minimally and precisely as possible):
For example, create a pod Pod-a with 2 containers:
containers:
- name: ctr-a
resources:
limits:
cpu: 5
requests:
cpu: 5
- name: ctr-b
resources:
limits:
cpu: 500m
requests:
cpu: 500m
The pod is still a Guaranteed pod and is allocated exclusive CPUs by the kubelet's cpu-manager, but nfd-topology-updater will not report its CPUs in the topology CR.
Anything else we need to know?:
I think this bug is caused by the hasExclusiveCPUs func:
https://github.com/kubernetes-sigs/node-feature-discovery/blob/141607269983adbf13f7f84b3f02314e49c3d278/pkg/resourcemonitor/podresourcesscanner.go#L78-L107
If the pod has any container that does not request CPU as an integer, it returns false.
https://github.com/kubernetes-sigs/node-feature-discovery/blob/141607269983adbf13f7f84b3f02314e49c3d278/pkg/resourcemonitor/podresourcesscanner.go#L168-L183
This causes nfd-topology-updater to skip the loop over all of this pod's containers, so none of their CPUs are reported.
Environment:
- Kubernetes version (use kubectl version):
- Cloud provider or hardware configuration:
- OS (e.g: cat /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Network plugin and version (if this is a network-related bug):
- Others:
@AllenXu93 it is a deliberate decision/feature to only count exclusively allocated CPUs (of Guaranteed pods). Could you open up your usage scenario(s) a bit for counting also cpus in the shared pool? Maybe we could have a config option/cmdline flag to enable this.
@PiotrProkop what are your thoughts on this?
EDIT: @AllenXu93 see https://kubernetes-sigs.github.io/node-feature-discovery/stable/usage/nfd-topology-updater.html
This PR still only counts exclusively allocated CPUs; that doesn't change. But "exclusively allocated CPUs" does not mean that all of a pod's containers have exclusive CPUs. For example, if a pod has two containers, one with 2-core CPU requests and limits and one with 500m CPU requests and limits, the pod is still a Guaranteed pod, and the first container gets exclusive CPUs as expected. But currently topo-updater will skip this pod's CPU report entirely.
Ah yes, @AllenXu93 I read the description too hastily, not paying attention to this detail.
I think what you describe makes sense (i.e. nfd-topology-updater should report/count exclusively allocated cpus for pods, even if some of the containers within the pod use shared cpus).
WDYT @PiotrProkop @ffromani, something we're missing here?
I think there's a good point here. Need to review the logic and count all the exclusively allocated CPUs
In fact, I use KubeVirt to manage VMs with Kubernetes, and all KubeVirt pods are created this way: the VM container requests integral CPUs, while many other containers and init containers request only 200m CPU.
But the VM container's CPUs are allocated within one NUMA node when the kubelet enables single-numa-node, so nfd-topology-updater missing them causes errors when scheduling other pods.
I'm a bit unsure about the QoS here and I need to review (again) the rules, but I totally agree that all the containers which have exclusive CPUs allocated in guaranteed QoS pods should be reported
Of course. I learned from the docs https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/#create-a-pod-that-gets-assigned-a-qos-class-of-guaranteed
In my opinion: a pod is Guaranteed when every container in the pod has CPU and memory limits and requests, and the limits equal the requests.
@AllenXu93 you made good points; there could be an actual bug in the area you identified. We can use this issue to add more unit tests. I'll try to have a look ASAP, but the European holiday season is incoming. Others are welcome to chime in; I'll surely review.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
I think this issue is still valid/relevant
/reopen
/remove-lifecycle rotten
@marquiz: Reopened this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.