CPU stuck at 100%
After updating KubeVirt from version 0.47.1 we can't work at all. Creating VMs gives us, immediately or after about a minute, "watchdog: BUG: soft lockup - CPU#75 stuck".
I ran a couple of tests: 0.47.1 works well, and the failures start with 0.48.1. My suspected cause is https://github.com/kubevirt/kubevirt/pull/6162. We run VMs with dual NUMA nodes, so maybe we hit "starvation" in such cases?
Note: I tested with Kubernetes 1.19, 1.20, 1.21, and 1.23 — same results.
Server configuration: 1 TB memory, 156 CPUs, and ~500 GB assigned to VMs.
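For context, a dual-NUMA guest of the kind described above is typically configured in KubeVirt through the `numa.guestMappingPassthrough` field that the suspected PR introduced. The sketch below is illustrative only — the name, core count, and memory sizes are assumptions, not the reporter's actual spec:

```yaml
# Illustrative sketch only -- not the reporter's actual VM spec.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: dual-numa-example        # hypothetical name
spec:
  domain:
    cpu:
      cores: 8
      dedicatedCpuPlacement: true   # required for NUMA mapping
      numa:
        guestMappingPassthrough: {} # mirror the host NUMA topology in the guest
    memory:
      hugepages:
        pageSize: 2Mi               # NUMA passthrough requires hugepages
    resources:
      requests:
        memory: 16Gi
```

With `guestMappingPassthrough`, the guest sees a NUMA topology matching the host CPUs it is pinned to, which is the code path that changed between the working and failing versions.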

Environment:
- KubeVirt version (use `virtctl version`): N/A
- Kubernetes version (use `kubectl version`): N/A
- VM or VMI specifications: N/A
- Cloud provider or hardware configuration: N/A
- OS (e.g. from /etc/os-release): N/A
- Kernel (e.g. `uname -a`): N/A
- Install tools: N/A
- Others: N/A
Can you provide the KubeVirt VM YAML, and possibly the feature gate configuration that you are running as well?
Name:         tacohen-habana-nnwm-c06-vm
Namespace:    habana
Labels:       habana.ai/is-vmi=true
              habana.ai/user=tacohen
              kubevirt.io/os=linux
Annotations:  container.apparmor.security.beta.kubernetes.io/compute: unconfined
              kubevirt.io/latest-observed-api-version: v1
              kubevirt.io/storage-observed-api-version: v1alpha3
API Version:  kubevirt.io/v1
Kind:         VirtualMachine
Metadata:
  Creation Timestamp:  2022-09-13T16:42:35Z
  Generation:          1
  Managed Fields:
    API Version:  kubevirt.io/v1alpha3
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:kubevirt.io/latest-observed-api-version:
          f:kubevirt.io/storage-observed-api-version:
      f:status:
        .:
        f:volumeSnapshotStatuses:
    Manager:      Go-http-client
    Operation:    Update
    Time:         2022-09-13T16:42:36Z
    API Version:  kubevirt.io/v1alpha3
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:created:
        f:printableStatus:
        f:ready:
    Manager:      Go-http-client
    Operation:    Update
    Subresource:  status
    Time:         2022-09-13T16:43:32Z
  Resource Version:  99877899
  UID:               8dcd1d32-3fc0-4f5e-9353-29ca8ffc7dcb
Spec:
  Run Strategy:  RerunOnFailure
  Template:
    Metadata:
      Annotations:
        container.apparmor.security.beta.kubernetes.io/compute: unconfined
        habana.ai/hlctl-version:   1.2.0
        habana.ai/qa.nightly:      false
        habana.ai/schedulable:     false
        pod-reaper/max-duration:   4h
      Creation Timestamp:
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  certificateRotateStrategy: {}
  configuration:
    developerConfiguration:
      featureGates:
        - DataVolumes
        - SRIOV
        - LiveMigration
        - CPUManager
        - ExperimentalVirtiofsSupport
        - HostDisk
        - GPU
        - NUMA
        - HostDevices
  # set node affinity for virt-api and virt-controller
  infra:
    nodePlacement:
      nodeSelector:
        habana.ai/services: "true"
  permittedHostDevices:
    pciHostDevices:
      - pciVendorSelector: 1da3:0020
        resourceName: habana.ai/greco
        externalResourceProvider: true
      - pciVendorSelector: 1da3:0030
        resourceName: habana.ai/greco
        externalResourceProvider: true
      - pciVendorSelector: 1da3:1030
        resourceName: habana.ai/gaudi
        externalResourceProvider: true
      - pciVendorSelector: 1da3:1020
        resourceName: habana.ai/gaudi
        externalResourceProvider: true
      - pciVendorSelector: 1da3:1010
        resourceName: habana.ai/gaudi
        externalResourceProvider: true
      - pciVendorSelector: 1da3:1011
        resourceName: habana.ai/gaudi
        externalResourceProvider: true
      - pciVendorSelector: 1da3:1001
        resourceName: habana.ai/gaudi
        externalResourceProvider: true
      - pciVendorSelector: 1da3:1000
        resourceName: habana.ai/gaudi
        externalResourceProvider: true
      - pciVendorSelector: 1da3:0001
        resourceName: habana.ai/goya
        externalResourceProvider: true
  customizeComponents: {}
  imagePullPolicy: IfNotPresent
  workloadUpdateStrategy: {}
@talcoh2x can you please try running the same with

spec:
  domain:
    cpu:
      dedicatedCpuPlacement: true
      isolateEmulatorThread: true

This will isolate the vCPUs from the rest of the processes in the compute container and should keep them from being interrupted.
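To put the two suggested fields in context, a minimal VMI spec using them might look like the following sketch. This assumes the CPUManager feature gate (already enabled in the configuration above) and the Kubernetes static CPU manager policy on the node; the name and sizes are illustrative:

```yaml
# Sketch only: combines the suggested fields into a minimal spec.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: pinned-cpu-example        # hypothetical name
spec:
  domain:
    cpu:
      cores: 4
      dedicatedCpuPlacement: true # pin each vCPU to a dedicated host CPU
      isolateEmulatorThread: true # give the QEMU emulator thread its own pinned CPU
    resources:
      requests:
        memory: 8Gi
```

`isolateEmulatorThread` requests one extra dedicated CPU so that emulator work cannot preempt the pinned vCPUs, which is the scenario suspected of triggering the soft lockups.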
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
/close
@kubevirt-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.