
CPU stuck at 100%

talcoh2x opened this issue 3 years ago • 4 comments

After updating KubeVirt from version 0.47.1 we can't work at all. Creating VMs fails immediately, or after about a minute, with "watchdog: BUG: soft lockup - CPU#75 stuck".

I ran a couple of tests: 0.47.1 works well, and the failures start with 0.48.1. My suspect is https://github.com/kubevirt/kubevirt/pull/6162. We run VMs across dual NUMA nodes, so maybe we hit "starvation" in such cases?

Note: I tested with k8s 1.19, 1.20, 1.21, and 1.23; same results.

Server configuration: 1 TB memory, 156 CPUs, with ~500 GB assigned to VMs.
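As a diagnostic aside, the CPU number in such watchdog messages can be extracted from the kernel log with a small shell sketch. The sample line below mirrors the message quoted above; on a live host you would feed in `dmesg` output instead (root access to the kernel log is assumed):

```shell
# Sketch: extract which CPU the soft-lockup watchdog reported as stuck.
# On a real host, replace the sample line with:  dmesg | grep 'soft lockup'
line='watchdog: BUG: soft lockup - CPU#75 stuck for 22s!'
cpu=$(printf '%s\n' "$line" | sed -n 's/.*CPU#\([0-9]*\).*/\1/p')
echo "stuck CPU: $cpu"
```

Cross-referencing the reported CPU against the VM's pinned vCPU set helps tell host housekeeping lockups apart from guest-owned CPUs.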


Environment:

  • KubeVirt version (use virtctl version): N/A
  • Kubernetes version (use kubectl version): N/A
  • VM or VMI specifications: N/A
  • Cloud provider or hardware configuration: N/A
  • OS (e.g. from /etc/os-release): N/A
  • Kernel (e.g. uname -a): N/A
  • Install tools: N/A
  • Others: N/A

talcoh2x avatar Sep 09 '22 11:09 talcoh2x

Can you provide the kubevirt vm yaml, and possibly the feature gate configuration that you are running as well?

usrbinkat avatar Sep 14 '22 14:09 usrbinkat

```
Name:         tacohen-habana-nnwm-c06-vm
Namespace:    habana
Labels:       habana.ai/is-vmi=true
              habana.ai/user=tacohen
              kubevirt.io/os=linux
Annotations:  container.apparmor.security.beta.kubernetes.io/compute: unconfined
              kubevirt.io/latest-observed-api-version: v1
              kubevirt.io/storage-observed-api-version: v1alpha3
API Version:  kubevirt.io/v1
Kind:         VirtualMachine
Metadata:
  Creation Timestamp:  2022-09-13T16:42:35Z
  Generation:          1
  Managed Fields:
    API Version:  kubevirt.io/v1alpha3
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:kubevirt.io/latest-observed-api-version:
          f:kubevirt.io/storage-observed-api-version:
      f:status:
        .:
        f:volumeSnapshotStatuses:
    Manager:      Go-http-client
    Operation:    Update
    Time:         2022-09-13T16:42:36Z
    API Version:  kubevirt.io/v1alpha3
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:created:
        f:printableStatus:
        f:ready:
    Manager:      Go-http-client
    Operation:    Update
    Subresource:  status
    Time:         2022-09-13T16:43:32Z
  Resource Version:  99877899
  UID:               8dcd1d32-3fc0-4f5e-9353-29ca8ffc7dcb
Spec:
  Run Strategy:  RerunOnFailure
  Template:
    Metadata:
      Annotations:
        container.apparmor.security.beta.kubernetes.io/compute: unconfined
        habana.ai/hlctl-version: 1.2.0
        habana.ai/qa.nightly: false
        habana.ai/schedulable: false
        pod-reaper/max-duration: 4h
      Creation Timestamp:
      Labels:
        habana.ai/schedulable: false
        habana.ai/user: tacohen
        Service: tacohen-habana-nnwm-c06-service
        Vmi: tacohen-habana-nnwm-c06-vm
    Spec:
      Affinity:
        Node Affinity:
          Required During Scheduling Ignored During Execution:
            Node Selector Terms:
              Match Expressions:
                Key:       kubernetes.io/hostname
                Operator:  In
                Values:
                  hls2-srv65-c06e-kfs
        Pod Anti Affinity:
          Required During Scheduling Ignored During Execution:
            Label Selector:
              Match Expressions:
                Key:       habana.ai/is-container
                Operator:  In
                Values:
                  true
            Topology Key:  kubernetes.io/hostname
      Domain:
        Cpu:
          Cores:                    39
          Dedicated Cpu Placement:  true
          Model:                    host-passthrough
          Sockets:                  2
          Threads:                  2
        Devices:
          Block Multi Queue:  true
          Disks:
            Dedicated IO Thread:  true
            Disk:
              Bus:  virtio
            Name:   localdisk
            Disk:
              Bus:  virtio
            Name:   cloud-init
            Disk:
              Bus:   virtio
            Name:    app-config-disk
            Serial:  kubedisk
          Filesystems:
            Name:  disk0
            Virtiofs:
            Name:  disk1
            Virtiofs:
          Gpus:
            Device Name:  habana.ai/gaudi
            Name:         gpu0
            Device Name:  habana.ai/gaudi
            Name:         gpu1
            Device Name:  habana.ai/gaudi
            Name:         gpu2
            Device Name:  habana.ai/gaudi
            Name:         gpu3
            Device Name:  habana.ai/gaudi
            Name:         gpu4
            Device Name:  habana.ai/gaudi
            Name:         gpu5
            Device Name:  habana.ai/gaudi
            Name:         gpu6
            Device Name:  habana.ai/gaudi
            Name:         gpu7
          Interfaces:
            Name:  sriov-net
            Sriov:
            Name:  sriov-net1
            Sriov:
            Name:  sriov-net2
            Sriov:
            Name:  sriov-net3
            Sriov:
            Name:  sriov-net4
            Sriov:
          Network Interface Multiqueue:  true
        Io Threads Policy:               auto
        Machine:
          Type:  q35
        Memory:
          Guest:  480Gi
        Resources:
          Requests:
            Memory:  520Gi
      Networks:
        Multus:
          Network Name:  habana/sriov-net
        Name:            sriov-net
        Multus:
          Network Name:  habana/sriov-net-x5
        Name:            sriov-net1
        Multus:
          Network Name:  habana/sriov-net-x5
        Name:            sriov-net2
        Multus:
          Network Name:  habana/sriov-net-x5
        Name:            sriov-net3
        Multus:
          Network Name:  habana/sriov-net-x5
        Name:            sriov-net4
      Node Selector:
        habana.ai/qa.nightly: false
        habana.ai/schedulable: false
      Scheduler Name:                    most-allocated-scheduler
      Termination Grace Period Seconds:  0
      Volumes:
        Config Map:
          Name:  kube-config
        Name:    app-config-disk
        Name:    localdisk
        Persistent Volume Claim:
          Claim Name:  tacohen-habana-nnwm-c06-pvc
        Name:          disk0
        Persistent Volume Claim:
          Claim Name:  ccache-volume-pvc
        Name:          disk1
        Persistent Volume Claim:
          Claim Name:  hostname-volume-pvc
        Cloud Init No Cloud:
          User Data:  #cloud-config
                      hostname: tacohen-habana-nnwm-c06-vm
```

talcoh2x avatar Sep 14 '22 16:09 talcoh2x


```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  certificateRotateStrategy: {}
  configuration:
    developerConfiguration:
      featureGates:
        - DataVolumes
        - SRIOV
        - LiveMigration
        - CPUManager
        - ExperimentalVirtiofsSupport
        - HostDisk
        - GPU
        - NUMA
        - HostDevices
```

We set node affinity for virt-api and virt-controller:

```yaml
  infra:
    nodePlacement:
      nodeSelector:
        habana.ai/services: "true"
  permittedHostDevices:
    pciHostDevices:
      - pciVendorSelector: 1da3:0020
        resourceName: habana.ai/greco
        externalResourceProvider: true
      - pciVendorSelector: 1da3:0030
        resourceName: habana.ai/greco
        externalResourceProvider: true
      - pciVendorSelector: 1da3:1030
        resourceName: habana.ai/gaudi
        externalResourceProvider: true
      - pciVendorSelector: 1da3:1020
        resourceName: habana.ai/gaudi
        externalResourceProvider: true
      - pciVendorSelector: 1da3:1010
        resourceName: habana.ai/gaudi
        externalResourceProvider: true
      - pciVendorSelector: 1da3:1011
        resourceName: habana.ai/gaudi
        externalResourceProvider: true
      - pciVendorSelector: 1da3:1001
        resourceName: habana.ai/gaudi
        externalResourceProvider: true
      - pciVendorSelector: 1da3:1000
        resourceName: habana.ai/gaudi
        externalResourceProvider: true
      - pciVendorSelector: 1da3:0001
        resourceName: habana.ai/goya
        externalResourceProvider: true
  customizeComponents: {}
  imagePullPolicy: IfNotPresent
  workloadUpdateStrategy: {}
```

talcoh2x avatar Sep 14 '22 16:09 talcoh2x

@talcoh2x can you please try running the same with

spec:
  domain:
    cpu:
      dedicatedCpuPlacement: true
      isolateEmulatorThread: true

This will isolate the vCPUs from the rest of the processes in the compute container, so they should not be interrupted.
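Applied to the VM above, the suggestion would look roughly like this (a sketch only, showing just the CPU section of the VM template; `isolateEmulatorThread` requires `dedicatedCpuPlacement` and pins the QEMU emulator thread to its own housekeeping CPU, away from the vCPUs):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
spec:
  template:
    spec:
      domain:
        cpu:
          cores: 39
          sockets: 2
          threads: 2
          model: host-passthrough
          dedicatedCpuPlacement: true
          # New: give the emulator thread its own dedicated CPU
          isolateEmulatorThread: true
```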

vladikr avatar Sep 18 '22 09:09 vladikr

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot avatar Dec 17 '22 09:12 kubevirt-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

kubevirt-bot avatar Jan 16 '23 10:01 kubevirt-bot

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

kubevirt-bot avatar Feb 15 '23 10:02 kubevirt-bot

@kubevirt-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kubevirt-bot avatar Feb 15 '23 10:02 kubevirt-bot