[BUG] Compute node status stays unknown after an unexpected reboot
What happened: After an unexpected reboot, the compute node's status stays unknown, and restarting default-host does not restore it. The default-host log shows these errors:
[error 2024-02-06 09:32:48 hostinfo.(*SHostInfo).prepareEnv(hostinfo.go:371)] tuned-adm profile virtual-host fail: exec: "tuned-adm": executable file not found in $PATH
[error 2024-02-06 09:32:48 fileutils2.GetAllBlkdevsIoSchedulers(fileutils.go:170)] no block device avaiable
[error 2024-02-06 09:32:48 hostinfo.(*SHostInfo).detectOsDist(hostinfo.go:746)] exit status 1
[error 2024-02-06 09:32:48 hostinfo.(*SHostInfo).detectOsDist(hostinfo.go:758)] Failed to detect distribution info
Environment:
- OS (e.g. cat /etc/os-release):
  PRETTY_NAME="Ubuntu 22.04.3 LTS"
  NAME="Ubuntu"
  VERSION_ID="22.04"
  VERSION="22.04.3 LTS (Jammy Jellyfish)"
  VERSION_CODENAME=jammy
  ID=ubuntu
  ID_LIKE=debian
  HOME_URL="https://www.ubuntu.com/"
  SUPPORT_URL="https://help.ubuntu.com/"
  BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
  PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
  UBUNTU_CODENAME=jammy
- Kernel (e.g. uname -a): 5.15.0-92-generic #102-Ubuntu SMP Wed Jan 10 09:33:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
- Host: (e.g. dmidecode | egrep -i 'manufacturer|product' |sort -u)
  Manufacturer: INTEL
  Manufacturer: Intel(R) Corporation
  Manufacturer: Kingston
  Manufacturer: Micro-Star International Co., Ltd.
  Manufacturer: To Be Filled By O.E.M.
  Memory Subsystem Controller Manufacturer ID: Unknown
  Memory Subsystem Controller Product ID: Unknown
  Module Manufacturer ID: Bank 2, Hex 0x98
  Module Product ID: Unknown
  Product Name: MS-7D48
  Product Name: PRO H610M-E DDR4 (MS-7D48)
- Service Version (e.g. kubectl exec -n onecloud $(kubectl get pods -n onecloud | grep climc | awk '{print $1}') -- climc version-list):
+------------------+----------------------------------------------+
| Field            | Value                                        |
+------------------+----------------------------------------------+
| ansible          | release/3.10(08e594819d23122604) |
| apimap           | release/3.10(08e594819d23122604) |
| cloudevent       | release/3.10(08e594819d23122604) |
| cloudid          | release/3.10(08e594819d23122604) |
| cloudmeta        | {"error":{"class":"DNSError","code":499,"details":"Get "https://meta.yunion.cn/version": dial tcp: lookup meta.yunion.cn on 10.158.158.10:53: no such host","request":{"headers":{"User-Agent":"yunioncloud-go/201708","X-Auth-Token":""},"method":"GET","url":"https://meta.yunion.cn/version"}}} |
| cloudmon         | release/3.10(08e594819d23122604) |
| cloudproxy       | release/3.10(08e594819d23122604) |
| compute_v2       | release/3.10(08e594819d23122604) |
| devtool          | release/3.10(08e594819d23122604) |
| dns              | {"error":{"class":"ClientError","code":499,"details":"Get "10.158.158.10/version": unsupported protocol scheme ""","request":{"headers":{"User-Agent":"yunioncloud-go/201708","X-Auth-Token":""},"method":"GET","url":"10.158.158.10/version"}}} |
| etcd             | {"etcdserver":"3.4.6","etcdcluster":"3.4.0"} |
| identity         | release/3.10(08e594819d23122604) |
| image            | release/3.10(08e594819d23122604) |
| k8s              | heads/v3.10.10-20231225.1(c2602e0223122603) |
| log              | release/3.10(08e594819d23122604) |
| monitor          | release/3.10(08e594819d23122604) |
| notify           | release/3.10(08e594819d23122604) |
| ntp              | {"error":{"class":"ClientError","code":499,"details":"Get "10.158.158.11/version": unsupported protocol scheme ""","request":{"headers":{"User-Agent":"yunioncloud-go/201708","X-Auth-Token":""},"method":"GET","url":"10.158.158.11/version"}}} |
| scheduledtask    | release/3.10(08e594819d23122604) |
| scheduler        | release/3.10(08e594819d23122604) |
| torrent-tracker  | {"error":{"class":"DNSError","code":499,"details":"Get "https://tracker.yunion.cn/version": dial tcp: lookup tracker.yunion.cn on 10.158.158.10:53: no such host","request":{"headers":{"User-Agent":"yunioncloud-go/201708","X-Auth-Token":""},"method":"GET","url":"https://tracker.yunion.cn/version"}}} |
| victoria-metrics | remoteAddr: "10.158.159.8:17776"; requestURI: /version; unsupported path requested: "/version" |
| vpcagent         | release/3.10(08e594819d23122604) |
| webconsole       | release/3.10(08e594819d23122604) |
| yunionconf       | release/3.10(08e594819d23122604) |
+------------------+----------------------------------------------+
After installing tuned and restarting default-host, the node status returned to normal:
sudo apt install tuned tuned-utils tuned-utils-systemtap
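The `executable file not found in $PATH` error can be caught before the host agent starts. The following is a minimal sketch, not part of Cloudpods: `missing_tools` is a hypothetical helper, and `tuned-adm` is the only tool name taken from the log in this report.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every listed command
# that cannot be found on $PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# tuned-adm is the executable named in the default-host error;
# add other tool names here if the agent reports them missing.
missing_tools tuned-adm
```

An empty result means every listed tool was found; any printed name is a package to install before restarting default-host.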
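After the compute node status recovered, default-host reported new errors and the virtual machines on it would not start. The compute node is configured for PCI passthrough of a GPU.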
[error 2024-02-06 09:53:38 fileutils2.GetAllBlkdevsIoSchedulers(fileutils.go:170)] no block device avaiable
[error 2024-02-06 09:53:38 hostinfo.(*SHostInfo).detectOsDist(hostinfo.go:746)] exit status 1
[error 2024-02-06 09:53:38 hostinfo.(*SHostInfo).detectOsDist(hostinfo.go:758)] Failed to detect distribution info
[error 2024-02-06 09:53:41 hostinfo.(*SHostInfo).PutHostOnline(hostinfo.go:1552)] Host sys error: map[isolated_devices:[{isolated_devices GPU 00:02.0 use kernel driver i915, skip it 2024-02-06 09:53:41.046983119 +0000 UTC m=+3.911778443} {isolated_devices 01:00.0 GeForce RTX 2060 SUPER CustomProbe failed bind driver: write /sys/bus/pci/drivers/vfio-pci/new_id: file exists 2024-02-06 09:53:41.066664282 +0000 UTC m=+3.931459602}]]
[error 2024-02-06 09:58:33 guestman.(*SKVMGuestInstance).StartMonitor(qemu-kvm.go:824)] Guest 66dc0986-8d54-41d0-8157-06cc0687177f start monitor failed, can't get qmp monitor port or monitor path
[error 2024-02-06 09:58:34 guestman.(*SKVMGuestInstance).onMonitorDisConnect(qemu-kvm.go:1370)] Guest 66dc0986-8d54-41d0-8157-06cc0687177f on Monitor Disconnect reason: read tcp 127.0.0.1:53578->127.0.0.1:56101: read: connection reset by peer
@sun3book It looks like binding the vfio driver failed. Has the host been rebooted since deployment? Is any other driver attached to the GPU?
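The host has been rebooted, and there is no other driver. It looks a bit like a conflict between the integrated GPU and the discrete GPU.
The log shows it handling the integrated GPU for passthrough: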
[info 2024-02-06 10:32:07 isolated_device.(*isolatedDeviceManager).probeCustomPCIDevs(isolated_device.go:184)] Add general pci device: 0 => &isolated_device.sGeneralPCIDevice{sBaseDevice:(*isolated_device.sBaseDevice)(0xc001524640)}
[info 2024-02-06 10:32:07 isolated_device.getPassthroughGPUS(gpu.go:75)] filter address [01:00.0]
[info 2024-02-06 10:32:07 isolated_device.(*PCIDevice).IsBootVGA(gpu.go:307)] PCI address 00:02.0 is boot_vga: /sys/devices/pci0000:00/0000:00:02.0/boot_vga
[info 2024-02-06 10:32:07 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:325)] &isolated_device.PCIDevice{Addr:"00:02.0", ClassName:"VGA compatible controller", ClassCode:"0300", VendorName:"Intel Corporation", VendorId:"8086", DeviceName:"Device", DeviceId:"4692", SubvendorName:"Micro-Star International Co., Ltd. [MSI]", SubvendorId:"1462", SubdeviceName:"Device", SubdeviceId:"7d48", ModelName:"", RestIOMMUGroupDevs:[]*isolated_device.PCIDevice(nil)} is boot vga card, skip it
[warning 2024-02-06 10:32:07 isolated_device.getPassthroughGPUS(gpu.go:102)] GPU {"bus_id":"00:02.0","class_code":"0300","class_name":"VGA compatible controller","device_id":"4692","device_name":"Device","subdevice_id":"7d48","subdevice_name":"Device","subvendor_id":"1462","subvendor_name":"Micro-Star International Co., Ltd. [MSI]","vendor_id":"8086","vendor_name":"Intel Corporation"} use kernel driver "i915", skip it
@sun3book These log lines mean the integrated GPU was skipped. What failed is binding your NVIDIA card to the vfio driver:
isolated_devices 01:00.0 GeForce RTX 2060 SUPER CustomProbe failed bind driver: write /sys/bus/pci/drivers/vfio-pci/new_id
Check whether the host's /proc/cmdline includes the vfio-related parameters:
rdblacklist=nouveau vfio_iommu_type1.allow_unsafe_interrupts=1 intel_iommu=on quiet iommu=pt nouveau.modeset=0 hugepagesz=1G default_hugepagesz=1G mgag200.modeset=0
/proc/cmdline
root@zhcx-cloudpods-worker01:~# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.15.0-92-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro systemd.unified_cgroup_hierarchy=0 hugepagesz=1G default_hugepagesz=1G
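Comparing a kernel command line against an expected parameter list is easy to script. This is a rough sketch under stated assumptions: `missing_params` is a hypothetical helper, and the parameter list passed to it is taken from this thread, not from any authoritative Cloudpods requirement.

```shell
#!/bin/sh
# Hypothetical helper: given a kernel command line string and a list of
# parameters, print each parameter that is not present as a whole word.
missing_params() {
    cmdline=$1
    shift
    for param in "$@"; do
        case " $cmdline " in
            *" $param "*) ;;                 # present, nothing to report
            *) printf '%s\n' "$param" ;;     # absent
        esac
    done
}

# Check the running kernel (Linux only).
if [ -r /proc/cmdline ]; then
    missing_params "$(cat /proc/cmdline)" \
        intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1
fi
```

Matching on whole words matters here: `hugepagesz=1G` must not be reported as present just because `default_hugepagesz=1G` is.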
Add these parameters in grub, then try restarting the VM:
vfio_iommu_type1.allow_unsafe_interrupts=1 intel_iommu=on quiet iommu=pt nouveau.modeset=0
nano /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
update-grub
echo -e "vfio\nvfio_iommu_type1\nvfio_pci\nvfio_virqfd" >> /etc/modules
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
update-initramfs -u -k all
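Because `>>` appends unconditionally, running the two `echo` commands above a second time would leave duplicate lines in `/etc/modules` and `blacklist.conf`. A minimal sketch of an idempotent variant follows; `append_once` is a hypothetical helper, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if that exact line
# is not already there. grep flags: -q quiet, -x whole line, -F literal.
append_once() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage as root, mirroring the commands above:
#   for m in vfio vfio_iommu_type1 vfio_pci vfio_virqfd; do
#       append_once /etc/modules "$m"
#   done
#   append_once /etc/modprobe.d/blacklist.conf "blacklist nouveau"
```

The whole-line match keeps `vfio` from being treated as already present when only `vfio_pci` is listed.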
After the reboot, the vGPU worked normally, but the compute node never became Ready.
NAME STATUS ROLES AGE VERSION
zhcx-cloudpods-worker01 NotReady
systemctl show --property=Environment kubelet | cat
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --cgroup-driver=systemd" KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml
Change kubelet's cgroup driver: edit /etc/systemd/system/kubelet.service.d/10-kubeadm.conf and add --cgroup-driver=systemd (systemd is the officially recommended driver):
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --cgroup-driver=systemd"
systemctl daemon-reload
systemctl restart kubelet
These steps did not resolve the problem either.
The error output is:
Feb 07 02:54:29 zhcx-cloudpods-worker01 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Feb 07 02:54:40 zhcx-cloudpods-worker01 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 69.
Feb 07 02:54:40 zhcx-cloudpods-worker01 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Feb 07 02:54:40 zhcx-cloudpods-worker01 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Feb 07 02:54:40 zhcx-cloudpods-worker01 kubelet[10210]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Feb 07 02:54:40 zhcx-cloudpods-worker01 kubelet[10210]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Feb 07 02:54:40 zhcx-cloudpods-worker01 kubelet[10210]: Flag --resolv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Feb 07 02:54:40 zhcx-cloudpods-worker01 kubelet[10210]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Feb 07 02:54:40 zhcx-cloudpods-worker01 kubelet[10210]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Feb 07 02:54:40 zhcx-cloudpods-worker01 kubelet[10210]: Flag --resolv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Feb 07 02:54:40 zhcx-cloudpods-worker01 kubelet[10210]: I0207 02:54:40.193491 10210 server.go:425] Version: v1.15.12
Feb 07 02:54:40 zhcx-cloudpods-worker01 kubelet[10210]: I0207 02:54:40.193605 10210 plugins.go:103] No cloud provider specified.
Feb 07 02:54:40 zhcx-cloudpods-worker01 kubelet[10210]: I0207 02:54:40.193611 10210 server.go:789] Client rotation is on, will bootstrap in background
Feb 07 02:54:40 zhcx-cloudpods-worker01 kubelet[10210]: I0207 02:54:40.199943 10210 certificate_store.go:129] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-current.pem".
Feb 07 02:54:40 zhcx-cloudpods-worker01 kubelet[10210]: W0207 02:54:40.200318 10210 server.go:628] failed to get the kubelet's cgroup: mountpoint for cpu not found. Kubelet system container metrics may be missing.
Feb 07 02:54:40 zhcx-cloudpods-worker01 kubelet[10210]: W0207 02:54:40.200340 10210 server.go:635] failed to get the container runtime's cgroup: failed to get container name for docker process: mountpoint for cpu not found. Runtime system container metrics may be missing.
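The deprecation warnings say the cgroup driver should come from the file passed via `--config`. A small sketch for reading the current value out of that file is shown here. `kubelet_cgroup_driver` is a hypothetical helper, and it assumes `cgroupDriver` appears as a plain top-level key without quotes or trailing comments.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the top-level cgroupDriver key
# from a kubelet config file. Prints nothing if the key is absent.
kubelet_cgroup_driver() {
    sed -n 's/^cgroupDriver:[[:space:]]*//p' "$1"
}

# /var/lib/kubelet/config.yaml is the path shown in KUBELET_CONFIG_ARGS.
if [ -r /var/lib/kubelet/config.yaml ]; then
    kubelet_cgroup_driver /var/lib/kubelet/config.yaml
fi

# The runtime side can be compared with:
#   docker info --format '{{.CgroupDriver}}'
```

If the two values differ, or the key is missing from the config file, that is worth fixing in the file itself instead of through the deprecated flag.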
After adding cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1 swapaccount=1 systemd.unified_cgroup_hierarchy=0 to GRUB_CMDLINE_LINUX in /etc/default/grub, all services start normally, but the node status is still unknown.
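`systemd.unified_cgroup_hierarchy=0` asks systemd to mount the legacy cgroup v1 layout, which would be consistent with the earlier "mountpoint for cpu not found" warnings going away. One common way to confirm which layout is mounted is the filesystem type of `/sys/fs/cgroup`. The sketch below assumes GNU `stat`; `cgroup_mode` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: map the filesystem type of /sys/fs/cgroup
# to the cgroup layout it indicates.
cgroup_mode() {
    case "$1" in
        cgroup2fs) echo "v2 (unified)" ;;
        tmpfs)     echo "v1 (legacy or hybrid)" ;;
        *)         echo "unknown" ;;
    esac
}

# GNU stat: -f queries the filesystem, -c %T prints its type name.
cgroup_mode "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)"
```

On this host the expected answer after the grub change is the v1 line; a v2 result would suggest the kernel parameter did not take effect.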
@sun3book Check the region service logs. Do they look normal?
kubectl -n onecloud get pods -l app=region
If you do not provide feedback for more than 37 days, we will close the issue and you can either reopen it or submit a new issue.