cloudpods
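[BUG] Compute node cannot enable hugepages, causing default-host to fail to start
What happened: After installing CentOS on a machine, I forgot to unplug the USB drive. After booting, the machine failed to join as a compute node. Error log from the host service in default-host-*: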
[error 2023-07-20 04:59:11 fileutils2.GetAllBlkdevsIoSchedulers(fileutils.go:170)] no block device avaiable
>kubectl -n onecloud logs default-host-pchvq -c host
[info 230720 07:10:40 procutils.WaitZombieLoop(zombie_others.go:36)] My pid is not 1 and no need to wait zombies
[info 230720 07:10:40 options.ParseOptions(options.go:314)] Use configuration file: /etc/yunion/host.conf
[warning 230720 07:10:40 structarg.(*ArgumentParser).parseJSONKeyValue(structarg.go:1214)] Cannot find argument enable-qmp-monitor
[warning 230720 07:10:40 structarg.(*ArgumentParser).parseJSONKeyValue(structarg.go:1214)] Cannot find argument health-driver
[warning 230720 07:10:40 structarg.(*ArgumentParser).parseJSONKeyValue(structarg.go:1214)] Cannot find argument enable-health-checker
[warning 230720 07:10:40 structarg.(*ArgumentParser).parseJSONKeyValue(structarg.go:1214)] Cannot find argument disk-is-ssd
[warning 230720 07:10:40 structarg.(*ArgumentParser).parseJSONKeyValue(structarg.go:1214)] Cannot find argument enable-rbac
[info 230720 07:10:40 options.ParseOptions(options.go:336)] Set log level to "info"
[info 2023-07-20 07:10:40 options.ParseOptions(options.go:314)] Use configuration file: /etc/yunion/common/common.conf
[info 2023-07-20 07:10:40 options.ParseOptions(options.go:336)] Set log level to "info"
[info 2023-07-20 07:10:40 hostman.(*SHostService).InitService(host_services.go:63)] exec socket path: /var/run/onecloud/exec.sock
[info 2023-07-20 07:10:40 app.InitApp(app.go:32)] RequestWorkerCount: 8
[info 2023-07-20 07:10:40 appsrv.NewApplication(appsrv.go:116)] App hostId: upe8BgoNNfnD6KNuEDdLK1yiuo4= (host,sz-node-8-5,192.168.8.5)
2023/07/20 07:10:40 Allow hosts []
[info 2023-07-20 07:10:40 appsrv.(*Application).SetDefaultTimeout(appsrv.go:132)] adjust application default timeout to 60.000000 seconds
[info 2023-07-20 07:10:40 hostinfo.DetectCpuInfo(hostinfohelper.go:77)] cpuinfo freq 2420
[info 2023-07-20 07:10:40 hostinfo.NewHostInfo(hostinfo.go:2243)] CPU Model Intel(R) Xeon(R) CPU @ 2.27GHz Microcode 0x1d
[info 2023-07-20 07:10:41 hostinfo.NewHostInfo(hostinfo.go:2264)] Get kubelet container image Fs: /opt/docker, eviction config: {"evictionHard":{"imagefs.available":{"Signal":"imagefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05}},"memory.available":{
"Signal":"memory.available","Operator":"LessThan","Value":{"Quantity":"100Mi","Percentage":0}},"nodefs.available":{"Signal":"nodefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05}},"nodefs.inodesFree":{"Signal":"nodefs.inodesFree","Operator":"LessThan",
"Value":{"Quantity":null,"Percentage":0.05}}}}
[error 2023-07-20 07:10:42 fileutils2.GetAllBlkdevsIoSchedulers(fileutils.go:170)] no block device avaiable
[info 2023-07-20 07:10:42 hostinfo.(*SHostInfo).prepareEnv(hostinfo.go:404)] I/O Scheduler switch to none
[fatal 2023-07-20 07:10:42 hostman.(*SHostService).RunService(host_services.go:79)] Host instance init error: Prepare environment: hugepage 1024 nr 0
>mount |grep -vE 'docker|cgroup|tmpfs'
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
configfs on /sys/kernel/config type configfs (rw,relatime)
/dev/mapper/centos-root on / type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=32,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=27952)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
/dev/sdb2 on /boot type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime)
sdc is the USB drive.
>ls -l /sys/block/
total 0
lrwxrwxrwx 1 root root 0 Jul 20 12:08 dm-0 -> ../devices/virtual/block/dm-0
lrwxrwxrwx 1 root root 0 Jul 20 12:40 nbd0 -> ../devices/virtual/block/nbd0
lrwxrwxrwx 1 root root 0 Jul 20 12:40 nbd1 -> ../devices/virtual/block/nbd1
lrwxrwxrwx 1 root root 0 Jul 20 12:40 nbd10 -> ../devices/virtual/block/nbd10
lrwxrwxrwx 1 root root 0 Jul 20 12:40 nbd11 -> ../devices/virtual/block/nbd11
lrwxrwxrwx 1 root root 0 Jul 20 12:40 nbd12 -> ../devices/virtual/block/nbd12
lrwxrwxrwx 1 root root 0 Jul 20 12:40 nbd13 -> ../devices/virtual/block/nbd13
lrwxrwxrwx 1 root root 0 Jul 20 12:40 nbd14 -> ../devices/virtual/block/nbd14
lrwxrwxrwx 1 root root 0 Jul 20 12:40 nbd15 -> ../devices/virtual/block/nbd15
lrwxrwxrwx 1 root root 0 Jul 20 12:40 nbd2 -> ../devices/virtual/block/nbd2
lrwxrwxrwx 1 root root 0 Jul 20 12:40 nbd3 -> ../devices/virtual/block/nbd3
lrwxrwxrwx 1 root root 0 Jul 20 12:40 nbd4 -> ../devices/virtual/block/nbd4
lrwxrwxrwx 1 root root 0 Jul 20 12:40 nbd5 -> ../devices/virtual/block/nbd5
lrwxrwxrwx 1 root root 0 Jul 20 12:40 nbd6 -> ../devices/virtual/block/nbd6
lrwxrwxrwx 1 root root 0 Jul 20 12:40 nbd7 -> ../devices/virtual/block/nbd7
lrwxrwxrwx 1 root root 0 Jul 20 12:40 nbd8 -> ../devices/virtual/block/nbd8
lrwxrwxrwx 1 root root 0 Jul 20 12:40 nbd9 -> ../devices/virtual/block/nbd9
lrwxrwxrwx 1 root root 0 Jul 20 12:08 sda -> ../devices/pci0000:00/0000:00:09.0/0000:04:00.0/host0/target0:2:1/0:2:1:0/block/sda
lrwxrwxrwx 1 root root 0 Jul 20 12:08 sdb -> ../devices/pci0000:00/0000:00:09.0/0000:04:00.0/host0/target0:2:0/0:2:0:0/block/sdb
lrwxrwxrwx 1 root root 0 Jul 20 12:09 sdc -> ../devices/pci0000:00/0000:00:1d.7/usb2/2-4/2-4:1.0/host5/target5:0:0/5:0:0:0/block/sdc
After unplugging the USB drive and deleting the default-host pod, the service recovered.
Environment:
- OS (e.g. cat /etc/os-release): CentOS Linux release 7.9.2009 (Core)
- Kernel (e.g. uname -a): Linux sz-node-8-5 5.4.130-1.yn20221208.el7.x86_64 #1 SMP Thu Dec 8 12:09:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- Version: release/3.10(53a83b59ff23063007)
Closing this for now; let me verify again first.
The actual cause should be this log line: [fatal 2023-07-20 07:43:02 hostman.(*SHostService).RunService(host_services.go:79)] Host instance init error: Prepare environment: hugepage 1024 nr 0
The "no block device avaiable" error does not cause the service to exit.
This looks similar to https://github.com/yunionio/cloudpods/issues/17523.
cat /proc/meminfo | grep -i huge
AnonHugePages: 141312 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
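The two figures that matter in this output are HugePages_Total (no pages reserved) and Hugepagesize (the default size is still 2 MiB, not 1 GiB). A minimal sketch for pulling them out of a meminfo-style file follows; the function name hugepage_summary is hypothetical and not part of cloudpods.

```shell
# Summarize the huge page state from a meminfo-style file.
# $1: path to the file (defaults to /proc/meminfo).
hugepage_summary() {
    awk '
        /^HugePages_Total:/ { total = $2 }   # number of reserved huge pages
        /^Hugepagesize:/    { size = $2 }    # default huge page size in kB
        END { printf "total=%s size_kb=%s\n", total, size }
    ' "${1:-/proc/meminfo}"
}
```

For the values shown above, this prints total=0 size_kb=2048.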
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.4.130-1.yn20221208.el7.x86_64 root=/dev/mapper/centos-root ro rhgb quiet crashkernel=auto rdblacklist=nouveau mgag200.modeset=0 vfio_iommu_type1.allow_unsafe_interrupts=1 intel_iommu=on iommu=pt nouveau.modeset=0 hugepagesz=1G default_hugepagesz=1G
[ 0.464456] Kernel command line: BOOT_IMAGE=/vmlinuz-5.4.130-1.yn20221208.el7.x86_64 root=/dev/mapper/centos-root ro rhgb quiet crashkernel=auto rdblacklist=nouveau mgag200.modeset=0 vfio_iommu_type1.allow_unsafe_interrupts=1 intel_iommu=on iommu=pt nouveau.modeset=0 hugepagesz=1G default_hugepagesz=1G
[ 0.464777] hugepagesz: Unsupported page size 1024 M
cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
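It looks like this machine is too old. However, cloudpods does not seem to check whether the machine supports 1G huge pages and writes hugepagesz=1G to the kernel command line regardless. @zexi, would it be worth adding a check? (That said, we are still running machines from more than ten years ago, which is quite something 😅.)
I edited /etc/yunion/host.conf: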
hugepages_option: disable
After deleting the relevant pod and letting it restart, everything worked normally.
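The workaround above could be scripted roughly as follows. This is a minimal sketch, not part of cloudpods: the function name disable_hugepages_option is hypothetical, and it assumes host.conf holds one "key: value" setting per line, as the hugepages_option line suggests.

```shell
# Set hugepages_option to "disable" in a host.conf-style file.
# $1: path to the config file (the issue uses /etc/yunion/host.conf).
disable_hugepages_option() {
    conf=$1
    if grep -q '^hugepages_option:' "$conf"; then
        # Rewrite the existing setting via a temp file (portable, no sed -i).
        tmp=$(mktemp) || return 1
        sed 's/^hugepages_option:.*/hugepages_option: disable/' "$conf" > "$tmp" \
            && cat "$tmp" > "$conf"
        status=$?
        rm -f "$tmp"
        return "$status"
    else
        # Append the setting when it is not present yet.
        printf 'hugepages_option: disable\n' >> "$conf"
    fi
}

# Usage (as root on the affected node; pod name taken from the log command above):
#   disable_hugepages_option /etc/yunion/host.conf
#   kubectl -n onecloud delete pod default-host-pchvq
```

Deleting the pod only forces the host service to restart and re-read its configuration; the pod name will differ on other nodes.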
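@wanyaoqi please take a look at this. Can we detect whether the current operating system supports enabling hugepages? If we can, this problem could be avoided.
Agreed, we should check whether the machine supports 1G huge pages. I'll add that.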
# cat /proc/cpuinfo | grep pdpe1gb | head -n 1
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
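A check along these lines could combine the CPU flag with what the running kernel actually exposes. The sketch below is an assumption about how such a check might look, not the change that went into cloudpods. All three function names are hypothetical. It relies on the x86 pdpe1gb flag and on the kernel listing a hugepages-1048576kB directory under /sys/kernel/mm/hugepages when a 1 GiB pool is available.

```shell
# Return 0 if the cpuinfo-style file lists the pdpe1gb flag (x86 1 GiB pages).
# $1: path to the file (defaults to /proc/cpuinfo).
cpu_has_1g_pages() {
    grep -m 1 '^flags' "${1:-/proc/cpuinfo}" | grep -qw pdpe1gb
}

# Return 0 if the kernel exposes a 1 GiB huge page pool under sysfs.
# $1: sysfs hugepages directory (defaults to /sys/kernel/mm/hugepages).
kernel_has_1g_pool() {
    [ -d "${1:-/sys/kernel/mm/hugepages}/hugepages-1048576kB" ]
}

# Return 0 only when both the CPU flag and the kernel pool are present.
# $1: cpuinfo path, $2: sysfs hugepages directory (both optional).
supports_1g_hugepages() {
    cpu_has_1g_pages "$1" && kernel_has_1g_pool "$2"
}

# Usage:
#   supports_1g_hugepages && echo "1G huge pages supported" \
#       || echo "fall back to 2M pages or disable hugepages"
```

Checking sysfs in addition to the flag guards against cases where the flag is reported but the kernel did not register a 1 GiB pool.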