[Help] GPU devices are not detected correctly
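The host has two GPU cards, but only one of them is detected among the passthrough devices.
Both GPU cards are recognized only after passthrough devices are added manually.
OS version: Ubuntu 22.04.3 LTS
Cloudpods version: ocboot-3.11.0-20240117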
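@callme80 Please check the host agent log on the host machine and look for relevant information.
> @callme80 Please check the host agent log on the host machine and look for relevant information.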
[info 240126 03:12:09 procutils.WaitZombieLoop(zombie_others.go:36)] My pid is not 1 and no need to wait zombies [info 240126 03:12:09 options.parseOptions(options.go:331)] Use configuration file: /etc/yunion/host.conf [info 240126 03:12:09 options.parseOptions(options.go:354)] Set log level to "info" [info 2024-01-26 03:12:09 options.parseOptions(options.go:331)] Use configuration file: /etc/yunion/common/common.conf [info 2024-01-26 03:12:09 options.parseOptions(options.go:354)] Set log level to "info" [info 2024-01-26 03:12:09 hostman.(*SHostService).InitService(host_services.go:63)] exec socket path: /var/run/onecloud/exec.sock [info 2024-01-26 03:12:09 app.InitApp(app.go:32)] RequestWorkerCount: 8 [info 2024-01-26 03:12:09 appsrv.NewApplication(appsrv.go:120)] App hostId: O44UlR0JusveLHSNY11R94_mIwI= (host,dc03-node01-cloudpods-203,192.168.110.203) 2024/01/26 03:12:09 Allow hosts [] [info 2024-01-26 03:12:09 appsrv.(*Application).SetDefaultTimeout(appsrv.go:136)] adjust application default timeout to 60.000000 seconds [info 2024-01-26 03:12:09 hostinfo.DetectCpuInfo(hostinfohelper.go:77)] cpuinfo freq 800 [info 2024-01-26 03:12:09 hostinfo.NewHostInfo(hostinfo.go:2339)] CPU Model Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz Microcode 0xd0003b9 [info 2024-01-26 03:12:09 hostinfo.NewHostInfo(hostinfo.go:2359)] Get kubelet container image Fs: /opt/docker, eviction config: {"evictionHard":{"imagefs.available":{"Signal":"imagefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05}},"memory.available":{"Signal":"memory.available","Operator":"LessThan","Value":{"Quantity":"100Mi","Percentage":0}},"nodefs.available":{"Signal":"nodefs.available","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05}},"nodefs.inodesFree":{"Signal":"nodefs.inodesFree","Operator":"LessThan","Value":{"Quantity":null,"Percentage":0.05}}}} [error 2024-01-26 03:12:09 hostinfo.(*SHostInfo).prepareEnv(hostinfo.go:371)] tuned-adm profile virtual-host fail: exec: 
"tuned-adm": executable file not found in $PATH [info 2024-01-26 03:12:09 hostinfo.(*SHostInfo).prepareEnv(hostinfo.go:402)] I/O Scheduler switch to none [info 2024-01-26 03:12:09 fileutils2.ChangeBlkdevParameter(fileutils.go:202)] Set queue/scheduler of nvme1n1 to none [info 2024-01-26 03:12:09 hostinfo.(*SHostInfo).getKubeReservedMemMb(hostinfo.go:1523)] Kubelet memory threshold subtracted: 100MB [info 2024-01-26 03:12:09 hostinfo.(*SHostInfo).Init(hostinfo.go:196)] Start detectHostInfo [info 2024-01-26 03:12:09 hostinfo.(*SHostInfo).detectKVMMaxCpus(hostinfo.go:863)] KVM API VERSION 12 [info 2024-01-26 03:12:09 hostinfo.(*SHostInfo).detectKVMMaxCpus(hostinfo.go:868)] KVM CAP MAX VCPUS: 1024 [info 2024-01-26 03:12:09 hostinfo.(*SHostInfo).detectKVMMaxCpus(hostinfo.go:876)] KVM CAP NR VCPUS: 710 [info 2024-01-26 03:12:09 sysutils.detectNestSupport(kvm.go:146)] Host is support kvm nest ... [info 2024-01-26 03:12:09 sysutils.detectNestSupport(kvm.go:151)] Host kvm nest is enabled ... [error 2024-01-26 03:12:09 hostinfo.(*SHostInfo).detectOsDist(hostinfo.go:746)] exit status 1 [info 2024-01-26 03:12:09 hostinfo.(*SHostInfo).detectOsDist(hostinfo.go:756)] DetectOsDist [error 2024-01-26 03:12:09 hostinfo.(*SHostInfo).detectOsDist(hostinfo.go:758)] Failed to detect distribution info [warning 2024-01-26 03:12:09 hostinfo.(*SHostInfo).detectOsDist(hostinfo.go:777)] system_service.SetOpenvswitchName to openvswitch-switch [info 2024-01-26 03:12:09 hostinfo.(*SHostInfo).detectQemuVersion(hostinfo.go:830)] Detect qemu version is 4.2.0 [info 2024-01-26 03:12:09 hostinfo.(*SHostInfo).detectOvsVersion(hostinfo.go:971)] Detect OVS version is 2.12.4 [info 2024-01-26 03:12:09 hostinfo.(*SHostInfo).detectOvsKOVersion(hostinfo.go:988)] kernel module openvswitch vermagic: 5.15.0-91-generic SMP mod_unload modversions [info 2024-01-26 03:12:09 hostinfo.(*SHostInfo).Init(hostinfo.go:205)] Start parseConfig [info 2024-01-26 03:12:09 hostinfo.NewNIC(hostinfohelper.go:233)] IP /br0/ens2f0 
[info 2024-01-26 03:12:09 netutils2.(*SNetInterface).IsSecretAddress(netutils.go:296)] MASK --- ���� [info 2024-01-26 03:12:09 hostinfo.NewNIC(hostinfohelper.go:283)] Confirm to configuration!! [info 2024-01-26 03:12:09 hostinfo.(*SNIC).SetupDhcpRelay(hostinfohelper.go:202)] Not enable dhcp relay on nic: &hostinfo.SNIC{Inter:"ens2f0", Bridge:"br0", Ip:"", Network:"bcast0", WireId:"", Mask:0, Bandwidth:1000, BridgeDev:(*hostbridge.SOVSBridgeDriver)(0xc001698f90), dhcpServer:(*hostdhcp.SGuestDHCPServer)(0xc0016998f0)} [info 2024-01-26 03:12:09 hostinfo.(*SHostInfo).setupOvnChassis(hostinfo.go:223)] Start setting up ovn chassis [info 2024-01-26 03:12:10 hostman.(*SHostService).RunService.func1(host_services.go:84)] Auth complete!! [info 2024-01-26 03:12:10 policy.(*SPolicyManager).init(policy.go:136)] policy fetch worker count 1 [info 2024-01-26 03:12:10 consts.SetNonDefaultDomainProjects(consts.go:109)] set non_default_domain_projects to false [info 2024-01-26 03:12:10 watcher.(*SInformerSyncManager).startWatcher(watcher.go:83)]EndpointChangeManager: Start resource informer watcher for endpoint [info 2024-01-26 03:12:10 storageman.StartSnapshotRecycle(storage_base.go:491)] Snapshot recyle job started [info 2024-01-26 03:12:10 guestman.NewGuestCpuSetCounter(guesthelper.go:198)] NewGuestCpuSetCounter from topo: 
{"architecture":"numa","nodes":[{"caches":[{"level":1,"logical_processors":[0,56],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[1,57],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[2,58],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[3,59],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[4,60],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[5,61],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[6,62],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[7,63],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[8,64],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[9,65],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[10,66],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[11,67],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[12,68],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[13,69],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[14,70],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[15,71],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[16,72],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[17,73],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[18,74],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[19,75],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[20,76],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[21,77],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[22,78],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[23,79],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_pro
cessors":[24,80],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[25,81],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[26,82],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[27,83],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[0,56],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[1,57],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[2,58],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[3,59],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[4,60],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[5,61],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[6,62],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[7,63],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[8,64],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[9,65],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[10,66],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[11,67],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[12,68],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[13,69],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[14,70],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[15,71],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[16,72],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[17,73],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[18,74],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[19,75],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[20,76],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[21,77],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[22,78],"size_bytes":32768,"type":"data"},{"level
":1,"logical_processors":[23,79],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[24,80],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[25,81],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[26,82],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[27,83],"size_bytes":32768,"type":"data"},{"level":2,"logical_processors":[0,56],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[1,57],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[2,58],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[3,59],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[4,60],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[5,61],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[6,62],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[7,63],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[8,64],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[9,65],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[10,66],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[11,67],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[12,68],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[13,69],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[14,70],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[15,71],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[16,72],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[17,73],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[18,74],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[19,75],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[20,76],"size_bytes":1310720,"type"
:"unified"},{"level":2,"logical_processors":[21,77],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[22,78],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[23,79],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[24,80],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[25,81],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[26,82],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[27,83],"size_bytes":1310720,"type":"unified"},{"level":3,"logical_processors":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83],"size_bytes":44040192,"type":"unified"}],"cores":[{"id":0,"index":0,"logical_processors":[0,56],"total_threads":2},{"id":1,"index":1,"logical_processors":[1,57],"total_threads":2},{"id":10,"index":2,"logical_processors":[10,66],"total_threads":2},{"id":11,"index":3,"logical_processors":[11,67],"total_threads":2},{"id":12,"index":4,"logical_processors":[12,68],"total_threads":2},{"id":13,"index":5,"logical_processors":[13,69],"total_threads":2},{"id":14,"index":6,"logical_processors":[14,70],"total_threads":2},{"id":15,"index":7,"logical_processors":[15,71],"total_threads":2},{"id":16,"index":8,"logical_processors":[16,72],"total_threads":2},{"id":17,"index":9,"logical_processors":[17,73],"total_threads":2},{"id":18,"index":10,"logical_processors":[18,74],"total_threads":2},{"id":19,"index":11,"logical_processors":[19,75],"total_threads":2},{"id":2,"index":12,"logical_processors":[2,58],"total_threads":2},{"id":20,"index":13,"logical_processors":[20,76],"total_threads":2},{"id":21,"index":14,"logical_processors":[21,77],"total_threads":2},{"id":22,"index":15,"logical_processors":[22,78],"total_threads":2},{"id":23,"index":16,"logical_processors":[23,79],"total_threads":2},{"id":24,"index":17,"logical_processors":[24,80],"tot
al_threads":2},{"id":25,"index":18,"logical_processors":[25,81],"total_threads":2},{"id":26,"index":19,"logical_processors":[26,82],"total_threads":2},{"id":27,"index":20,"logical_processors":[27,83],"total_threads":2},{"id":3,"index":21,"logical_processors":[3,59],"total_threads":2},{"id":4,"index":22,"logical_processors":[4,60],"total_threads":2},{"id":5,"index":23,"logical_processors":[5,61],"total_threads":2},{"id":6,"index":24,"logical_processors":[6,62],"total_threads":2},{"id":7,"index":25,"logical_processors":[63,7],"total_threads":2},{"id":8,"index":26,"logical_processors":[64,8],"total_threads":2},{"id":9,"index":27,"logical_processors":[65,9],"total_threads":2}],"distances":[10,20],"id":0,"memory":{"supported_page_sizes":[1073741824,2097152],"total_physical_bytes":824633720832,"total_usable_bytes":810988961792}},{"caches":[{"level":1,"logical_processors":[28,84],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[29,85],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[30,86],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[31,87],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[32,88],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[33,89],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[34,90],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[35,91],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[36,92],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[37,93],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[38,94],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[39,95],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[40,96],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[41,97],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_
processors":[42,98],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[43,99],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[44,100],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[45,101],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[46,102],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[47,103],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[48,104],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[49,105],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[50,106],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[51,107],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[52,108],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[53,109],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[54,110],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[55,111],"size_bytes":32768,"type":"instruction"},{"level":1,"logical_processors":[28,84],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[29,85],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[30,86],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[31,87],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[32,88],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[33,89],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[34,90],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[35,91],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[36,92],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[37,93],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[38,94],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[39,95],"size_bytes":32768,"
type":"data"},{"level":1,"logical_processors":[40,96],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[41,97],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[42,98],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[43,99],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[44,100],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[45,101],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[46,102],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[47,103],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[48,104],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[49,105],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[50,106],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[51,107],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[52,108],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[53,109],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[54,110],"size_bytes":32768,"type":"data"},{"level":1,"logical_processors":[55,111],"size_bytes":32768,"type":"data"},{"level":2,"logical_processors":[28,84],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[29,85],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[30,86],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[31,87],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[32,88],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[33,89],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[34,90],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[35,91],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[36,92],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[37,93],"size_bytes":1310720,"type":"unified"},
{"level":2,"logical_processors":[38,94],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[39,95],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[40,96],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[41,97],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[42,98],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[43,99],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[44,100],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[45,101],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[46,102],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[47,103],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[48,104],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[49,105],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[50,106],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[51,107],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[52,108],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[53,109],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[54,110],"size_bytes":1310720,"type":"unified"},{"level":2,"logical_processors":[55,111],"size_bytes":1310720,"type":"unified"},{"level":3,"logical_processors":[28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111],"size_bytes":44040192,"type":"unified"}],"cores":[{"id":16,"index":0,"logical_processors":[100,44],"total_threads":2},{"id":17,"index":1,"logical_processors":[101,45],"total_threads":2},{"id":18,"index":2,"logical_processors":[102,46],"total_threads":2},{"id":19,"index":3,"logical_processors":[103,47],"total_threads":2},{"id":20,"index":4,"logical_pr
ocessors":[104,48],"total_threads":2},{"id":21,"index":5,"logical_processors":[105,49],"total_threads":2},{"id":22,"index":6,"logical_processors":[106,50],"total_threads":2},{"id":23,"index":7,"logical_processors":[107,51],"total_threads":2},{"id":24,"index":8,"logical_processors":[108,52],"total_threads":2},{"id":25,"index":9,"logical_processors":[109,53],"total_threads":2},{"id":26,"index":10,"logical_processors":[110,54],"total_threads":2},{"id":27,"index":11,"logical_processors":[111,55],"total_threads":2},{"id":0,"index":12,"logical_processors":[28,84],"total_threads":2},{"id":1,"index":13,"logical_processors":[29,85],"total_threads":2},{"id":2,"index":14,"logical_processors":[30,86],"total_threads":2},{"id":3,"index":15,"logical_processors":[31,87],"total_threads":2},{"id":4,"index":16,"logical_processors":[32,88],"total_threads":2},{"id":5,"index":17,"logical_processors":[33,89],"total_threads":2},{"id":6,"index":18,"logical_processors":[34,90],"total_threads":2},{"id":7,"index":19,"logical_processors":[35,91],"total_threads":2},{"id":8,"index":20,"logical_processors":[36,92],"total_threads":2},{"id":9,"index":21,"logical_processors":[37,93],"total_threads":2},{"id":10,"index":22,"logical_processors":[38,94],"total_threads":2},{"id":11,"index":23,"logical_processors":[39,95],"total_threads":2},{"id":12,"index":24,"logical_processors":[40,96],"total_threads":2},{"id":13,"index":25,"logical_processors":[41,97],"total_threads":2},{"id":14,"index":26,"logical_processors":[42,98],"total_threads":2},{"id":15,"index":27,"logical_processors":[43,99],"total_threads":2}],"distances":[20,10],"id":1,"memory":{"supported_page_sizes":[1073741824,2097152],"total_physical_bytes":826781204480,"total_usable_bytes":811726753792}}]} [info 2024-01-26 03:12:10 guestman.(*SGuestManager).LoadExistingGuests(guestman.go:404)] Find existing guest 7d563159-790b-4490-890f-ceb2a00a140f [info 2024-01-26 03:12:10 guestman.(*SGuestManager).LoadExistingGuests(guestman.go:404)] Find existing 
guest c19f8f29-b2cd-490b-8e4c-1b5a1b072305 [info 2024-01-26 03:12:10 hostdhcp.(*SGuestDHCPServer).Start(dhcpserver.go:72)] SGuestDHCPServer starting ... [info 2024-01-26 03:12:10 guestman.(*SGuestManager).InitQemuMaxCpus(guestman.go:143)] KVM max cpus count: 710 [info 2024-01-26 03:12:10 guestman.(*SGuestManager).InitQemuMaxCpus(guestman.go:161)] Machine type pc max cpus: 240 [info 2024-01-26 03:12:10 guestman.(*SGuestManager).InitQemuMaxCpus(guestman.go:161)] Machine type q35 max cpus: 240 [info 2024-01-26 03:12:10 guestman.(*SGuestManager).InitPythonPath(guestman.go:183)] No python found : exit status 1 [info 2024-01-26 03:12:10 guestman.(*SGuestManager).InitPythonPath.func1(guestman.go:176)] Python path /usr/bin/python3 [info 2024-01-26 03:12:10 informer.(*EtcdBackendForClient).StartClientWatch(etcd_client.go:84)] /onecloud/informer watched [info 2024-01-26 03:12:10 informer.NewWatchManagerBySessionBg.func1(watcher.go:51)] callback with watchMan success. [info 2024-01-26 03:12:12 hostinfo.(*SHostInfo).ensureMasterNetworks(hostinfo.go:1206)] Master ip 192.168.110.203 to fetch wire [info 2024-01-26 03:12:12 hostinfo.(*SHostInfo).initZoneInfo(hostinfo.go:1248)] Start GetZoneInfo aa8598a5-8664-4c84-8493-6b44489c4531 [info 2024-01-26 03:12:12 hostinfo.(*SHostInfo).initHostRecord(hostinfo.go:1154)] host health manager on host down [info 2024-01-26 03:12:12 hostinfo.(*SHostInfo).doSendPhysicalNicInfo(hostinfo.go:1660)] upload physical nic: ens27f0(04:32:01:97:2a:90) [info 2024-01-26 03:12:12 hostinfo.(*SHostInfo).doUploadNicInfoInternal(hostinfo.go:1673)] Upload NIC br: if:ens27f0 [info 2024-01-26 03:12:12 hostinfo.(*SHostInfo).doSendPhysicalNicInfo(hostinfo.go:1660)] upload physical nic: ens27f1(04:32:01:97:2a:91) [info 2024-01-26 03:12:12 hostinfo.(*SHostInfo).doUploadNicInfoInternal(hostinfo.go:1673)] Upload NIC br: if:ens27f1 [info 2024-01-26 03:12:12 hostinfo.(*SHostInfo).doSendPhysicalNicInfo(hostinfo.go:1660)] upload physical nic: ens27f2(04:32:01:97:2a:92) 
[info 2024-01-26 03:12:12 hostinfo.(*SHostInfo).doUploadNicInfoInternal(hostinfo.go:1673)] Upload NIC br: if:ens27f2 [info 2024-01-26 03:12:12 hostinfo.(*SHostInfo).doSendPhysicalNicInfo(hostinfo.go:1660)] upload physical nic: ens27f3(04:32:01:97:2a:93) [info 2024-01-26 03:12:12 hostinfo.(*SHostInfo).doUploadNicInfoInternal(hostinfo.go:1673)] Upload NIC br: if:ens27f3 [info 2024-01-26 03:12:12 hostinfo.(*SHostInfo).doSendPhysicalNicInfo(hostinfo.go:1660)] upload physical nic: ens2f0(6c:fe:54:6b:ac:a4) [info 2024-01-26 03:12:12 hostinfo.(*SHostInfo).doUploadNicInfoInternal(hostinfo.go:1673)] Upload NIC br: if:ens2f0 [info 2024-01-26 03:12:13 hostinfo.(*SHostInfo).doSendPhysicalNicInfo(hostinfo.go:1660)] upload physical nic: ens2f1(6c:fe:54:6b:ac:a5) [info 2024-01-26 03:12:13 hostinfo.(*SHostInfo).doUploadNicInfoInternal(hostinfo.go:1673)] Upload NIC br: if:ens2f1 [info 2024-01-26 03:12:13 isolated_device.(*isolatedDeviceManager).probeCustomPCIDevs(isolated_device.go:184)] Add general pci device: 0 => &isolated_device.sGeneralPCIDevice{sBaseDevice:(*isolated_device.sBaseDevice)(0xc0022f5220)} [info 2024-01-26 03:12:13 isolated_device.(*isolatedDeviceManager).probeCustomPCIDevs(isolated_device.go:184)] Add general pci device: 1 => &isolated_device.sGeneralPCIDevice{sBaseDevice:(*isolated_device.sBaseDevice)(0xc0022f57c0)} [info 2024-01-26 03:12:13 isolated_device.getPassthroughGPUS(gpu.go:75)] filter address [b4:00.1 b5:00.1] [warning 2024-01-26 03:12:13 isolated_device.(*IOMMUGroup).ListDevices(gpu.go:462)] Skip append "0000:02:00.0" iommu_group[19] device {"bus_id":"01:00.0","class_code":"0604","class_name":"PCI bridge","device_id":"0120","device_name":"x1 PCIe Gen2 Bridge[Pilot4]","vendor_id":"19a2","vendor_name":"Emulex Corporation"} [info 2024-01-26 03:12:13 isolated_device.(*PCIDevice).IsBootVGA(gpu.go:307)] PCI address 02:00.0 is boot_vga: /sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/0000:02:00.0/boot_vga [info 2024-01-26 03:12:13 
isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:325)] &isolated_device.PCIDevice{Addr:"02:00.0", ClassName:"VGA compatible controller", ClassCode:"0300", VendorName:"Matrox Electronics Systems Ltd.", VendorId:"102b", DeviceName:"MGA G200e [Pilot] ServerEngines (SEP1)", DeviceId:"0522", SubvendorName:"Emulex Corporation", SubvendorId:"19a2", SubdeviceName:"MGA G200e [Pilot] ServerEngines (SEP1)", SubdeviceId:"0101", ModelName:"Pilot", RestIOMMUGroupDevs:[]*isolated_device.PCIDevice{}} is boot vga card, skip it [warning 2024-01-26 03:12:15 isolated_device.getPassthroughGPUS(gpu.go:102)] GPU {"bus_id":"02:00.0","class_code":"0300","class_name":"VGA compatible controller","device_id":"0522","device_name":"MGA G200e [Pilot] ServerEngines (SEP1)","model_name":"Pilot","subdevice_id":"0101","subdevice_name":"MGA G200e [Pilot] ServerEngines (SEP1)","subvendor_id":"19a2","subvendor_name":"Emulex Corporation","vendor_id":"102b","vendor_name":"Matrox Electronics Systems Ltd."} use kernel driver "mgag200", skip it [info 2024-01-26 03:12:15 isolated_device.(*isolatedDeviceManager).probeGPUS(isolated_device.go:161)] Add GPU device: 0 => &isolated_device.PCIDevice{Addr:"b4:00.0", ClassName:"VGA compatible controller", ClassCode:"0300", VendorName:"NVIDIA Corporation", VendorId:"10de", DeviceName:"Device", DeviceId:"2684", SubvendorName:"NVIDIA Corporation", SubvendorId:"10de", SubdeviceName:"Device", SubdeviceId:"16f3", ModelName:"", RestIOMMUGroupDevs:[]*isolated_device.PCIDevice{(*isolated_device.PCIDevice)(0xc0015e4540)}} [info 2024-01-26 03:12:15 isolated_device.SyncDeviceInfo(isolated_device.go:475)] Update fb236eba-7868-45a9-8a67-bc9a6c5bd99c isolated_device: {"addr":"b4:00.1","detected_on_host":true,"dev_type":"GPU","guest_id":"7d563159-790b-4490-890f-ceb2a00a140f","host_id":"93a1e44e-36b7-4e0e-8d7c-4476017176ca","id":"fb236eba-7868-45a9-8a67-bc9a6c5bd99c","model":"Nvidia-4090","vendor_device_id":"10de:22ba"} [info 2024-01-26 03:12:15 
isolated_device.SyncDeviceInfo(isolated_device.go:475)] Update 03996d3e-f0d9-48b0-82e1-04b3a61e4046 isolated_device: {"addr":"b5:00.1","detected_on_host":true,"dev_type":"GPU","guest_id":"7d563159-790b-4490-890f-ceb2a00a140f","host_id":"93a1e44e-36b7-4e0e-8d7c-4476017176ca","id":"03996d3e-f0d9-48b0-82e1-04b3a61e4046","model":"Nvidia-4090","vendor_device_id":"10de:22ba"} [info 2024-01-26 03:12:16 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:329)] {"bus_id":"b4:00.0","class_code":"0300","class_name":"VGA compatible controller","device_id":"2684","device_name":"Device","subdevice_id":"16f3","subdevice_name":"Device","subvendor_id":"10de","subvendor_name":"NVIDIA Corporation","vendor_id":"10de","vendor_name":"NVIDIA Corporation"} already use vfio-pci driver [info 2024-01-26 03:12:16 isolated_device.SyncDeviceInfo(isolated_device.go:475)] Update 4e00fef2-708a-4e45-8539-c252c49eca08 isolated_device: {"addr":"b4:00.0","detected_on_host":true,"dev_type":"GPU-HPC","host_id":"93a1e44e-36b7-4e0e-8d7c-4476017176ca","id":"4e00fef2-708a-4e45-8539-c252c49eca08","model":"Device","vendor_device_id":"10de:2684"} [info 2024-01-26 03:12:16 hostinfo.(*SHostInfo).initStoragesInternal(hostinfo.go:1810)] Storage host_192.168.110.203_local_storage_0(local) mountpoint /opt/cloud/workspace/disks [info 2024-01-26 03:12:16 storageman.(*SLocalStorage).GetAvailSizeMb(storage_local.go:209)] Storage /opt/cloud/workspace/disks and kubelet imageFs /opt/docker share same device /dev/nvme1n1p3 [info 2024-01-26 03:12:16 storageman.(*SLocalStorage).GetAvailSizeMb(storage_local.go:221)] Storage /opt/cloud/workspace/disks sizeMb 4940410, usablePercent 0.950000 [info 2024-01-26 03:12:16 storageman.(*SLocalStorage).SyncStorageInfo(storage_local.go:255)] Sync storage info 51ad991f-2497-4a48-8926-2ce90265a50b/host_192.168.110.203_local_storage_0 [info 2024-01-26 03:12:16 hostinfo.(*SHostInfo).onSyncStorageInfoSucc(hostinfo.go:1889)] storage id 51ad991f-2497-4a48-8926-2ce90265a50b [info 
2024-01-26 03:12:16 hostinfo.(*SHostInfo).onSucc(hostinfo.go:2038)] Host registration process success.... [info 2024-01-26 03:12:16 guestman.(*SGuestManager).Bootstrap(guestman.go:240)] Loading existing guests ... [info 2024-01-26 03:12:16 hostinfo.(*SHostPingTask).Start(hostpinger.go:76)] Start host pinger ... [info 2024-01-26 03:12:16 guestman.(*SKVMGuestInstance).ImportServer(qemu-kvm.go:697)] bbb(7d563159-790b-4490-890f-ceb2a00a140f) is stopped, pending_delete=false [info 2024-01-26 03:12:16 guestman.(*SGuestManager).OnLoadExistingGuestsComplete(guestman.go:311)] Load existing guests complete... *[error 2024-01-26 03:12:16 hostinfo.(SHostInfo).PutHostOnline(hostinfo.go:1552)] Host sys error: map[isolated_devices:[{isolated_devices force bind vfio-pci driver b5:00.0: bind {"bus_id":"b5:00.1","class_code":"0403","class_name":"Audio device","device_id":"22ba","device_name":"Device","subdevice_id":"16f3","subdevice_name":"Device","subvendor_id":"10de","subvendor_name":"NVIDIA Corporation","vendor_id":"10de","vendor_name":"NVIDIA Corporation"} vfio-pci driver: bindDriver: write /sys/bus/pci/drivers/vfio-pci/new_id: file exists 2024-01-26 03:12:15.328537819 +0000 UTC m=+6.081455440} {isolated_devices GPU 02:00.0 use kernel driver mgag200, skip it 2024-01-26 03:12:15.328538565 +0000 UTC m=+6.081456186} {isolated_devices b5:00.1 Nvidia-4090 CustomProbe failed bind driver: write /sys/bus/pci/drivers/vfio-pci/new_id: file exists 2024-01-26 03:12:15.404044314 +0000 UTC m=+6.156961935}]] [info 2024-01-26 03:12:16 guestman.(*SKVMGuestInstance).ImportServer(qemu-kvm.go:697)] aaaa(c19f8f29-b2cd-490b-8e4c-1b5a1b072305) is stopped, pending_delete=false [info 2024-01-26 03:12:16 app.InitApp(app.go:32)] RequestWorkerCount: 8 [info 2024-01-26 03:12:16 appsrv.NewApplication(appsrv.go:120)] App hostId: O44UlR0JusveLHSNY11R94_mIwI= (host,dc03-node01-cloudpods-203,192.168.110.203) 2024/01/26 03:12:16 Allow hosts [] [info 2024-01-26 03:12:16 
appsrv.(*Application).SetDefaultTimeout(appsrv.go:136)] adjust application default timeout to 60.000000 seconds [info 2024-01-26 03:12:16 app.ServeForeverExtended(app.go:60)] Start listen on https://0.0.0.0:8885, isMaster: true [info 2024-01-26 03:12:16 metadata.Start(metadatahandler.go:46)] Start metadata service on http://0.0.0.0:9885 [info 2024-01-26 03:13:09 ovnutils.configBridgeMtu.func1(ovnutils.go:42)] set brvpc MTU to 1500 success!
*[error 2024-01-26 03:12:16 hostinfo.(SHostInfo).PutHostOnline(hostinfo.go:1552)] Host sys error: map[isolated_devices:[{isolated_devices force bind vfio-pci driver b5:00.0: bind {"bus_id":"b5:00.1","class_code":"0403","class_name":"Audio device","device_id":"22ba","device_name":"Device","subdevice_id":"16f3","subdevice_name":"Device","subvendor_id":"10de","subvendor_name":"NVIDIA Corporation","vendor_id":"10de","vendor_name":"NVIDIA Corporation"} vfio-pci driver: bindDriver: write /sys/bus/pci/drivers/vfio-pci/new_id: file exists 2024-01-26 03:12:15.328537819 +0000 UTC m=+6.081455440} {isolated_devices GPU 02:00.0 use kernel driver mgag200, skip it 2024-01-26 03:12:15.328538565 +0000 UTC m=+6.081456186} {isolated_devices b5:00.1 Nvidia-4090 CustomProbe failed bind driver: write /sys/bus/pci/drivers/vfio-pci/new_id: file exists 2024-01-26 03:12:15.404044314 +0000 UTC m=+6.156961935}]]
So far, this error looks like a failure while binding the vfio driver.
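What is being detected right now is one function of the GPU, not the actual GPU device. Your 4090 cards should be at b4:00.0 and b5:00.0, not b4:00.1 and b5:00.1, so the device and vendor IDs you entered are probably wrong. Delete these two custom devices, restart host-agent, and check the log again. My guess is that the problem is in binding the vfio driver.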
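To make the function distinction above concrete, here is a minimal Python sketch that picks the display-class functions (PCI class 03xx) out of `lspci -nn` output and ignores the audio function (class 0403) that sits on the same card. The sample text is illustrative: it is built from the addresses and IDs that appear in the log, and it assumes b5:00.0 reports the same 10de:2684 ID as b4:00.0. The names `parse_lspci_nn` and `passthrough_candidates` are hypothetical helpers, not part of Cloudpods.

```python
import re

# One device line of `lspci -nn`, for example:
# "b4:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684]"
LINE_RE = re.compile(
    r"^(?P<addr>(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r"(?P<class_name>.+?)\s+\[(?P<class_code>[0-9a-f]{4})\]:\s+"
    r".*\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)

# Illustrative sample (assumption: both cards expose the same pair of functions).
SAMPLE_NN = """\
b4:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684]
b4:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22ba]
b5:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684]
b5:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22ba]
"""


def parse_lspci_nn(text):
    """Return one dict per PCI function found in `lspci -nn` output."""
    funcs = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            funcs.append(m.groupdict())
    return funcs


def passthrough_candidates(text, vendor="10de"):
    """List (address, vendor:device) for display-class functions of a vendor."""
    return sorted(
        (f["addr"], f"{f['vendor']}:{f['device']}")
        for f in parse_lspci_nn(text)
        if f["vendor"] == vendor and f["class_code"].startswith("03")
    )


for addr, ids in passthrough_candidates(SAMPLE_NN):
    print(addr, ids)
```

In this sample only the `.0` functions are reported, which matches the advice above: a custom device entry keyed on 10de:22ba would point at the audio function rather than at the GPU itself.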
I updated the GPU model information with the following commands:
yum install pciutils
update-pciids
lspci -nnk | grep NVIDIA -A 3
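The output of the last command can also be checked programmatically. The following is a rough Python sketch, not an official tool, that maps each PCI address in `lspci -nnk` output to the driver named on its "Kernel driver in use:" line, or to `None` when no such line follows the device. The sample text is an assumption assembled from values in the log (b4:00.0 already bound to vfio-pci, subsystem 10de:16f3); the real output on this host may differ, and `drivers_in_use` is a hypothetical helper name.

```python
import re

# A device line starts at column 0 with the PCI address; detail lines are indented.
DEVICE_RE = re.compile(r"^(?P<addr>(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")
DRIVER_PREFIX = "Kernel driver in use:"

# Illustrative sample (assumption: b5:00.1 has no driver bound).
SAMPLE_NNK = """\
b4:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:2684]
\tSubsystem: NVIDIA Corporation Device [10de:16f3]
\tKernel driver in use: vfio-pci
--
b5:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:22ba]
\tSubsystem: NVIDIA Corporation Device [10de:16f3]
"""


def drivers_in_use(text):
    """Map PCI address -> bound kernel driver (None if no driver line is present)."""
    drivers = {}
    current = None
    for line in text.splitlines():
        m = DEVICE_RE.match(line)
        if m:
            # New device block: remember its address until the next device line.
            current = m["addr"]
            drivers[current] = None
        elif current and line.strip().startswith(DRIVER_PREFIX):
            drivers[current] = line.strip()[len(DRIVER_PREFIX):].strip()
    return drivers


for addr, driver in drivers_in_use(SAMPLE_NNK).items():
    print(addr, driver or "(no driver bound)")
```

Lines such as the `--` separators that `grep -A 3` inserts between matches are simply skipped, so the piped output of the command above can be fed to the function directly.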
If you do not provide feedback for more than 37 days, we will close the issue and you can either reopen it or submit a new issue.


