HAMi
On k3s, hami-device-plugin does not set runtimeClassName: nvidia, which causes the HAMi installation to fail
root@ubuntu:~# kubectl logs hami-device-plugin-wl6xz -n kube-system
Defaulted container "device-plugin" out of: device-plugin, vgpu-monitor
I0212 09:13:34.970923 1104831 flags.go:36] FLAG: --mig-strategy="none"
I0212 09:13:34.971002 1104831 flags.go:36] FLAG: --fail-on-init-error="true"
I0212 09:13:34.971011 1104831 flags.go:36] FLAG: --nvidia-driver-root="/"
I0212 09:13:34.971018 1104831 flags.go:36] FLAG: --pass-device-specs="true"
I0212 09:13:34.971024 1104831 flags.go:36] FLAG: --device-list-strategy="[envvar]"
I0212 09:13:34.971036 1104831 flags.go:36] FLAG: --device-id-strategy="uuid"
I0212 09:13:34.971042 1104831 flags.go:36] FLAG: --gds-enabled="false"
I0212 09:13:34.971048 1104831 flags.go:36] FLAG: --mofed-enabled="false"
I0212 09:13:34.971054 1104831 flags.go:36] FLAG: --config-file="/device-config.yaml"
I0212 09:13:34.971060 1104831 flags.go:36] FLAG: --cdi-annotation-prefix="cdi.k8s.io/"
I0212 09:13:34.971066 1104831 flags.go:36] FLAG: --nvidia-ctk-path="/usr/bin/nvidia-ctk"
I0212 09:13:34.971094 1104831 flags.go:36] FLAG: --container-driver-root="/driver-root"
I0212 09:13:34.971110 1104831 flags.go:36] FLAG: --v="4"
I0212 09:13:34.971130 1104831 flags.go:36] FLAG: --node-name="server1"
I0212 09:13:34.971142 1104831 flags.go:36] FLAG: --device-split-count="2"
I0212 09:13:34.971151 1104831 flags.go:36] FLAG: --device-memory-scaling="1"
I0212 09:13:34.971159 1104831 flags.go:36] FLAG: --device-cores-scaling="1"
I0212 09:13:34.971170 1104831 flags.go:36] FLAG: --disable-core-limit="false"
I0212 09:13:34.971177 1104831 flags.go:36] FLAG: --resource-name="nvidia.com/gpu"
I0212 09:13:34.971189 1104831 flags.go:36] FLAG: --help="false"
I0212 09:13:34.971194 1104831 flags.go:36] FLAG: --h="false"
I0212 09:13:34.971208 1104831 main.go:184] Starting FS watcher.
I0212 09:13:34.971278 1104831 main.go:194] Start working on node server1
I0212 09:13:34.971292 1104831 main.go:195] Starting OS watcher.
I0212 09:13:34.971788 1104831 main.go:210] Starting Plugins.
I0212 09:13:34.973253 1104831 main.go:268] Loading configuration.
I0212 09:13:34.976094 1104831 vgpucfg.go:104] flags= [--mig-strategy value the desired strategy for exposing MIG devices on GPUs that support it:
[none | single | mixed] (default: "none") [$MIG_STRATEGY] --fail-on-init-error fail the plugin if an error is encountered during initialization, otherwise block indefinitely (default: true) [$FAIL_ON_INIT_ERROR] --nvidia-driver-root value the root path for the NVIDIA driver installation (typical values are '/' or '/run/nvidia/driver') (default: "/") [$NVIDIA_DRIVER_ROOT] --pass-device-specs pass the list of DeviceSpecs to the kubelet on Allocate() (default: false) [$PASS_DEVICE_SPECS] --device-list-strategy value [ --device-list-strategy value ] the desired strategy for passing the device list to the underlying runtime:
[envvar | volume-mounts | cdi-annotations] (default: "envvar") [$DEVICE_LIST_STRATEGY] --device-id-strategy value the desired strategy for passing device IDs to the underlying runtime:
[uuid | index] (default: "uuid") [$DEVICE_ID_STRATEGY] --gds-enabled ensure that containers are started with NVIDIA_GDS=enabled (default: false) [$GDS_ENABLED] --mofed-enabled ensure that containers are started with NVIDIA_MOFED=enabled (default: false) [$MOFED_ENABLED] --config-file value the path to a config file as an alternative to command line options or environment variables [$CONFIG_FILE] --cdi-annotation-prefix value the prefix to use for CDI container annotation keys (default: "cdi.k8s.io/") [$CDI_ANNOTATION_PREFIX] --nvidia-ctk-path value the path to use for the nvidia-ctk in the generated CDI specification (default: "/usr/bin/nvidia-ctk") [$NVIDIA_CTK_PATH] --container-driver-root value the path where the NVIDIA driver root is mounted in the container; used for generating CDI specifications (default: "/driver-root") [$CONTAINER_DRIVER_ROOT] -v value number for the log level verbosity (default: 0) --node-name value node name (default: "server1") [$NodeName] --device-split-count value the number for NVIDIA device split (default: 2) [$DEVICE_SPLIT_COUNT] --device-memory-scaling value the ratio for NVIDIA device memory scaling (default: 1) [$DEVICE_MEMORY_SCALING] --device-cores-scaling value the ratio for NVIDIA device cores scaling (default: 1) [$DEVICE_CORES_SCALING] --disable-core-limit If set, the core utilization limit will be ignored (default: false) [$DISABLE_CORE_LIMIT] --resource-name value the name of field for number GPU visible in container (default: "nvidia.com/gpu") --help, -h show help]
I0212 09:13:34.976344 1104831 devices.go:375] Reading config file from path: /device-config.yaml
I0212 09:13:34.977024 1104831 devices.go:385] Successfully read and parsed config file
I0212 09:13:34.977046 1104831 vgpucfg.go:119] reading config= nvidia.com/gpu devcfg nvidia.com/gpu configfile= /device-config.yaml
I0212 09:13:34.977053 1104831 main.go:284] Updating config with default resource matching patterns.
config= [{* nvidia.com/gpu}]
I0212 09:13:34.977175 1104831 main.go:295]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "ResourceName": "nvidia.com/gpu",
  "DebugMode": null
}
I0212 09:13:34.977195 1104831 main.go:298] Retrieving plugins.
W0212 09:13:34.977512 1104831 factory.go:47] No valid resources detected, creating a null CDI handler
I0212 09:13:34.977583 1104831 factory.go:123] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0212 09:13:34.977620 1104831 factory.go:123] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0212 09:13:34.977636 1104831 factory.go:131] Incompatible platform detected
E0212 09:13:34.977642 1104831 factory.go:132] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0212 09:13:34.977648 1104831 factory.go:133] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0212 09:13:34.977654 1104831 factory.go:134] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0212 09:13:34.977660 1104831 factory.go:135] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0212 09:13:34.977860 1104831 main.go:153] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed
After I manually set runtimeClassName: nvidia on the device plugin, the installation succeeded. Is this a bug?
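For reference, the manual workaround can be sketched roughly as below. This is only an illustration under assumptions: the DaemonSet name `hami-device-plugin` is inferred from the pod name `hami-device-plugin-wl6xz`, the namespace is `kube-system` as in the log command, and a RuntimeClass named `nvidia` is assumed to already exist on the cluster. Adjust these to match the actual deployment.

```shell
#!/bin/sh
# Sketch of the manual workaround: set runtimeClassName on the device plugin pods.
# ASSUMPTIONS (not confirmed by the log output):
#   - the DaemonSet is named "hami-device-plugin" (inferred from the pod name)
#   - a RuntimeClass called "nvidia" already exists on the cluster
NAMESPACE="kube-system"
DAEMONSET="hami-device-plugin"
RUNTIME_CLASS="nvidia"

# JSON merge patch that sets spec.template.spec.runtimeClassName.
PATCH="{\"spec\":{\"template\":{\"spec\":{\"runtimeClassName\":\"${RUNTIME_CLASS}\"}}}}"
echo "patch: ${PATCH}"

if command -v kubectl >/dev/null 2>&1; then
  # Confirm the RuntimeClass exists before pointing pods at it.
  kubectl get runtimeclass "${RUNTIME_CLASS}" \
    || echo "RuntimeClass ${RUNTIME_CLASS} not found; create it first" >&2

  # Apply the patch; the DaemonSet controller then recreates the pods.
  kubectl -n "${NAMESPACE}" patch daemonset "${DAEMONSET}" \
    --type merge -p "${PATCH}"

  # Wait for the new pods, then re-check the logs for the NVML error.
  kubectl -n "${NAMESPACE}" rollout status daemonset "${DAEMONSET}" --timeout=120s
fi
```

A patch applied this way is likely to be overwritten by the next chart upgrade, so it is a stopgap rather than a fix. Whether the chart should set this field by default on k3s is the open question here.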