k8s-device-plugin
Run Locally instructions fail
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Issue or feature description
I followed the QuickStart using Ubuntu 18.04. Under the "With Docker" section I pulled the prebuilt image with
docker pull nvidia/k8s-device-plugin:1.0.0-beta6
Then, following the instructions under "Run Locally", I ran
docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.0.0-beta6
This produced the following error messages:
2020/06/01 17:19:46 Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?
2020/06/01 17:19:46 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2020/06/01 17:19:46 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2020/06/01 17:19:46 Could not start device plugin for 'nvidia.com/gpu': listen unix /var/lib/kubelet/device-plugins/nvidia.sock: bind: address already in use
Does this mean that I should not execute the instructions under the "With Docker" section after executing the Quick Start instructions?
When I deployed the sample .yaml from the Quick Start, my pod got stuck in the Pending state. I then tried the instructions under "With Docker" to see if that would move the pod out of Pending. It's not clear from the readme which steps are necessary to run the sample .yaml file, but since the pod is stuck in Pending there must be some missing steps.
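For reference, the kind of manifest the Quick Start asks you to deploy looks roughly like the sketch below; the pod name, image and command are illustrative assumptions rather than the exact README sample, but the essential part is the nvidia.com/gpu limit, which the scheduler can only satisfy once the device plugin has registered and is advertising GPUs:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod                      # assumed name for illustration
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-container
      image: nvidia/cuda:10.2-base   # assumed image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1          # pod stays Pending until a node advertises this resource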
Common error checking:
- [ ] The output of nvidia-smi -a on your host:
==============NVSMI LOG==============
Timestamp : Mon Jun 1 11:39:09 2020
Driver Version : 440.59
CUDA Version : 10.2
Attached GPUs : 1
GPU 00000000:09:00.0
Product Name : Quadro P6000
Product Brand : Quadro
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0323317053327
GPU UUID : GPU-5098b849-e803-325f-eca0-392e924f5111
Minor Number : 0
VBIOS Version : 86.02.2D.00.04
MultiGPU Board : No
Board ID : 0x900
GPU Part Number : 900-5G611-0000-000
Inforom Version
Image Version : G611.0500.00.02
OEM Object : 1.1
ECC Object : 4.1
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x09
Device : 0x00
Domain : 0x0000
Device Id : 0x1B3010DE
Bus Id : 00000000:09:00.0
Sub System Id : 0x11A010DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 26 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 24449 MiB
Used : 0 MiB
Free : 24449 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Temperature
GPU Current Temp : 17 C
GPU Shutdown Temp : 100 C
GPU Slowdown Temp : 97 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 9.36 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 139 MHz
SM : 139 MHz
Memory : 405 MHz
Video : 544 MHz
Applications Clocks
Graphics : 1506 MHz
Memory : 4513 MHz
Default Applications Clocks
Graphics : 1506 MHz
Memory : 4513 MHz
Max Clocks
Graphics : 1657 MHz
SM : 1657 MHz
Memory : 4513 MHz
Video : 1493 MHz
Max Customer Boost Clocks
Graphics : 1657 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
- [ ] Your docker configuration file (e.g: /etc/docker/daemon.json):
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
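Assuming this is the daemon.json the node's Docker daemon actually loaded, one quick way to confirm the runtime took effect is:

docker info | grep -i runtime

The Docker Info line in the kubelet log further down already shows Runtimes:map[nvidia:... runc:...] and DefaultRuntime:nvidia, so the runtime configuration itself looks fine.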
- [ ] The k8s-device-plugin container logs: Where are these logs?
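If the plugin was deployed via the Quick Start daemonset, its logs can be pulled with kubectl; the pod name below is the one that appears in the kubelet log further down and will differ on other nodes:

kubectl get pods -n kube-system | grep nvidia-device-plugin
kubectl logs -n kube-system nvidia-device-plugin-daemonset-ndv85

For the Run Locally case the output is simply what the docker run command prints to the terminal (also retrievable with docker logs <container>).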
- [ ] The kubelet logs on the node (e.g:
sudo journalctl -r -u kubelet) Jun 01 11:41:18 fabricnode1 systemd[1]: Stopping kubelet: The Kubernetes Node Agent... Jun 01 11:41:18 fabricnode1 systemd[1]: Stopped kubelet: The Kubernetes Node Agent. Jun 01 11:41:18 fabricnode1 systemd[1]: Started kubelet: The Kubernetes Node Agent. Jun 01 11:41:18 fabricnode1 kubelet[11402]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information. Jun 01 11:41:18 fabricnode1 kubelet[11402]: Flag --resolv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information. Jun 01 11:41:18 fabricnode1 kubelet[11402]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information. Jun 01 11:41:18 fabricnode1 kubelet[11402]: Flag --resolv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information. Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.539127 11402 server.go:417] Version: v1.18.3 Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.539346 11402 plugins.go:100] No cloud provider specified. Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.539374 11402 server.go:837] Client rotation is on, will bootstrap in background Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.546268 11402 certificate_store.go:130] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-current.pem". Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.578290 11402 server.go:646] --cgroups-per-qos enabled, but --cgroup-root was not specified. 
defaulting to / Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.578588 11402 container_manager_linux.go:266] container manager verified user specified cgroup-root exists: [] Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.578597 11402 container_manager_linux.go:271] Creating Container Manager object based on Node Config: {RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: ContainerRuntime:docker CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:cgroupfs KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:memory.available Operator:LessThan Value:{Quantity:100Mi Percentage:0} GracePeriod:0s MinReclaim:} {Signal:nodefs.available Operator:LessThan Value:{Quantity: Percentage:0.1} GracePeriod:0s MinReclaim: } {Signal:nodefs.inodesFree Operator:LessThan Value:{Quantity: Percentage:0.05} GracePeriod:0s MinReclaim: } {Signal:imagefs.available Operator:LessThan Value:{Quantity: Percentage:0.15} GracePeriod:0s MinReclaim: }]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalCPUManagerReconcilePeriod:10s ExperimentalPodPidsLimit:-1 EnforceCPULimits:true CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none} Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.578699 11402 topology_manager.go:126] [topologymanager] Creating topology manager with none policy Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.578708 11402 container_manager_linux.go:301] [topologymanager] Initializing Topology Manager with none policy Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.578712 11402 container_manager_linux.go:306] Creating device plugin manager: true Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.578775 11402 client.go:75] Connecting to docker on unix:///var/run/docker.sock Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.578782 11402 client.go:92] Start docker client with request timeout=2m0s Jun 01 11:41:18 fabricnode1 kubelet[11402]: W0601 11:41:18.582204 11402 docker_service.go:561] Hairpin mode set to "promiscuous-bridge" but kubenet is not enabled, falling back to "hairpin-veth" Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.582219 11402 docker_service.go:238] Hairpin mode set to "hairpin-veth" Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.604484 11402 docker_service.go:253] Docker cri networking managed by cni Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.608821 11402 docker_service.go:258] Docker Info: &{ID:3OTC:YDMK:IDO4:VWW2:2UWQ:ZEX5:3NBQ:7BWF:GVZD:R7GT:EUTB:OMVB Containers:16 ContainersRunning:6 ContainersPaused:0 ContainersStopped:10 Images:7 Driver:overlay2 DriverStatus:[[Backing Filesystem extfs] [Supports d_type true] [Native Overlay Diff true]] SystemStatus:[] Plugins:{Volume:[local] Network:[bridge host ipvlan macvlan null overlay] Authorization:[] Log:[awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog]} MemoryLimit:true SwapLimit:true KernelMemory:true KernelMemoryTCP:true CPUCfsPeriod:true CPUCfsQuota:true CPUShares:true CPUSet:true PidsLimit:true IPv4Forwarding:true BridgeNfIptables:true BridgeNfIP6tables:true Debug:false NFd:54 OomKillDisable:true NGoroutines:60 SystemTime:2020-06-01T11:41:18.605245687-06:00 LoggingDriver:json-file CgroupDriver:cgroupfs NEventsListener:0 KernelVersion:5.3.0-53-generic 
OperatingSystem:Ubuntu 18.04.2 LTS OSType:linux Architecture:x86_64 IndexServerAddress:https://index.docker.io/v1/ RegistryConfig:0xc0002ba000 NCPU:12 MemTotal:8181821440 GenericResources:[] DockerRootDir:/var/lib/docker HTTPProxy: HTTPSProxy: NoProxy: Name:fabricnode1 Labels:[] ExperimentalBuild:false ServerVersion:19.03.6 ClusterStore: ClusterAdvertise: Runtimes:map[nvidia:{Path:/usr/bin/nvidia-container-runtime Args:[]} runc:{Path:runc Args:[]}] DefaultRuntime:nvidia Swarm:{NodeID: NodeAddr: LocalNodeState:inactive ControlAvailable:false Error: RemoteManagers:[] Nodes:0 Managers:0 Cluster: Warnings:[]} LiveRestoreEnabled:false Isolation: InitBinary:docker-init ContainerdCommit:{ID: Expected:} RuncCommit:{ID: Expected:} InitCommit:{ID: Expected:} SecurityOptions:[name=apparmor name=seccomp,profile=default] ProductLicense: Warnings:[]} Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.608884 11402 docker_service.go:271] Setting cgroupDriver to cgroupfs Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.614270 11402 remote_runtime.go:59] parsed scheme: "" Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.614282 11402 remote_runtime.go:59] scheme "" not registered, fallback to default scheme Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.614299 11402 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/run/dockershim.sock 0 }] } Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.614304 11402 clientconn.go:933] ClientConn switching balancer to "pick_first" Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.614333 11402 remote_image.go:50] parsed scheme: "" Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.614338 11402 remote_image.go:50] scheme "" not registered, fallback to default scheme Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.614343 11402 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/run/dockershim.sock 0 }] } Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.614346 11402 clientconn.go:933] ClientConn switching balancer to "pick_first" Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.614366 11402 kubelet.go:292] Adding pod path: /etc/kubernetes/manifests Jun 01 11:41:18 fabricnode1 kubelet[11402]: I0601 11:41:18.614381 11402 kubelet.go:317] Watching apiserver Jun 01 11:41:24 fabricnode1 kubelet[11402]: E0601 11:41:24.752088 11402 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated. Jun 01 11:41:24 fabricnode1 kubelet[11402]: For verbose messaging see aws.Config.CredentialsChainVerboseErrors Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.774115 11402 kuberuntime_manager.go:211] Container runtime docker initialized, version: 19.03.6, apiVersion: 1.40.0 Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.774480 11402 server.go:1125] Started kubelet Jun 01 11:41:24 fabricnode1 kubelet[11402]: E0601 11:41:24.774504 11402 kubelet.go:1305] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.774542 11402 server.go:145] Starting to listen on 0.0.0.0:10250 Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.775231 11402 server.go:393] Adding debug handlers to kubelet server. 
Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.775240 11402 fs_resource_analyzer.go:64] Starting FS ResourceAnalyzer Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.775273 11402 volume_manager.go:265] Starting Kubelet Volume Manager Jun 01 11:41:24 fabricnode1 kubelet[11402]: W0601 11:41:24.775302 11402 oomparser.go:150] error reading /dev/kmsg: read /dev/kmsg: broken pipe Jun 01 11:41:24 fabricnode1 kubelet[11402]: E0601 11:41:24.775325 11402 oomparser.go:126] exiting analyzeLines. OOM events will not be reported. Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.775368 11402 desired_state_of_world_populator.go:139] Desired state populator starts to run Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.780426 11402 clientconn.go:106] parsed scheme: "unix" Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.780443 11402 clientconn.go:106] scheme "unix" not registered, fallback to default scheme Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.780512 11402 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock 0 }] } Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.780521 11402 clientconn.go:933] ClientConn switching balancer to "pick_first" Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.786833 11402 status_manager.go:158] Starting to sync pod status with apiserver Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.786855 11402 kubelet.go:1821] Starting kubelet main sync loop. Jun 01 11:41:24 fabricnode1 kubelet[11402]: E0601 11:41:24.786885 11402 kubelet.go:1845] skipping pod synchronization - [container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful] Jun 01 11:41:24 fabricnode1 kubelet[11402]: W0601 11:41:24.794379 11402 docker_sandbox.go:400] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "nvidia-device-plugin-daemonset-ndv85_kube-system": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "a6b7227fc168e0790ae61ea2a9407fe916e5e92b5b23d0106b40f3e60474d486" Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.846340 11402 cpu_manager.go:184] [cpumanager] starting with none policy Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.846350 11402 cpu_manager.go:185] [cpumanager] reconciling every 10s Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.846359 11402 state_mem.go:36] [cpumanager] initializing new in-memory state store Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.846477 11402 state_mem.go:88] [cpumanager] updated default cpuset: "" Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.846482 11402 state_mem.go:96] [cpumanager] updated cpuset assignments: "map[]" Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.846488 11402 policy_none.go:43] [cpumanager] none policy: Start Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.847218 11402 plugin_manager.go:114] Starting Kubelet Plugin Manager Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.847969 11402 manager.go:411] Got registration request from device plugin with resource name "nvidia.com/gpu" Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.848046 11402 endpoint.go:179] parsed scheme: "" Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.848055 11402 endpoint.go:179] scheme "" not registered, fallback to default scheme Jun 01 11:41:24 fabricnode1 
kubelet[11402]: I0601 11:41:24.848066 11402 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/device-plugins/nvidia.sock 0 }] } Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.848073 11402 clientconn.go:933] ClientConn switching balancer to "pick_first" Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.875429 11402 kubelet_node_status.go:294] Setting node annotation to enable volume controller attach/detach Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.887027 11402 topology_manager.go:233] [topologymanager] Topology Admit Handler Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.887494 11402 kubelet_node_status.go:70] Attempting to register node fabricnode1 Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.888811 11402 topology_manager.go:233] [topologymanager] Topology Admit Handler Jun 01 11:41:24 fabricnode1 kubelet[11402]: I0601 11:41:24.891518 11402 topology_manager.go:233] [topologymanager] Topology Admit Handler Jun 01 11:41:24 fabricnode1 kubelet[11402]: W0601 11:41:24.892140 11402 pod_container_deletor.go:77] Container "a6b7227fc168e0790ae61ea2a9407fe916e5e92b5b23d0106b40f3e60474d486" not found in pod's containers Jun 01 11:41:24 fabricnode1 kubelet[11402]: W0601 11:41:24.892188 11402 pod_container_deletor.go:77] Container "ab891de41b05beab10602c4eb8eb7fa4903afc0c9abd830a280724baa2b42513" not found in pod's containers Jun 01 11:41:24 fabricnode1 kubelet[11402]: W0601 11:41:24.892349 11402 pod_container_deletor.go:77] Container "0922f56bd4b0fe7014daba26f6b050a47ccbfd340050bded9bb440b3f55845a8" not found in pod's containers Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.075659 11402 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-bthvt" (UniqueName: "kubernetes.io/secret/ad441fc7-d532-4a6c-bfff-4f583f8fc3d1-default-token-bthvt") pod "nvidia-device-plugin-daemonset-ndv85" (UID: "ad441fc7-d532-4a6c-bfff-4f583f8fc3d1") Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.075683 11402 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "cni-bin-dir" (UniqueName: "kubernetes.io/host-path/b66b63b7-5bad-4f35-9a84-187eeba03104-cni-bin-dir") pod "calico-node-rtwq9" (UID: "b66b63b7-5bad-4f35-9a84-187eeba03104") Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.075698 11402 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "policysync" (UniqueName: "kubernetes.io/host-path/b66b63b7-5bad-4f35-9a84-187eeba03104-policysync") pod "calico-node-rtwq9" (UID: "b66b63b7-5bad-4f35-9a84-187eeba03104") Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.075728 11402 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "flexvol-driver-host" (UniqueName: "kubernetes.io/host-path/b66b63b7-5bad-4f35-9a84-187eeba03104-flexvol-driver-host") pod "calico-node-rtwq9" (UID: "b66b63b7-5bad-4f35-9a84-187eeba03104") Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.075756 11402 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "device-plugin" (UniqueName: "kubernetes.io/host-path/ad441fc7-d532-4a6c-bfff-4f583f8fc3d1-device-plugin") pod "nvidia-device-plugin-daemonset-ndv85" (UID: "ad441fc7-d532-4a6c-bfff-4f583f8fc3d1") Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.075777 11402 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "lib-modules" (UniqueName: 
"kubernetes.io/host-path/679c28eb-80ce-4bc3-aaba-9e6db839417a-lib-modules") pod "kube-proxy-btwvv" (UID: "679c28eb-80ce-4bc3-aaba-9e6db839417a") Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.075797 11402 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "kube-proxy-token-64vmx" (UniqueName: "kubernetes.io/secret/679c28eb-80ce-4bc3-aaba-9e6db839417a-kube-proxy-token-64vmx") pod "kube-proxy-btwvv" (UID: "679c28eb-80ce-4bc3-aaba-9e6db839417a") Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.075815 11402 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "var-run-calico" (UniqueName: "kubernetes.io/host-path/b66b63b7-5bad-4f35-9a84-187eeba03104-var-run-calico") pod "calico-node-rtwq9" (UID: "b66b63b7-5bad-4f35-9a84-187eeba03104") Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.075834 11402 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "host-local-net-dir" (UniqueName: "kubernetes.io/host-path/b66b63b7-5bad-4f35-9a84-187eeba03104-host-local-net-dir") pod "calico-node-rtwq9" (UID: "b66b63b7-5bad-4f35-9a84-187eeba03104") Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.075852 11402 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "kube-proxy" (UniqueName: "kubernetes.io/configmap/679c28eb-80ce-4bc3-aaba-9e6db839417a-kube-proxy") pod "kube-proxy-btwvv" (UID: "679c28eb-80ce-4bc3-aaba-9e6db839417a") Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.075872 11402 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "xtables-lock" (UniqueName: "kubernetes.io/host-path/679c28eb-80ce-4bc3-aaba-9e6db839417a-xtables-lock") pod "kube-proxy-btwvv" (UID: "679c28eb-80ce-4bc3-aaba-9e6db839417a") Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.075893 11402 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "var-lib-calico" (UniqueName: "kubernetes.io/host-path/b66b63b7-5bad-4f35-9a84-187eeba03104-var-lib-calico") pod "calico-node-rtwq9" (UID: "b66b63b7-5bad-4f35-9a84-187eeba03104") Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.075911 11402 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "xtables-lock" (UniqueName: "kubernetes.io/host-path/b66b63b7-5bad-4f35-9a84-187eeba03104-xtables-lock") pod "calico-node-rtwq9" (UID: "b66b63b7-5bad-4f35-9a84-187eeba03104") Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.075941 11402 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "lib-modules" (UniqueName: "kubernetes.io/host-path/b66b63b7-5bad-4f35-9a84-187eeba03104-lib-modules") pod "calico-node-rtwq9" (UID: "b66b63b7-5bad-4f35-9a84-187eeba03104") Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.075962 11402 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "cni-net-dir" (UniqueName: "kubernetes.io/host-path/b66b63b7-5bad-4f35-9a84-187eeba03104-cni-net-dir") pod "calico-node-rtwq9" (UID: "b66b63b7-5bad-4f35-9a84-187eeba03104") Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.075982 11402 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "calico-node-token-2bwnv" (UniqueName: "kubernetes.io/secret/b66b63b7-5bad-4f35-9a84-187eeba03104-calico-node-token-2bwnv") pod "calico-node-rtwq9" (UID: "b66b63b7-5bad-4f35-9a84-187eeba03104") Jun 01 11:41:25 
fabricnode1 kubelet[11402]: I0601 11:41:25.075997 11402 reconciler.go:157] Reconciler: start to sync state Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.378188 11402 kubelet_node_status.go:112] Node fabricnode1 was previously registered Jun 01 11:41:25 fabricnode1 kubelet[11402]: I0601 11:41:25.378234 11402 kubelet_node_status.go:73] Successfully registered node fabricnode1
Additional information that might help better understand your environment and reproduce the bug:
- [ ] Docker version from docker version: 19.03.6
- [ ] Docker command, image and tag used. Installed with this command: sudo apt install -y nvidia-docker2
- [ ] Kernel version from uname -a: Linux fabricnode1 5.3.0-53-generic #47~18.04.1-Ubuntu SMP Thu May 7 13:10:50 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- [ ] Any relevant kernel output lines from dmesg
- [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*':
$ dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===========================-==================-==================-============================================================
un libgldispatch0-nvidia (no description available)
ii libnvidia-cfg1-440:amd64 440.59-0ubuntu0.18 amd64 NVIDIA binary OpenGL/GLX configuration library
un libnvidia-cfg1-any (no description available)
un libnvidia-common (no description available)
ii libnvidia-common-440 440.59-0ubuntu0.18 all Shared files used by the NVIDIA libraries
rc libnvidia-compute-435:amd64 435.21-0ubuntu0.18 amd64 NVIDIA libcompute package
ii libnvidia-compute-440:amd64 440.59-0ubuntu0.18 amd64 NVIDIA libcompute package
ii libnvidia-compute-440:i386 440.59-0ubuntu0.18 i386 NVIDIA libcompute package
ii libnvidia-container-tools 1.1.1-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.1.1-1 amd64 NVIDIA container runtime library
un libnvidia-decode (no description available)
ii libnvidia-decode-440:amd64 440.59-0ubuntu0.18 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-decode-440:i386 440.59-0ubuntu0.18 i386 NVIDIA Video Decoding runtime libraries
un libnvidia-encode (no description available)
ii libnvidia-encode-440:amd64 440.59-0ubuntu0.18 amd64 NVENC Video Encoding runtime library
ii libnvidia-encode-440:i386 440.59-0ubuntu0.18 i386 NVENC Video Encoding runtime library
un libnvidia-fbc1 (no description available)
ii libnvidia-fbc1-440:amd64 440.59-0ubuntu0.18 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-fbc1-440:i386 440.59-0ubuntu0.18 i386 NVIDIA OpenGL-based Framebuffer Capture runtime library
un libnvidia-gl (no description available)
ii libnvidia-gl-440:amd64 440.59-0ubuntu0.18 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-gl-440:i386 440.59-0ubuntu0.18 i386 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un libnvidia-ifr1 (no description available)
ii libnvidia-ifr1-440:amd64 440.59-0ubuntu0.18 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library
ii libnvidia-ifr1-440:i386 440.59-0ubuntu0.18 i386 NVIDIA OpenGL-based Inband Frame Readback runtime library
un libnvidia-ml1 (no description available)
un nvidia-304 (no description available)
un nvidia-340 (no description available)
un nvidia-384 (no description available)
un nvidia-390 (no description available)
un nvidia-common (no description available)
ii nvidia-compute-utils-440 440.59-0ubuntu0.18 amd64 NVIDIA compute utilities
ii nvidia-container-runtime 3.2.0-1 amd64 NVIDIA container runtime
un nvidia-container-runtime-ho (no description available)
ii nvidia-container-toolkit 1.1.1-1 amd64 NVIDIA container runtime hook
ii nvidia-cuda-dev 9.1.85-3ubuntu1 amd64 NVIDIA CUDA development files
ii nvidia-cuda-doc 9.1.85-3ubuntu1 all NVIDIA CUDA and OpenCL documentation
ii nvidia-cuda-gdb 9.1.85-3ubuntu1 amd64 NVIDIA CUDA Debugger (GDB)
ii nvidia-cuda-toolkit 9.1.85-3ubuntu1 amd64 NVIDIA CUDA development toolkit
ii nvidia-dkms-440 440.59-0ubuntu0.18 amd64 NVIDIA DKMS package
un nvidia-dkms-kernel (no description available)
un nvidia-docker (no description available)
ii nvidia-docker2 2.3.0-1 all nvidia-docker CLI wrapper
un nvidia-driver (no description available)
ii nvidia-driver-440 440.59-0ubuntu0.18 amd64 NVIDIA driver metapackage
un nvidia-driver-binary (no description available)
un nvidia-kernel-common (no description available)
ii nvidia-kernel-common-440 440.59-0ubuntu0.18 amd64 Shared files used with the kernel module
un nvidia-kernel-source (no description available)
ii nvidia-kernel-source-440 440.59-0ubuntu0.18 amd64 NVIDIA kernel source package
un nvidia-legacy-340xx-vdpau-d (no description available)
un nvidia-libopencl1 (no description available)
un nvidia-libopencl1-dev (no description available)
ii nvidia-opencl-dev:amd64 9.1.85-3ubuntu1 amd64 NVIDIA OpenCL development files
un nvidia-opencl-icd (no description available)
un nvidia-persistenced (no description available)
ii nvidia-prime 0.8.8.2 all Tools to enable NVIDIA's Prime
ii nvidia-profiler 9.1.85-3ubuntu1 amd64 NVIDIA Profiler for CUDA and OpenCL
ii nvidia-settings 440.44-0ubuntu0.18 amd64 Tool for configuring the NVIDIA graphics driver
un nvidia-settings-binary (no description available)
un nvidia-smi (no description available)
un nvidia-utils (no description available)
ii nvidia-utils-440 440.59-0ubuntu0.18 amd64 NVIDIA driver support binaries
un nvidia-vdpau-driver (no description available)
ii nvidia-visual-profiler 9.1.85-3ubuntu1 amd64 NVIDIA Visual Profiler for CUDA and OpenCL
ii xserver-xorg-video-nvidia-4 440.59-0ubuntu0.18 amd64 NVIDIA binary Xorg driver
- [ ] NVIDIA container library version from nvidia-container-cli -V:
$ nvidia-container-cli -V
version: 1.1.1
build date: 2020-05-19T15:15+00:00
build revision: e5d6156aba457559979597c8e3d22c5d8d0622db
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
- [ ] NVIDIA container library logs (see troubleshooting)
The container has never run (stuck in pending state)
Can you try removing the file:
/var/lib/kubelet/device-plugins/nvidia.sock
This socket file is created dynamically by the plugin (and removed automatically if the plugin terminates gracefully). It should probably also be removed automatically in cases where the plugin is not terminated gracefully, but it doesn't do that at the moment.
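To illustrate what removing it automatically could look like, here is a minimal Go sketch (not the plugin's actual code; the hardcoded socket path is just for the example): delete any stale socket before binding, so a previous unclean shutdown cannot cause the "bind: address already in use" failure seen above.

package main

import (
    "log"
    "net"
    "os"
)

const socketPath = "/var/lib/kubelet/device-plugins/nvidia.sock"

func main() {
    // Remove a stale socket left behind by an unclean shutdown;
    // a missing file is fine, so ignore "not exist" errors.
    if err := os.Remove(socketPath); err != nil && !os.IsNotExist(err) {
        log.Fatalf("could not remove stale socket: %v", err)
    }

    // With the stale file gone, the unix bind no longer fails with
    // "address already in use".
    lis, err := net.Listen("unix", socketPath)
    if err != nil {
        log.Fatalf("listen unix %s: %v", socketPath, err)
    }
    defer lis.Close()

    // A real plugin would now register with the kubelet and serve the
    // device plugin gRPC API on this listener.
}

Until the plugin does something like this itself, a stale nvidia.sock has to be removed by hand.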
That file doesn't exist (on nodes or master) so it looks like the plugin removed it:
$ sudo ls -l /var/lib/kubelet/device-plugins/
total 4
-rw-r--r-- 1 root root 0 Jun 8 08:54 DEPRECATION
-rw------- 1 root root 361 Jun 8 09:26 kubelet_internal_checkpoint
srwxr-xr-x 1 root root 0 Jun 8 08:54 kubelet.sock
If that's the case, then this error message makes no sense:
2020/06/01 17:19:46 Could not start device plugin for 'nvidia.com/gpu': listen unix /var/lib/kubelet/device-plugins/nvidia.sock: bind: address already in use
Can you make sure that this file is not present, then restart the plugin and see if you still get the error message?
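Concretely, on the node where the plugin failed, the sequence would look something like this (the docker run command is the same one from the Run Locally section above):

sudo rm -f /var/lib/kubelet/device-plugins/nvidia.sock
docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.0.0-beta6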
Is this still an issue for you?
I think we solved this in the discussion of issue #176. Creating / deleting the daemonset creates / deletes the /var/lib/kubelet/device-plugins/nvidia.sock file. The Run Locally instructions probably require that you haven't created the daemonset first.
The readme should be clearer about this: if you follow the Quick Start you don't have to do the Build sections, and if you want to Build you shouldn't create the daemonset first.
Pods still stay in the pending state, though.
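Two things worth checking here: the daemonset from the Quick Start can be removed before trying Run Locally again (the daemonset name is inferred from the pod name nvidia-device-plugin-daemonset-ndv85 in the kubelet log), and the Pending reason is usually visible in the pod's events and in the node's advertised resources (gpu-pod is an assumed placeholder for whatever the sample .yaml names the pod):

kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system
kubectl describe pod gpu-pod
kubectl describe node fabricnode1 | grep -i 'nvidia.com/gpu'

If nvidia.com/gpu does not appear under the node's Capacity and Allocatable, the device plugin never registered with the kubelet, and the scheduler has nothing on which to place the GPU pod.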
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.