cri-containerd doesn't successfully reboot kubelet
Description
Steps to reproduce the issue: start the master, wait until all pods are started, then poweroff and restart. In roughly fewer than 10% of attempts it reboots successfully.
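For concreteness, a rough sketch of that cycle as shell commands (kubeadm-init.sh and poweroff are the commands mentioned later in this thread; the kubectl check is my assumption about how the pods are verified):
# on the freshly booted master: initialise, then wait for all pods to be Running
kubeadm-init.sh
kubectl get pods --all-namespaces    # repeat until every pod shows Running (assumes kubectl is set up)
poweroff
# boot the same persistent disk again and check whether kubelet came back (see the logs below)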
Describe the results you received: kubelet log
kubelet.sh: kubelet already configured
kubelet.sh: waiting for /etc/kubernetes/kubelet.conf
kubelet.sh: /etc/kubernetes/kubelet.conf has arrived
I1213 19:31:02.058981 572 feature_gate.go:156] feature gates: map[]
I1213 19:31:02.059026 572 controller.go:114] kubelet config controller: starting controller
I1213 19:31:02.059031 572 controller.go:118] kubelet config controller: validating combination of defaults and flags
I1213 19:31:02.075793 572 feature_gate.go:156] feature gates: map[]
W1213 19:31:02.075912 572 server.go:289] --cloud-provider=auto-detect is deprecated. The desired cloud provider should be set explicitly
I1213 19:31:02.079398 572 manager.go:149] cAdvisor running in container: "/sys/fs/cgroup/cpu"
W1213 19:31:02.080964 572 manager.go:153] Unable to connect to Docker: Cannot connect to the Docker daemon. Is the docker daemon running on this host?
W1213 19:31:02.082506 572 manager.go:157] unable to connect to Rkt api service: rkt: cannot tcp Dial rkt api service: dial tcp [::1]:15441: getsockopt: connection refused
W1213 19:31:02.082564 572 manager.go:166] unable to connect to CRI-O api service: Get http://%2Fvar%2Frun%2Fcrio.sock/info: dial unix /var/run/crio.sock: connect: no such file or directory
I1213 19:31:02.082950 572 fs.go:139] Filesystem UUIDs: map[d358f4da-8808-4f6c-8b8a-fcc647914e7d:/dev/sda1]
I1213 19:31:02.082962 572 fs.go:140] Filesystem partitions: map[shm:{mountpoint:/dev/shm major:0 minor:18 fsType:tmpfs blockSize:0} tmpfs:{mountpoint:/run major:0 minor:15 fsType:tmpfs blockSize:0} /dev/sda1:{mountpoint:/var/lib major:8 minor:1 fsType:ext4 blockSize:0}]
I1213 19:31:02.091136 572 info.go:51] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"
I1213 19:31:02.093083 572 manager.go:216] Machine: {NumCores:1 CpuFrequency:3400022 MemoryCapacity:4574240768 HugePages:[{PageSize:2048 NumPages:0}] MachineID: SystemUUID:D740F169-85BB-4FD2-9F1E-A81CED65D3FD BootID:799bdd7f-4d34-4d88-9189-c376c589bf85 Filesystems:[{Device:shm DeviceMajor:0 DeviceMinor:18 Capacity:2287120384 Type:vfs Inodes:558379 HasInodes:true} {Device:tmpfs DeviceMajor:0 DeviceMinor:15 Capacity:457424896 Type:vfs Inodes:558379 HasInodes:true} {Device:/dev/sda1 DeviceMajor:8 DeviceMinor:1 Capacity:8386961408 Type:vfs Inodes:524288 HasInodes:true} {Device:overlay DeviceMajor:0 DeviceMinor:49 Capacity:2287120384 Type:vfs Inodes:558379 HasInodes:true}] DiskMap:map[43:0:{Name:nbd0 Major:43 Minor:0 Size:0 Scheduler:none} 43:3:{Name:nbd3 Major:43 Minor:3 Size:0 Scheduler:none} 43:5:{Name:nbd5 Major:43 Minor:5 Size:0 Scheduler:none} 43:9:{Name:nbd9 Major:43 Minor:9 Size:0 Scheduler:none} 8:0:{Name:sda Major:8 Minor:0 Size:8589934592 Scheduler:deadline} 43:15:{Name:nbd15 Major:43 Minor:15 Size:0 Scheduler:none} 43:2:{Name:nbd2 Major:43 Minor:2 Size:0 Scheduler:none} 43:7:{Name:nbd7 Major:43 Minor:7 Size:0 Scheduler:none} 43:8:{Name:nbd8 Major:43 Minor:8 Size:0 Scheduler:none} 43:10:{Name:nbd10 Major:43 Minor:10 Size:0 Scheduler:none} 43:4:{Name:nbd4 Major:43 Minor:4 Size:0 Scheduler:none} 43:12:{Name:nbd12 Major:43 Minor:12 Size:0 Scheduler:none} 43:13:{Name:nbd13 Major:43 Minor:13 Size:0 Scheduler:none} 43:14:{Name:nbd14 Major:43 Minor:14 Size:0 Scheduler:none} 43:6:{Name:nbd6 Major:43 Minor:6 Size:0 Scheduler:none} 43:1:{Name:nbd1 Major:43 Minor:1 Size:0 Scheduler:none} 43:11:{Name:nbd11 Major:43 Minor:11 Size:0 Scheduler:none}] NetworkDevices:[{Name:eth0 MacAddress:82:3f:cf:43:6d:fc Speed:-1 Mtu:1500} {Name:ip6tnl0 MacAddress:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00 Speed:0 Mtu:1452} {Name:tunl0 MacAddress:00:00:00:00 Speed:0 Mtu:1480}] Topology:[{Id:0 Memory:0 Cores:[{Id:0 Threads:[0] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:4194304 Type:Unified Level:2}]}] Caches:[{Size:16777216 Type:Unified Level:3}]}] CloudProvider:Unknown InstanceType:Unknown InstanceID:None}
I1213 19:31:02.099803 572 manager.go:222] Version: {KernelVersion:4.9.62-linuxkit ContainerOsVersion:LinuxKit Kubernetes Project DockerVersion:Unknown DockerAPIVersion:Unknown CadvisorVersion: CadvisorRevision:}
I1213 19:31:02.122241 572 container_manager_linux.go:257] Creating Container Manager object based on Node Config: {RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: ContainerRuntime:remote CgroupsPerQOS:false CgroupRoot: CgroupDriver:cgroupfs ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: EnforceNodeAllocatable:map[] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:memory.available Operator:LessThan Value:{Quantity:100Mi Percentage:0} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.1} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.inodesFree Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>}]} ExperimentalQOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalCPUManagerReconcilePeriod:10s}
I1213 19:31:02.124036 572 container_manager_linux.go:288] Creating device plugin handler: false
I1213 19:31:02.124167 572 kubelet.go:273] Adding manifest file: /etc/kubernetes/manifests
I1213 19:31:02.124211 572 kubelet.go:283] Watching apiserver
E1213 19:31:02.127909 572 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:422: Failed to list *v1.Node: Get https://10.10.10.127:6443/api/v1/nodes?fieldSelector=metadata.name%3Dlinuxkit-823fcf436dfc&resourceVersion=0: dial tcp 10.10.10.127:6443: getsockopt: connection refused
E1213 19:31:02.167876 572 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:413: Failed to list *v1.Service: Get https://10.10.10.127:6443/api/v1/services?resourceVersion=0: dial tcp 10.10.10.127:6443: getsockopt: connection refused
E1213 19:31:02.167939 572 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.10.10.127:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dlinuxkit-823fcf436dfc&resourceVersion=0: dial tcp 10.10.10.127:6443: getsockopt: connection refused
W1213 19:31:02.260304 572 kubelet_network.go:62] Hairpin mode set to "promiscuous-bridge" but container runtime is "remote", ignoring
I1213 19:31:02.263340 572 kubelet.go:517] Hairpin mode set to "none"
I1213 19:31:02.263547 572 remote_runtime.go:43] Connecting to runtime service unix:///var/run/cri-containerd.sock
2017/12/13 19:31:02 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix /var/run/cri-containerd.sock: connect: no such file or directory"; Reconnecting to {/var/run/cri-containerd.sock <nil>}
E1213 19:31:02.282627 572 remote_runtime.go:69] Version from runtime service failed: rpc error: code = Unavailable desc = grpc: the connection is unavailable
E1213 19:31:02.282857 572 kuberuntime_manager.go:160] Get runtime version failed: rpc error: code = Unavailable desc = grpc: the connection is unavailable
error: failed to run Kubelet: failed to create kubelet: rpc error: code = Unavailable desc = grpc: the connection is unavailable
cri-containerd
I1213 19:31:00.959698 535 cri_containerd.go:100] Run cri-containerd &{Config:{ContainerdConfig:{RootDir:/var/lib/containerd Snapshotter:overlayfs Endpoint:/run/containerd/containerd.sock Runtime:io.containerd.runtime.v1.linux RuntimeEngine: RuntimeRoot:} CniConfig:{NetworkPluginBinDir:/var/lib/cni/opt/bin NetworkPluginConfDir:/var/lib/cni/etc/net.d} SocketPath:/var/run/cri-containerd.sock RootDir:/var/lib/cri-containerd StreamServerAddress: StreamServerPort:10010 CgroupPath: EnableSelinux:false SandboxImage:gcr.io/google_containers/pause:3.0 StatsCollectPeriod:10 SystemdCgroup:false OOMScore:-999 EnableProfiling:true ProfilingPort:10011 ProfilingAddress:127.0.0.1} ConfigFilePath:/etc/cri-containerd/config.toml}
I1213 19:31:00.961008 535 cri_containerd.go:104] Start profiling server
I1213 19:31:00.961020 535 cri_containerd.go:108] Run cri-containerd grpc server on socket "/var/run/cri-containerd.sock"
I1213 19:31:00.966705 535 service.go:155] Get device uuid "d358f4da-8808-4f6c-8b8a-fcc647914e7d" for image filesystem "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
time="2017-12-13T19:31:00Z" level=info msg="CNI network weave (type=weave-net) is used from /var/lib/cni/etc/net.d/10-weave.conf"
time="2017-12-13T19:31:00Z" level=info msg="CNI network weave (type=weave-net) is used from /var/lib/cni/etc/net.d/10-weave.conf"
I1213 19:31:00.977500 535 service.go:182] Start cri-containerd service
I1213 19:31:00.977526 535 service.go:184] Start recovering state
I1213 19:31:02.346005 535 service.go:190] Start event monitor
I1213 19:31:02.346016 535 service.go:194] Start snapshots syncer
I1213 19:31:02.346028 535 service.go:203] Start streaming server
I1213 19:31:02.346032 535 service.go:214] Start grpc server
I1213 19:31:02.346127 535 events.go:94] TaskExit event &TaskExit{ContainerID:rngd,ID:rngd,Pid:636,ExitStatus:1,ExitedAt:2017-12-13 19:31:01.834075925 +0000 UTC,}
E1213 19:31:02.346164 535 events.go:100] Failed to get container "rngd": does not exist
I1213 19:31:02.347662 535 events.go:94] TaskExit event &TaskExit{ContainerID:kubelet,ID:kubelet,Pid:572,ExitStatus:1,ExitedAt:2017-12-13 19:31:02.300502105 +0000 UTC,}
E1213 19:31:02.347679 535 events.go:100] Failed to get container "kubelet": does not exist
Describe the results you expected: running kubelet
Additional information you deem important (e.g. issue happens only occasionally):
I could reproduce this back to at least f9a2a31
Running ctr t delete kubelet && ctr t start kubelet by hand starts kubelet successfully. The risk that kubelet starts before /var/run/cri-containerd.sock exists seems much higher without kubeadm, i.e. on boots after the first one.
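Spelled out, that manual recovery is roughly as follows (the socket test is an added sanity check of mine; the ctr line is the workaround quoted above):
test -S /var/run/cri-containerd.sock && echo "cri-containerd socket is now present"
ctr t delete kubelet && ctr t start kubelet   # recreate the kubelet task once the socket exists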
Does this still happen with Kube 1.9 (just merged via #33)? I think I saw some improvements in this area.
If it still happens then please open an issue against kubernetes or cri-containerd (whichever seems more appropriate, and assuming there isn't already one), since they should be more robust at startup time. My gut feeling is that it is kubernetes which should be more robust about waiting for CRI rather than vice versa, so a kube issue would seem the way to go.
I just built and booted current master (d39e6ba85fa46a19e77668e436c9e02da8c03850), ran kubeadm-init.sh, waited for the pods to all come up, and then powered off with poweroff. I then booted again with the same persistent disk and again waited for the pods (they did come up). I repeated the poweroff, boot, and check cycle 5 times and was successful each time (compared with your previous <10% success rate). I also did one iteration with poweroff -f and one with reboot for good measure, still with no failures.
So either this is fixed in 1.9.0 or the odds of hitting the problem have changed dramatically, or perhaps something is different in your environment (perhaps just timings).
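For reference, a simplistic version of the "wait for the pods to come up" check between reboots (it assumes kubectl can reach this master; the polling interval is an arbitrary choice of mine):
# loop until no pod in any namespace reports a non-Running status
while kubectl get pods --all-namespaces --no-headers | grep -v Running; do
  sleep 5
done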
I tried restarting the master 3 times with no problems, but could reproduce it after setting up some nodes and pods :/. From what I know, systemd usually handles socket activation.
LinuxKit doesn't use systemd.
As said in https://github.com/linuxkit/kubernetes/issues/26#issuecomment-352443648 I think this is an upstream issue.
Sure, but the upstream implementations use systemd, which does socket activation. I will check whether I can solve it by manually waiting until the socket is there.
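A minimal sketch of that wait, assuming it would sit in the kubelet.sh wrapper next to the existing wait for /etc/kubernetes/kubelet.conf (the 90-second cap is an arbitrary number of mine, not something from this repo):
# block until cri-containerd's socket exists, then let kubelet start as before
sock=/var/run/cri-containerd.sock
n=0
until [ -S "$sock" ] || [ "$n" -ge 90 ]; do
  echo "kubelet.sh: waiting for $sock"
  n=$((n + 1))
  sleep 1
done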