
ACS kubernetes hyperkube flaky

Open guesslin opened this issue 7 years ago • 15 comments

There are lots of hyperkube failures while the ACS cluster is running.

[screenshot of the failing hyperkube containers]

I tried to retrieve logs from one of the kube-controller-manager instances.

E0509 22:02:57.817269       1 leaderelection.go:228] error retrieving resource lock kube-system/kube-controller-manager: client: etcd cluster is unavailable or misconfigured
E0509 22:53:05.605381       1 leaderelection.go:228] error retrieving resource lock kube-system/kube-controller-manager: client: etcd cluster is unavailable or misconfigured
E0511 01:39:12.416926       1 leaderelection.go:228] error retrieving resource lock kube-system/kube-controller-manager: client: etcd cluster is unavailable or misconfigured
E0512 07:41:17.543209       1 leaderelection.go:228] error retrieving resource lock kube-system/kube-controller-manager: client: etcd cluster is unavailable or misconfigured

Any solution for this? It looks like etcd may be the culprit behind all of these events.
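For reference, a couple of quick checks can narrow down whether etcd itself is unreachable or only the leader-election lock is failing. This is just a sketch: it assumes curl and kubectl are available on a master and that etcd is listening on the default client port 2379.

$ curl -s http://127.0.0.1:2379/health                                  # etcd v2 health endpoint; a healthy member typically returns {"health": "true"}
$ kubectl -n kube-system get endpoints kube-controller-manager -o yaml  # the resource lock named in the errors above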

guesslin avatar May 12 '17 09:05 guesslin

@amanohar would your etcd changes in ACS engine fix this? If not, we will need to look into what is misconfigured.

JackQuincy avatar May 12 '17 16:05 JackQuincy

I talked with @amanohar offline and her changes should be orthogonal to this issue. Curious: did you reach in and delete or change the masters, or was this on the original startup? Could you share your resource group and cluster resource name? I want to look at logs on my side to try to figure out why this happened.

JackQuincy avatar May 12 '17 18:05 JackQuincy

@JackQuincy I did reach in and change the /etc/systemd/system/docker.service.d/overlay.conf config on the masters, but later changed it back to the original startup config.

resource group name: production
cluster resource name: production

guesslin avatar May 15 '17 03:05 guesslin

So the cluster looks good from our side (where we make updates to the cluster; we don't have a health check at this time). My guess is that changing the config somehow managed to break the etcd cluster. The easiest way to get back up and running is probably to deploy a new cluster and just move your workloads over to that. If you need to recover this cluster, I'd look for an etcd troubleshooting guide (TSG).
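As a starting point, the usual first checks against an etcd 2.x cluster look something like the following. This is only a sketch; it assumes etcdctl is on the PATH of a master node.

$ etcdctl cluster-health   # reports each member's health plus the overall cluster state
$ etcdctl member list      # confirms that all of the masters are still registered as members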

JackQuincy avatar May 15 '17 18:05 JackQuincy

@JackQuincy FYI, I followed this document to configure the systemd Docker config: https://github.com/Microsoft/OMS-docker/blob/master/OlderVersionREADME.md#setting-up
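For context, a change like that is usually made with a systemd drop-in that overrides the Docker daemon's start command. The snippet below is only a hypothetical illustration of the shape of such an override; the exact flags and the fluentd address are assumptions, not copied from the linked document or from this cluster.

# hypothetical contents of a drop-in under /etc/systemd/system/docker.service.d/
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --log-driver=fluentd --log-opt fluentd-address=localhost:25225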

guesslin avatar May 16 '17 09:05 guesslin

Interesting. I know we are running OMS on clusters internally. I'm at a conference today and tomorrow, so I'm not going to have much time to look at this until Thursday, but I will take a look and come back with some suggestions.

JackQuincy avatar May 16 '17 13:05 JackQuincy

After migrating to a new cluster (and keeping the Docker config untouched), the new cluster keeps showing high CPU usage, with the following syslog messages.

May 31 06:28:05 k8s-master-1B2AB7CF-0 docker[5594]: E0531 06:28:05.571248    6086 fsHandler.go:121] failed to collect filesystem stats - rootDiskErr: du command failed on /var/lib/docker/overlay/aec50304ba5199da9ab1d23828e8a1bace0ff6d30322d765cdafef9f57203425 with output stdout: 674660#011/var/lib/docker/overlay/aec50304ba5199da9ab1d23828e8a1bace0ff6d30322d765cdafef9f57203425
May 31 06:28:05 k8s-master-1B2AB7CF-0 docker[5594]: , stderr: du: cannot access '/var/lib/docker/overlay/aec50304ba5199da9ab1d23828e8a1bace0ff6d30322d765cdafef9f57203425/merged/proc/21453': No such file or directory
May 31 06:28:05 k8s-master-1B2AB7CF-0 docker[5594]: du: cannot access '/var/lib/docker/overlay/aec50304ba5199da9ab1d23828e8a1bace0ff6d30322d765cdafef9f57203425/merged/proc/22183/task/22183/fd/4': No such file or directory
May 31 06:28:05 k8s-master-1B2AB7CF-0 docker[5594]: du: cannot access '/var/lib/docker/overlay/aec50304ba5199da9ab1d23828e8a1bace0ff6d30322d765cdafef9f57203425/merged/proc/22183/task/22183/fdinfo/4': No such file or directory
May 31 06:28:05 k8s-master-1B2AB7CF-0 docker[5594]: du: cannot access '/var/lib/docker/overlay/aec50304ba5199da9ab1d23828e8a1bace0ff6d30322d765cdafef9f57203425/merged/proc/22183/fd/3': No such file or directory
May 31 06:28:05 k8s-master-1B2AB7CF-0 docker[5594]: du: cannot access '/var/lib/docker/overlay/aec50304ba5199da9ab1d23828e8a1bace0ff6d30322d765cdafef9f57203425/merged/proc/22183/fdinfo/3': No such file or directory
May 31 06:28:05 k8s-master-1B2AB7CF-0 docker[5594]:  - exit status 1, rootInodeErr: <nil>, extraDiskErr: <nil>

and

6086 conversion.go:134] failed to handle multiple devices for container. Skipping Filesystem stats

I found a similar problem (https://github.com/kubernetes-incubator/kargo/issues/1187) and checked the etcd version on the master:

$ etcd --version
etcd Version: 2.2.5
Git SHA: Not provided (use ./build instead of go build)
Go Version: go1.6.2
Go OS/Arch: linux/amd64

Any suggestions?
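One way to confirm whether these filesystem-stats collections are what is burning CPU is to look for du processes walking the Docker overlay directories. A rough sketch using plain ps/pgrep, nothing cluster-specific:

$ ps -eo pcpu,pid,args --sort=-pcpu | head -n 15   # check whether dockerd and du dominate CPU
$ pgrep -fa 'du .*overlay'                         # long-running du walks under /var/lib/docker/overlay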

guesslin avatar May 31 '17 09:05 guesslin

@weinong any idea about the logging and why these things wouldn't be working? I'm guessing the second one is because they aren't using json-file logging, so those files don't exist. But how have you done this on ACS in the past?

JackQuincy avatar May 31 '17 20:05 JackQuincy

@JackQuincy I am the manager of @guesslin, and frankly I am quite concerned about the situation. I guess you understand that we cannot set up a new cluster every two weeks.

The number of CPU alerts is increasing, and I wonder how long we can keep running. We didn't modify the cluster, and from my understanding ACS is meant as a service, right? We just want to run our containers, not deal with the administration of Kubernetes.

What do you suggest we do?

marcusschiesser avatar Jun 01 '17 03:06 marcusschiesser

I honestly am not an expert here. I know @weinong has set something like this up before. @seanknox any suggestions on what might be going on here and how to fix it?

JackQuincy avatar Jun 01 '17 05:06 JackQuincy

@guesslin @marcusschiesser How did you provision the Kubernetes cluster? Are you using acs-engine (the tool provided by https://github.com/Azure/acs-engine) to build them?

seanknox avatar Jun 01 '17 05:06 seanknox

We've been using:

az acs create --orchestrator-type=kubernetes -g production -n prod -l eastus --master-count 3 --agent-count 4 --ssh-key-value ~/.ssh/azure_prod.pub


marcusschiesser avatar Jun 01 '17 05:06 marcusschiesser

These errors appear unrelated to Kubernetes functioning:

May 31 06:28:05 k8s-master-1B2AB7CF-0 docker[5594]: E0531 06:28:05.571248    6086 fsHandler.go:121] failed to collect filesystem stats - rootDiskErr: du command failed on /var/lib/docker/overlay/aec50304ba5199da9ab1d23828e8a1bace0ff6d30322d765cdafef9f57203425 with output stdout: 674660#011/var/lib/docker/overlay/aec50304ba5199da9ab1d23828e8a1bace0ff6d30322d765cdafef9f57203425
May 31 06:28:05 k8s-master-1B2AB7CF-0 docker[5594]: , stderr: du: cannot access '/var/lib/docker/overlay/aec50304ba5199da9ab1d23828e8a1bace0ff6d30322d765cdafef9f57203425/merged/proc/21453': No such file or directory

Basically, the Docker daemon appears to be trying to access a container that no longer exists. My guess is there is a stats process (probably heapster) running in the background causing this. In any case, I believe it's harmless.
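As a quick sanity check, it is easy to see whether such a stats collector is actually running in the cluster. A minimal sketch, assuming kubectl access from a workstation or a master:

$ kubectl -n kube-system get pods -o wide | grep -i heapster   # lists the usual background stats collector, if deployed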

On a cluster, run etcdctl cluster-health to check the health of your etcd cluster. FWIW, I created a new cluster with 3 masters using az acs and don't see anything out of the ordinary.

seanknox avatar Jun 01 '17 06:06 seanknox

@seanknox thanks. Good to know that this is a non-issue. I checked the health of the etcd cluster; it returns cluster is healthy, so that is fine, too.

Our main problem still exists though:

We set up a fresh cluster using az acs. After around two weeks, the dockerd process on the k8s-master nodes periodically takes >90% CPU: [screenshot of dockerd CPU usage on a master node]

This happens without any increase in load on the cluster. We only found the error messages above, so we thought they were related.

marcusschiesser avatar Jun 01 '17 08:06 marcusschiesser

@guesslin @marcusschiesser I was talking to @anhowe and he reminded me that @lachie83 has worked with some customers that hit this issue. @lachie83 Do you remember what the resolution was when customers were seeing issues with:

etcd cluster is unavailable or misconfigured?

JackQuincy avatar Jun 01 '17 22:06 JackQuincy