UCP Not showing accurate disk usage
Expected behavior
UCP should give an accurate indication of worker disk usage
Actual behavior
Worker disk appears full despite UCP reporting available space
Information
- Full output of the diagnostics from "docker-diagnose", run from one of the instances:
OK hostname=swarm-manager000000 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-manager000001 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-manager000002 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-worker000000 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-worker000001 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-worker000002 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-worker000003 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-worker000004 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
Done requesting diagnostics.
Your diagnostics session ID is 1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
Please provide this session ID to the maintainer debugging your issue.
Steps to reproduce the behavior
- Spin up a Docker cluster using the beta template from https://github.com/docker/for-azure/issues/38 (worker instances are D3_V2)
- Deploy a number of services (accumulated worker images are about 14GB)
- Service deployments begin to fail with "No such image: <image-name>"
- Verify the image exists in DTR and is pullable
- Log on to a worker and attempt to pull the image (~200MB):
swarm-worker000003:~$ docker pull <image-name>: Pulling from <repo>
6d987f6f4279: Already exists
d0e8a23136b3: Already exists
5ad5b12a980e: Already exists
275352573fee: Pull complete
ffbeb13b7578: Pull complete
027bb24d721d: Pull complete
aa04d7355dfa: Extracting [==================================================>] 45.51MB/45.51MB
failed to register layer: Error processing tar file(exit status 1): mkdir /app/node_modules/@types/lodash/gt: no space left on device
- Check disk space from the worker:
swarm-worker000003:~$ df -h
Filesystem Size Used Available Use% Mounted on
overlay 29.4G 17.5G 10.4G 63% /
tmpfs 6.8G 4.0K 6.8G 0% /dev
tmpfs 6.8G 0 6.8G 0% /sys/fs/cgroup
tmpfs 6.8G 161.4M 6.7G 2% /etc
/dev/sda1 29.4G 17.5G 10.4G 63% /home
tmpfs 6.8G 161.4M 6.7G 2% /mnt
shm 6.8G 0 6.8G 0% /dev/shm
tmpfs 6.8G 161.4M 6.7G 2% /lib/firmware
/dev/sda1 29.4G 17.5G 10.4G 63% /var/log
/dev/sda1 29.4G 17.5G 10.4G 63% /etc/ssh
tmpfs 6.8G 161.4M 6.7G 2% /lib/modules
/dev/sda1 29.4G 17.5G 10.4G 63% /etc/hosts
/dev/sda1 29.4G 17.5G 10.4G 63% /var/etc/hostname
/dev/sda1 29.4G 17.5G 10.4G 63% /etc/resolv.conf
/dev/sda1 29.4G 17.5G 10.4G 63% /var/etc/docker
tmpfs 1.4G 1.3M 1.4G 0% /var/run/docker.sock
/dev/sda1 29.4G 17.5G 10.4G 63% /var/lib/waagent
tmpfs 6.8G 161.4M 6.7G 2% /usr/local/bin/docker
/dev/sdb1 200.0G 119.0M 199.9G 0% /mnt/resource
- Check the UCP dashboard (screenshot: the dashboard still shows free disk space on the worker)
The fact that the disk is full at all with only 14GB of data seems likely related to #19 and #29.
But unlike when we experienced #38, there was no indication from the dashboard (or even from the worker instance itself) that some underlying storage resource was full (see the df output above).
@ddebroy some additional information - we are exhausting inodes, as shown below.
swarm-worker000003:~$ df -i
Filesystem Inodes Used Available Use% Mounted on
overlay 1966080 1960215 5865 100% /
tmpfs 1792091 186 1791905 0% /dev
tmpfs 1792091 15 1792076 0% /sys/fs/cgroup
tmpfs 1792091 1884 1790207 0% /etc
/dev/sda1 1966080 1960215 5865 100% /home
tmpfs 1792091 1884 1790207 0% /mnt
shm 1792091 1 1792090 0% /dev/shm
tmpfs 1792091 1884 1790207 0% /lib/firmware
/dev/sda1 1966080 1960215 5865 100% /var/log
/dev/sda1 1966080 1960215 5865 100% /etc/ssh
tmpfs 1792091 1884 1790207 0% /lib/modules
/dev/sda1 1966080 1960215 5865 100% /etc/hosts
/dev/sda1 1966080 1960215 5865 100% /var/etc/hostname
/dev/sda1 1966080 1960215 5865 100% /etc/resolv.conf
/dev/sda1 1966080 1960215 5865 100% /var/etc/docker
tmpfs 1792091 376 1791715 0% /var/run/docker.sock
/dev/sda1 1966080 1960215 5865 100% /var/lib/waagent
tmpfs 1792091 1884 1790207 0% /usr/local/bin/docker
/dev/sdb1 256 27 229 11% /mnt/resource
Based on https://github.com/moby/moby/issues/10613 we ran docker rmi $(docker images -q --filter "dangling=true"), which brought inode usage down to 21%.
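For anyone else who hits this, the check-and-clean sequence looked roughly like the following (a sketch; docker system prune is an alternative on 17.06 but removes more than just dangling images):
# Check inode usage on the root filesystem backing /var/lib/docker
df -i /
# Remove dangling (untagged) image layers; this is what freed the inodes for us
docker rmi $(docker images -q --filter "dangling=true")
# Broader alternative: also removes stopped containers and unused networks
docker system prune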
Seems like something is off with the VHD used by the template I pointed to earlier: it is not mounting /dev/sdb correctly. Will update with more findings.
Thanks @ddebroy
Update: It turns out the template I referred to earlier, https://download.docker.com/azure/17.06/17.06.2/Docker-DDC.tmpl, points to VHD 1.0.9, which did not incorporate the enhancement to mount /var/lib/docker on the second, larger sdb disk provisioned by Azure. That enhancement is first being rolled out in the CE version and then, depending on how things go, will be rolled out as part of the next EE release: 17.06.2-ee4.
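For what it's worth, a quick way to verify whether a given node is running an image with that enhancement (a sketch; assumes the default data root of /var/lib/docker):
# Ask the daemon where its data root lives
docker info --format '{{ .DockerRootDir }}'
# Check which device backs it; with the enhancement it should be /dev/sdb1 rather than /dev/sda1
df -h /var/lib/docker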
Re-reading the above, it sounds like the df -h output was in sync with what UCP was reporting, but the problem was the inode exhaustion, which docker rmi took care of, correct?
Yes @ddebroy, but we ended up in a bad state between managers and workers, similar to what is described here: https://github.com/docker/swarm/issues/2044
Although we could pull down images after the docker rmi, tasks wouldn't advance past the 'assigned' state, and the workers were logging the following:
Not enough managers yet. We only have 0 and we need 3 to continue.
sleep for a bit, and try again when we wake up.
We tried provisioning a new worker (which failed to connect) and restarting the UCP agent and controller, to no avail.
At this point, we deleted the cluster again and may wait for 17.06.2-ee4. Is there an expected release date?
Hmm, I am not sure of the steps you took, but a worker will never log the message:
Not enough managers yet. We only have 0 and we need 3 to continue.
sleep for a bit, and try again when we wake up.
It is something a new manager logs when it is unable to join the swarm. Sounds like you were trying to bring up new manager nodes? Looking through your diagnostics logs from the initial message, the swarm appears to be in a stable state. I guess the swarm cluster ended up in a bad state once the inode issue appeared.
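If it happens again, it may be worth capturing the swarm state from both sides before restarting anything. A minimal set of checks (nothing UCP-specific, just the stock CLI) would be:
# On a manager: node membership and availability as the managers see it
docker node ls
# On a manager: tasks scheduled on the affected worker and their current state
docker node ps swarm-worker000003
# On the worker itself: whether it still considers itself an active swarm member
docker info --format '{{ .Swarm.LocalNodeState }}'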
By any chance, is there a way you can share steps to reproduce step (2) above ("Deploy a number of services (accumulated worker images are about 14GB)"), as close to what you actually tried as possible, so that we can reproduce your environment internally and investigate?
Regarding 17.06.2-ee-4: we are running into some delays with getting the VHDs (that work the way we want with 17.06.2-ee-4) published through Azure. Will update once that is done and we are ready.
Sure - it is probably a side effect of Node.js applications, where thousands of tiny files make up the application. I'll see if I can locate a suitable example; otherwise I'll publish a sample for you that triggers the issue.
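In the meantime, a throwaway image along these lines (entirely hypothetical, it just mimics a large node_modules tree) should exhaust inodes on a worker long before it runs out of disk space:
FROM alpine:3.6
# Hypothetical repro: create ~200,000 empty files; each costs an inode but almost no blocks
# (slow but dependency-free)
RUN mkdir -p /app/node_modules && \
    i=0; while [ "$i" -lt 200000 ]; do \
        touch "/app/node_modules/f$i.js"; \
        i=$((i+1)); \
    done
Building a handful of variants of that and deploying them as services should reproduce the "no space left on device" failure while df -h still shows plenty of free space.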
@ddebroy - @jeffnessen mentioned he has a suitable test container for you.