UCP Not showing accurate disk usage
Expected behavior
UCP should give an accurate indication of worker disk usage
Actual behavior
Worker disk appears full despite UCP reporting available space
Information
- Full output of the diagnostics from "docker-diagnose", run from one of the instances:
OK hostname=swarm-manager000000 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-manager000001 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-manager000002 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-worker000000 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-worker000001 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-worker000002 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-worker000003 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
OK hostname=swarm-worker000004 session=1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
Done requesting diagnostics.
Your diagnostics session ID is 1508455476-xYJctnSfYB8MOH214dEgMMHXyxYPChN7
Please provide this session ID to the maintainer debugging your issue.
Steps to reproduce the behavior
- Spin up a Docker cluster using the beta template from https://github.com/docker/for-azure/issues/38 (worker instances are D3_V2)
- Deploy a number of services (accumulated worker images are about 14GB)
- Service deployments begin to fail with "No such image: <image-name>"
- Verify the image exists in DTR and is pullable
- Log on to a worker and attempt to pull the image (~200MB):
swarm-worker000003:~$ docker pull <image-name>: Pulling from <repo>
6d987f6f4279: Already exists
d0e8a23136b3: Already exists
5ad5b12a980e: Already exists
275352573fee: Pull complete
ffbeb13b7578: Pull complete
027bb24d721d: Pull complete
aa04d7355dfa: Extracting [==================================================>] 45.51MB/45.51MB
failed to register layer: Error processing tar file(exit status 1): mkdir /app/node_modules/@types/lodash/gt: no space left on device
- Check disk space from the worker:
swarm-worker000003:~$ df -h
Filesystem Size Used Available Use% Mounted on
overlay 29.4G 17.5G 10.4G 63% /
tmpfs 6.8G 4.0K 6.8G 0% /dev
tmpfs 6.8G 0 6.8G 0% /sys/fs/cgroup
tmpfs 6.8G 161.4M 6.7G 2% /etc
/dev/sda1 29.4G 17.5G 10.4G 63% /home
tmpfs 6.8G 161.4M 6.7G 2% /mnt
shm 6.8G 0 6.8G 0% /dev/shm
tmpfs 6.8G 161.4M 6.7G 2% /lib/firmware
/dev/sda1 29.4G 17.5G 10.4G 63% /var/log
/dev/sda1 29.4G 17.5G 10.4G 63% /etc/ssh
tmpfs 6.8G 161.4M 6.7G 2% /lib/modules
/dev/sda1 29.4G 17.5G 10.4G 63% /etc/hosts
/dev/sda1 29.4G 17.5G 10.4G 63% /var/etc/hostname
/dev/sda1 29.4G 17.5G 10.4G 63% /etc/resolv.conf
/dev/sda1 29.4G 17.5G 10.4G 63% /var/etc/docker
tmpfs 1.4G 1.3M 1.4G 0% /var/run/docker.sock
/dev/sda1 29.4G 17.5G 10.4G 63% /var/lib/waagent
tmpfs 6.8G 161.4M 6.7G 2% /usr/local/bin/docker
/dev/sdb1 200.0G 119.0M 199.9G 0% /mnt/resource
- Check the UCP dashboard (screenshot: the dashboard still shows free disk space on the worker)
The fact that the disk is full at all with only 14GB of data seems likely related to #19 and #29.
But unlike when we experienced #38, there was no indication from the dashboard (or even from the worker instance itself) that some underlying storage resource was full (see the df output above).
@ddebroy some additional information - we are exhausting inodes, as shown below.
swarm-worker000003:~$ df -i
Filesystem Inodes Used Available Use% Mounted on
overlay 1966080 1960215 5865 100% /
tmpfs 1792091 186 1791905 0% /dev
tmpfs 1792091 15 1792076 0% /sys/fs/cgroup
tmpfs 1792091 1884 1790207 0% /etc
/dev/sda1 1966080 1960215 5865 100% /home
tmpfs 1792091 1884 1790207 0% /mnt
shm 1792091 1 1792090 0% /dev/shm
tmpfs 1792091 1884 1790207 0% /lib/firmware
/dev/sda1 1966080 1960215 5865 100% /var/log
/dev/sda1 1966080 1960215 5865 100% /etc/ssh
tmpfs 1792091 1884 1790207 0% /lib/modules
/dev/sda1 1966080 1960215 5865 100% /etc/hosts
/dev/sda1 1966080 1960215 5865 100% /var/etc/hostname
/dev/sda1 1966080 1960215 5865 100% /etc/resolv.conf
/dev/sda1 1966080 1960215 5865 100% /var/etc/docker
tmpfs 1792091 376 1791715 0% /var/run/docker.sock
/dev/sda1 1966080 1960215 5865 100% /var/lib/waagent
tmpfs 1792091 1884 1790207 0% /usr/local/bin/docker
/dev/sdb1 256 27 229 11% /mnt/resource
Based on https://github.com/moby/moby/issues/10613 we ran docker rmi $(docker images -q --filter "dangling=true"), which brought inode usage down to 21%.
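For anyone else who hits this, the check-and-clean sequence looked roughly like the following (a sketch; docker system prune is an alternative on 17.06 but removes more than just dangling images):
# Check inode usage on the root filesystem backing /var/lib/docker
df -i /
# Remove dangling (untagged) image layers; this is what freed the inodes for us
docker rmi $(docker images -q --filter "dangling=true")
# Broader alternative: also removes stopped containers and unused networks
docker system prune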
Seems like something is off with the VHD used by the template I pointed to earlier: it is not mounting /dev/sdb correctly. Will update with more findings.
Thanks @ddebroy
Update: It turns out the template I referred to earlier, https://download.docker.com/azure/17.06/17.06.2/Docker-DDC.tmpl, points to VHD 1.0.9, which did not incorporate the enhancement to mount /var/lib/docker on the second, larger sdb disk provisioned by Azure. That enhancement is first being rolled out in the CE version and then, depending on how things go, will be rolled out as part of the next EE release: 17.06.2-ee4.
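For what it's worth, a quick way to verify whether a given node is running an image with that enhancement (a sketch; assumes the default data root of /var/lib/docker):
# Ask the daemon where its data root lives
docker info --format '{{ .DockerRootDir }}'
# Check which device backs it; with the enhancement it should be /dev/sdb1 rather than /dev/sda1
df -h /var/lib/docker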
Re-reading the above, it sounds like the df -h output was in sync with what UCP was reporting, but the problem was the inode exhaustion, which docker rmi took care of, correct?
Yes @ddebroy, but we ended up in a bad state between managers and workers, similar to what is described here: https://github.com/docker/swarm/issues/2044
Although we could pull down images after the docker rmi, tasks wouldn't advance past the 'assigned' state, and the workers were logging the following:
Not enough managers yet. We only have 0 and we need 3 to continue.
sleep for a bit, and try again when we wake up.
We tried provisioning a new worker (which failed to connect) and restarting the UCP agent and controller, to no avail.
At this point, we deleted the cluster again and may wait for 17.06.2-ee4. Is there an expected release date?
Hmm, I am not sure of the steps you took, but a worker will never log the message:
Not enough managers yet. We only have 0 and we need 3 to continue.
sleep for a bit, and try again when we wake up.
It is something a new manager logs when it is unable to join the swarm. Sounds like you were trying to bring up new manager nodes? Looking through your diagnostics logs from the initial message, the swarm appears to be in a stable state. I guess the swarm cluster ended up in a bad state once the inode issue appeared.
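If it happens again, it may be worth capturing the swarm state from both sides before restarting anything. A minimal set of checks (nothing UCP-specific, just the stock CLI) would be:
# On a manager: node membership and availability as the managers see it
docker node ls
# On a manager: tasks scheduled on the affected worker and their current state
docker node ps swarm-worker000003
# On the worker itself: whether it still considers itself an active swarm member
docker info --format '{{ .Swarm.LocalNodeState }}'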
By any chance, is there a way you can share steps to reproduce step (2) above ("Deploy a number of services (accumulated worker images are about 14GB)"), as close to what you actually tried as possible, so that we can reproduce your environment internally and investigate?
Regarding 17.06.2-ee-4: we are running into some delays with getting the VHDs (that work the way we want with 17.06.2-ee-4) published through Azure. Will update once that is done and we are ready.
Sure - it is probably a side effect of Node.js applications, where thousands of tiny files make up the application. I'll see if I can locate a suitable example; otherwise I'll publish a sample for you that triggers the issue.
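In the meantime, a throwaway image along these lines (entirely hypothetical, it just mimics a large node_modules tree) should exhaust inodes on a worker long before it runs out of disk space:
FROM alpine:3.6
# Hypothetical repro: create ~200,000 empty files; each costs an inode but almost no blocks
# (slow but dependency-free)
RUN mkdir -p /app/node_modules && \
    i=0; while [ "$i" -lt 200000 ]; do \
        touch "/app/node_modules/f$i.js"; \
        i=$((i+1)); \
    done
Building a handful of variants of that and deploying them as services should reproduce the "no space left on device" failure while df -h still shows plenty of free space.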
@ddebroy - @jeffnessen mentioned he has a suitable test container for you.