dcos-e2e
There was an unknown error when performing a doctor check.
After I installed minidcos on my up-to-date CentOS 7.5
(uname -a: Linux <server> 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux)
with the command for the Linux package:
sudo curl --fail -L https://github.com/dcos/dcos-e2e/releases/download/2018.12.10.0/minidcos -o /usr/local/bin/minidcos && sudo chmod +x /usr/local/bin/minidcos
I also changed the owner of the directory and the file with:
sudo chown myuser:myuser
... as I have to use sudo on my systems to install packages.
After the installation finished, I executed the command minidcos docker doctor -v
All checks went well until the last one (13/13), which gave me the following error:
Note: Docker has approximately 31.5 GB of memory available. The amount of memory required depends on the workload. For example, creating large clusters or multiple clusters requires a lot of memory.
A four node cluster seems to work well on a machine with 9 GB of memory available to Docker.
12/13 checks complete:
2018-12-23 17:54:22 ERROR dcos_e2e._common | docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
ERROR:dcos_e2e._common:docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
2018-12-23 17:54:22 ERROR dcos_e2e._common | See 'docker run --help'.
ERROR:dcos_e2e._common:See 'docker run --help'.
Error: There was an unknown error when performing a doctor check.
The doctor function was "_check_can_mount_in_docker".
The error was: "Command '['docker', 'exec', '--user', 'root', '--interactive', '53adb21fa73f33c64b2d63ab63b28e97096a251f2c25d719cf4944b44d54979b', 'docker', 'run', '-v', '/foo', 'alpine']' returned non-zero exit status 125.".
12/13 checks complete: Exception ignored in: <bound method tqdm.__del__ of 12/13 checks complete: ▏ >
Traceback (most recent call last):
File "site-packages/tqdm/_tqdm.py", line 931, in __del__
File "site-packages/tqdm/_tqdm.py", line 1133, in close
File "site-packages/tqdm/_tqdm.py", line 496, in _decr_instances
File "site-packages/tqdm/_monitor.py", line 52, in exit
File "threading.py", line 1053, in join
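For context, the failing "_check_can_mount_in_docker" step boils down to asking the Docker daemon inside a cluster node to start a container with a volume mount. A sketch of reproducing it by hand (NODE is a placeholder copied from the error message above; this assumes a node container with that ID is still running):

```shell
# Hypothetical manual reproduction of the "_check_can_mount_in_docker" step.
# NODE is a placeholder: the ID of a running minidcos node container,
# taken from the error message above.
NODE=53adb21fa73f33c64b2d63ab63b28e97096a251f2c25d719cf4944b44d54979b
if command -v docker >/dev/null 2>&1; then
  # Ask the node's inner Docker daemon to run a container with a volume;
  # exit status 125 here would match the doctor failure above.
  docker exec --user root --interactive "$NODE" docker run -v /foo alpine \
    || echo "inner docker run failed with exit status $?"
else
  echo "docker CLI not found; run this on the affected host"
fi
```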
However, the Docker daemon is actually up and running:
$ sudo systemctl status docker
docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since So 2018-12-23 17:32:45 CET; 18min ago
Docs: https://docs.docker.com
Main PID: 15404 (dockerd)
CGroup: /system.slice/docker.service
├─15404 /usr/bin/dockerd -H unix://
├─15427 containerd --config /var/run/docker/containerd/containerd....
├─16403 containerd-shim -namespace moby -workdir /var/lib/docker/c...
├─24416 containerd-shim -namespace moby -workdir /var/lib/docker/c...
├─24515 containerd-shim -namespace moby -workdir /var/lib/docker/c...
├─25318 containerd-shim -namespace moby -workdir /var/lib/docker/c...
├─26123 containerd-shim -namespace moby -workdir /var/lib/docker/c...
└─26898 runc --root /var/run/docker/runtime-runc/moby --log /run/d...
Dez 23 17:50:25 little dockerd[15404]: time="2018-12-23T17:50:25.352687757+...1d
Dez 23 17:50:25 little dockerd[15404]: time="2018-12-23T17:50:25.362184302+...e"
Dez 23 17:50:26 little dockerd[15404]: time="2018-12-23T17:50:26.522587501+...58
Dez 23 17:50:37 little dockerd[15404]: time="2018-12-23T17:50:37.994556673+...e"
Dez 23 17:50:38 little dockerd[15404]: time="2018-12-23T17:50:38.413301781+...7b
Dez 23 17:50:38 little dockerd[15404]: time="2018-12-23T17:50:38.422892823+...e"
Dez 23 17:50:39 little dockerd[15404]: time="2018-12-23T17:50:39.389365804+...16
Dez 23 17:50:42 little dockerd[15404]: time="2018-12-23T17:50:42.570627795+...15
Dez 23 17:50:50 little dockerd[15404]: time="2018-12-23T17:50:50.697512712+...18
Dez 23 17:50:59 little dockerd[15404]: time="2018-12-23T17:50:58.995491606+...23
Hint: Some lines were ellipsized, use -l to show in full.
I tried the same after starting the Docker daemon manually:
$ sudo dockerd
INFO[2018-12-23T19:16:39.403460151+01:00] parsed scheme: "unix" module=grpc
INFO[2018-12-23T19:16:39.404589503+01:00] scheme "unix" not registered, fallback to default scheme module=grpc
INFO[2018-12-23T19:16:39.404830250+01:00] parsed scheme: "unix" module=grpc
INFO[2018-12-23T19:16:39.404916296+01:00] scheme "unix" not registered, fallback to default scheme module=grpc
INFO[2018-12-23T19:16:39.405424749+01:00] ccResolverWrapper: sending new addresses to cc: [{unix:///run/containerd/containerd.sock 0 <nil>}] module=grpc
INFO[2018-12-23T19:16:39.405605361+01:00] ClientConn switching balancer to "pick_first" module=grpc
INFO[2018-12-23T19:16:39.405818241+01:00] pickfirstBalancer: HandleSubConnStateChange: 0xc42076cad0, CONNECTING module=grpc
INFO[2018-12-23T19:16:39.407404153+01:00] ccResolverWrapper: sending new addresses to cc: [{unix:///run/containerd/containerd.sock 0 <nil>}] module=grpc
INFO[2018-12-23T19:16:39.410009554+01:00] ClientConn switching balancer to "pick_first" module=grpc
INFO[2018-12-23T19:16:39.410287038+01:00] pickfirstBalancer: HandleSubConnStateChange: 0xc420886190, CONNECTING module=grpc
INFO[2018-12-23T19:16:39.408256998+01:00] pickfirstBalancer: HandleSubConnStateChange: 0xc42076cad0, READY module=grpc
INFO[2018-12-23T19:16:39.411416250+01:00] pickfirstBalancer: HandleSubConnStateChange: 0xc420886190, READY module=grpc
INFO[2018-12-23T19:16:39.462749589+01:00] [graphdriver] using prior storage driver: overlay2
INFO[2018-12-23T19:16:39.545900130+01:00] Graph migration to content-addressability took 0.00 seconds
INFO[2018-12-23T19:16:39.549423401+01:00] Loading containers: start.
INFO[2018-12-23T19:16:41.842242207+01:00] Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address
INFO[2018-12-23T19:16:42.713012048+01:00] Loading containers: done.
INFO[2018-12-23T19:16:42.831660976+01:00] Docker daemon commit=4d60db4 graphdriver(s)=overlay2 version=18.09.0
INFO[2018-12-23T19:16:42.832064665+01:00] Daemon has completed initialization
INFO[2018-12-23T19:16:42.886521566+01:00] API listen on /var/run/docker.sock
Then I retried:
$ pwd
/usr/local/bin
[myuser@server bin]$ ./minidcos docker doctor
In the "dockerd" terminal it said:
INFO[2018-12-23T19:17:47.437329516+01:00] Container 6faf2e38300319c5fe2c18ef8758d771ad496ac1930e96fbaa32870894bfadf0 failed to exit within 10 seconds of signal 15 - using the force
INFO[2018-12-23T19:17:47.899840140+01:00] ignoring event module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
INFO[2018-12-23T19:18:00.165255179+01:00] Container 6171b37f290a7265cbae3e1cf388da5aa88269b5cd8bb58df867a05f5af8203f failed to exit within 10 seconds of signal 15 - using the force
INFO[2018-12-23T19:18:00.614194441+01:00] ignoring event module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
INFO[2018-12-23T19:18:12.738779801+01:00] Container faccf28f1779cfb1bc814dd836b58c238e44228458d6f9a40d1c746734009c2f failed to exit within 10 seconds of signal 15 - using the force
INFO[2018-12-23T19:18:13.189271409+01:00] ignoring event module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
INFO[2018-12-23T19:18:25.333069615+01:00] Container c85462287ccb5f9063ff351d26043cd8b9c6d7037583f7c2359bfb8c76b1742a failed to exit within 10 seconds of signal 15 - using the force
INFO[2018-12-23T19:18:25.782169752+01:00] ignoring event module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
INFO[2018-12-23T19:18:38.035307969+01:00] Container 0831d004a297ef2377b6272b199bd98114db1ee157dd98c671d153b14c93b5e9 failed to exit within 10 seconds of signal 15 - using the force
INFO[2018-12-23T19:18:38.498353545+01:00] ignoring event module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
INFO[2018-12-23T19:19:07.051312557+01:00] ignoring event module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
INFO[2018-12-23T19:19:08.542774254+01:00] ignoring event module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
INFO[2018-12-23T19:19:09.774780657+01:00] ignoring event module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
INFO[2018-12-23T19:19:36.829187350+01:00] ignoring event module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
INFO[2018-12-23T19:19:37.939741998+01:00] ignoring event module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
INFO[2018-12-23T19:19:39.307944482+01:00] ignoring event module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
and the output in the "minidcos" terminal still was:
Note: Docker has approximately 31.5 GB of memory available. The amount of memory required depends on the workload. For example, creating large clusters or multiple clusters requires a lot of memory.
A four node cluster seems to work well on a machine with 9 GB of memory available to Docker.
12/13 checks complete:
2018-12-23 19:19:36 ERROR dcos_e2e._common | docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
ERROR:dcos_e2e._common:docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
2018-12-23 19:19:36 ERROR dcos_e2e._common | See 'docker run --help'.
ERROR:dcos_e2e._common:See 'docker run --help'.
Error: There was an unknown error when performing a doctor check.
The doctor function was "_check_can_mount_in_docker".
The error was: "Command '['docker', 'exec', '--user', 'root', '--interactive', '20425e8f106830d9b7b4157b434fe698af42ac67a9f7a6d95a704fb924a71fe5', 'docker', 'run', '-v', '/foo', 'alpine']' returned non-zero exit status 125.".
12/13 checks complete: Exception ignored in: <bound method tqdm.__del__ of 12/13 checks complete: >
Traceback (most recent call last):
File "site-packages/tqdm/_tqdm.py", line 931, in __del__
File "site-packages/tqdm/_tqdm.py", line 1133, in close
File "site-packages/tqdm/_tqdm.py", line 496, in _decr_instances
File "site-packages/tqdm/_monitor.py", line 52, in exit
File "threading.py", line 1053, in join
But a:
$ ls -l /var/run/docker.sock
srw-rw----. 1 root docker 0 23. Dez 19:16 /var/run/docker.sock
seems to prove that the Docker socket is available.
The current user is a member of the "docker" group, so I can execute any docker command without "sudo". I tried the same commands as root, but nothing changed.
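As a quick sanity check (a sketch; this only inspects the local host, not the Docker daemon inside the minidcos nodes), group membership and socket permissions can be verified with:

```shell
# Check whether the current user is in the "docker" group, which is what
# grants access to /var/run/docker.sock without sudo.
if id -nG | tr ' ' '\n' | grep -qx docker; then
  echo "current user is in the docker group"
else
  echo "current user is NOT in the docker group"
fi
# Show the socket's owner, group and permissions (if it exists).
ls -l /var/run/docker.sock 2>/dev/null || echo "no docker socket present"
```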
P.S.:
$ docker version
Client:
Version: 18.09.0
API version: 1.39
Go version: go1.10.4
Git commit: 4d60db4
Built: Wed Nov 7 00:48:22 2018
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.0
API version: 1.39 (minimum version 1.12)
Go version: go1.10.4
Git commit: 4d60db4
Built: Wed Nov 7 00:19:08 2018
OS/Arch: linux/amd64
Experimental: false
Are there any updates on this? I ran into this issue on CentOS 7 too: I could run docker commands directly, but the minidcos docker doctor command failed. Any help would be appreciated, thanks.
I could successfully run the create action by specifying the Docker version, with the command:
minidcos docker create --docker-version 17.12.1-ce ./dcos_generate_config.sh --agents 0
ref to https://github.com/dcos/dcos-e2e/issues/1252#issuecomment-409343923
Although minidcos docker doctor still failed, the cluster seems to work well.
Full workaround to successfully reach the Mesosphere web UI page (the commands should be run as a non-root user):
minidcos docker create --docker-version 17.12.1-ce ./dcos_generate_config.sh --agents 0
minidcos docker wait --cluster-id default
minidcos docker web
Exact same issue with me too.
Host OS: CentOS Linux release 7.6.1810 (Core)
Docker Version: 18.09.1
$ ls -l /var/run/docker.sock
srw-rw---- 1 root docker 0 ජන 12 10:07 /var/run/docker.sock
minidcos --version
minidcos, version 2019.01.10.0
Just like @dennyx said, I can also create and set up the cluster, but minidcos docker doctor still fails.
It would be great if I could get some help with this.
Update: Tried downgrading Docker to 17.12.1-ce, but no joy!
I cannot get miniDC/OS release 2019.05.03.0 to work on CentOS 7 in Docker. However, I can successfully create a cluster with release 2019.05.23.1, on CentOS Linux release 7.6.1810 with Docker version 18.09.6. I do need to specify the Docker version, though, and minidcos docker doctor still fails.
To create the cluster:
$ minidcos docker create --variant oss --agents 1 --cluster-id default --docker-version 17.12.1-ce dcos_generate_config.sh
$ minidcos docker wait --cluster-id default
Thanks all for your contributions to this thread. It is interesting and it includes multiple issues:
- Ugly traceback in an error message
Traceback (most recent call last):
File "site-packages/tqdm/_tqdm.py", line 931, in __del__
File "site-packages/tqdm/_tqdm.py", line 1133, in close
File "site-packages/tqdm/_tqdm.py", line 496, in _decr_instances
File "site-packages/tqdm/_monitor.py", line 52, in exit
File "threading.py", line 1053, in join
This traceback should no longer be shown, thanks to an update to tqdm a while back.
- What the error means
2018-12-23 19:19:36 ERROR dcos_e2e._common | docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
ERROR:dcos_e2e._common:docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
2018-12-23 19:19:36 ERROR dcos_e2e._common | See 'docker run --help'.
ERROR:dcos_e2e._common:See 'docker run --help'.
While folks here have been inspecting their local Docker instances, the error refers to Docker on the minidcos nodes (Docker in Docker).
- The requirement to specify --docker-version
In #1574 (released in 2019.06.07.0), I made the change "Changed the default version of Docker installed on minidcos docker clusters to 18.06.3-ce.".
This should mean that folks no longer have to specify --docker-version by default in the newest minidcos.
- Next steps
What seems clear is that using Docker version 1.13.1 on nodes is problematic on some machines.
The main pain should be taken away with the update that makes 18.06.3-ce the default version.
However, the doctor command will still fail.
Ideally we can narrow down exactly what the problem is, and make that clear in the doctor error.
An intermediate step might look for the given error and just move on with a warning, potentially one which links to this issue.
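Such an intermediate step could look roughly like the following wrapper (a sketch only, not actual minidcos code; the function name is illustrative, and the marker string is taken from the error messages in this thread):

```shell
# Sketch: run a doctor-style command; if it fails with the known
# mount-check error, downgrade the failure to a warning and carry on.
run_doctor_leniently() {
  output=$("$@" 2>&1)
  status=$?
  if [ "$status" -eq 0 ]; then
    echo "doctor: all checks passed"
  elif printf '%s' "$output" | grep -q '_check_can_mount_in_docker'; then
    # Known failure from this issue: warn instead of aborting.
    echo "doctor: WARNING: known mount-check failure ignored (see this issue)"
  else
    printf '%s\n' "$output" >&2
    return "$status"
  fi
}

# Usage on an affected host:
#   run_doctor_leniently minidcos docker doctor
```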