for-aws
Cloudstor Plugin Disabled in Fresh Stable Install
Expected behavior
Using Docker for AWS 17.06 (stable channel) to do a fresh install in the Ireland AWS region, I expect to see the AWS Cloudstor plugin installed and enabled.
Actual behavior
Plugin is installed but disabled. Enabling it causes the following error:
Error response from daemon: dial unix /run/docker/plugins/038e08d3f72ba67503691db2521ba669711f930df5d4a3afd9664b915038f2de/cloudstor.sock: connect: no such file or directory
Information
- Full output of the diagnostics from "docker-diagnose" run from one of the instances:
docker-diagnose produces no output. Does it log to a file?
Steps to reproduce the behavior
- Use Docker for AWS 17.06 stable channel to do a fresh install in the Ireland AWS region
- Run docker plugin ls
Side Question
I just want to confirm that the syntax used below to reference an EBS/EFS volume is correct, and that it means Docker has permission to create or delete EBS/EFS volumes for me.
version: '3.3'

services:
  artefacts:
    image: sonatype/nexus3
    ports:
      - "8081:8081"
    volumes:
      - artefacts:/nexus-data

volumes:
  artefacts:
    driver: "cloudstor:aws"
    driver_opts:
      backing: relocatable
      ebstype: gp2
      size: 100
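For what it's worth, a quick way to sanity-check this (a hedged sketch, assuming the stack is named "nexus"; the volume name is prefixed with the stack name, and the volume only shows up on the node running the task):

# Deploy the stack, then confirm the volume was created by the cloudstor driver
docker stack deploy -c docker-compose.yml nexus
docker volume ls
docker volume inspect nexus_artefacts    # "Driver" should read cloudstor:aws

If the inspect output shows the cloudstor:aws driver and the service starts, the plugin was able to create the backing EBS volume on your behalf.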
@RehanSaeed did you use the template that uses an existing VPC, or did you deploy a fresh VPC?
I would like to see the docker logs from the VMs where enablement is failing. When you run docker-diagnose, it should generate output similar to the below:
OK hostname=... session=1502125537-PMccxiQyCPy4PvEF3P2FXDiy4hMHqNM7
OK hostname=... session=1502125537-PMccxiQyCPy4PvEF3P2FXDiy4hMHqNM7
.
.
.
Done requesting diagnostics.
Your diagnostics session ID is 1502125537-PMccxiQyCPy4PvEF3P2FXDiy4hMHqNM7
Please provide this session ID to the maintainer debugging your issue.
What do you see when you run docker-diagnose? If docker-diagnose is not working for some reason, can you post the /var/log/docker.log file here from a node where the cloudstor enablement is failing, please?
I'm using an existing VPC. docker-diagnose produces no output on the command line:
~ $ docker-diagnose
~ $
Is there a reason for this? Is this another bug? I've attached my log file, docker.txt, instead.
@RehanSaeed both docker-diagnose not working and the cloudstor issue could be a result of DNS not being properly configured in your existing VPC. Your pre-existing VPC needs to have the enableDNSSupport flag set, as mentioned in http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-dns.html. In the VPC UI, you need to select the Enable DNS hostnames option. Without this, the EFS IP is not reachable by cloudstor and it most likely bails right away. docker-diagnose also depends on DNS to reach the VMs, and if that is not configured it will not work. Various other aspects like logging, load balancer integration, etc. are probably also going to fail without DNS being configured in the VPC.
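A quick way to check both attributes from the CLI (a hedged sketch; vpc-0abc123 is a placeholder for your VPC ID):

# Check the two VPC DNS attributes
aws ec2 describe-vpc-attribute --vpc-id vpc-0abc123 --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id vpc-0abc123 --attribute enableDnsHostnames

# Enable whichever one comes back false (one attribute per call)
aws ec2 modify-vpc-attribute --vpc-id vpc-0abc123 --enable-dns-support '{"Value":true}'
aws ec2 modify-vpc-attribute --vpc-id vpc-0abc123 --enable-dns-hostnames '{"Value":true}'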
I ran a quick test with the 17.06 template to deploy a fresh VPC in Ireland region and everything came up fine including cloudstor in enabled state.
@ddebroy I checked my VPC and that option is already enabled:
- DNS resolution: yes
- DNS hostnames: yes
I did notice that I could not use the machine names of my EC2 instances from inside my containers; I had to refer to them by internal or external IP address. When I'm logged into the Docker host, I don't have this problem. Not sure if this problem is related.
Is there anything else I can look for? Is there a way I can troubleshoot this problem? I'm also going to get my network team to look at this.
I've been debugging your docker-diagnose script. The following curl returns a curl: (52) Empty reply from server response:
curl -s http://10.0.0.100:9024/instances/all/
Calling the following manually:
SESSION="$(date +'%s')-$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64 | tr -d "=+/" | dd bs=32 count=1 2>/dev/null)"
curl -X POST -F "session=${SESSION}" -s http://10.0.0.100:44554/diagnose
Returns the following response from one of the three nodes:
OK hostname=ip-10-0-0-100-bridgeinternationalacademies-com session=1502276255-3tN67Y8V9oy4G3HnYfQny6Ua9uFWgrjk
I've now set up a brand new VPC, verified that it can talk to the DNS server, and created a new Docker Swarm using the template with the new VPC. One of the nodes starts up in swarm mode while the other two have swarm mode disabled. One thing we do that might be different is that we use domain controllers in the DHCP Options Set settings. Once again, docker-diagnose does not work, so I managed to get the session ID manually instead:
OK hostname=ip-10-2-1-117-bridgeinternationalacademies-com session=1502698488-8ACHicg0bu6InKG8wjpf4i4FrB1h7F47
I also still get the error from the cloudstor plugin. @ddebroy Have you tried deploying the template into an existing VPC? Does this even work?
@RehanSaeed thanks for manually posting the logs! I am now able to access them. Looking through the cloudstor bits, I see the following error when cloudstor tries to initialize and mount the EFS:
Aug 14 07:43:54 moby root: time="2017-08-14T07:43:54Z" level=info msg="time=\"2017-08-14T07:43:54Z\" level=fatal msg=\"Could not mount: mount failed: exit status 1" plugin=759f1fdb8e7661747cedcce30827b0362eee48d3a9955ddd9e89972b35186d34
Aug 14 07:43:54 moby root: time="2017-08-14T07:43:54Z" level=info msg="output=\"mount: bad address 'fs-179748de.efs.eu-west-1.amazonaws.com'\\n\"\" " plugin=759f1fdb8e7661747cedcce30827b0362eee48d3a9955ddd9e89972b35186d34
It appears the hosts are not able to resolve the EFS mount target fs-179748de.efs.eu-west-1.amazonaws.com, and that is typically due to some form of network/DNS configuration issue in the existing VPC. Once that step fails, cloudstor init exits with a fatal error, leaving the plugin status disabled.
Given the above scenario, there are two things you can try:
- Figure out why the EFS mount target is not resolving in your deployments (see the resolution check sketched after this list). I am assuming the EFS and the mount targets have been set up by the CFN template, but is there something in the VPC that is blocking access?
- Try disabling the EFS feature when initializing in a region with EFS support (such as Ireland) by setting the "Create EFS prerequisites for CloudStor?" option to No. This allows you to use EBS volumes only.
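For the first point, a minimal resolution check from one of the nodes (a sketch using standard tools; nslookup may not be present in the minimal host image, in which case getent is an alternative):

# Should print the mount target IP for the node's subnet/AZ
nslookup fs-179748de.efs.eu-west-1.amazonaws.com
# Alternative if nslookup is not available
getent hosts fs-179748de.efs.eu-west-1.amazonaws.com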
@RehanSaeed another thing to check is whether you have a restrictive security rule or firewall running within your VPC that blocks various ports by default (including the NFS/RPC ones required above). During initial bringup of the swarm, one of the manager swarm nodes becomes the leader while the other nodes query the leader IP for the token on port 9024. In your logs, it appears the other swarm nodes are unable to query the leader on 9024, most likely due to some sort of firewall refusing connections:
wget: can't connect to remote host (10.2.0.245): Connection refused
So note that you are not even ending up with all the nodes joined into a single swarm cluster.
Checking Security Groups
I checked my VPC's default security group and network ACLs, and all inbound and outbound ports are wide open. The Route Tables for my VPC's three subnets all have a local route mapped by default and another for the internet gateway at 0.0.0.0/0. All other security groups have been applied to the EC2 instances by the CloudFormation template.
Checking EFS Mount Target
When I ping fs-179748de.efs.eu-west-1.amazonaws.com from a node, I get:
ping: bad address 'fs-179748de.efs.eu-west-1.amazonaws.com'
When I do it from my local machine I get:
ping: unknown host fs-179748de.efs.eu-west-1.amazonaws.com
Why does this host name not exist? I am using a local domain controller in my DHCP Options Set instead of the default AmazonProvidedDNS DHCP Options Set. Does the Amazon DNS provider do something special to resolve fs-179748de.efs.eu-west-1.amazonaws.com?
Checking Port 9024
I also logged into each swarm manager node and curl'ed the other nodes on port 9024. I got a response:
curl 10.2.0.245:9024
token/
I can see no reason why two of my three nodes fail to initialize swarm mode. They can talk to each other over port 9024.
Ok, I now know why the EFS DNS name cannot be resolved. Buried in the AWS docs "Mounting on Amazon EC2 with a DNS Name" it says:
The connecting EC2 instance must be inside a VPC and must be configured to use the DNS server provided by Amazon.
@ddebroy Is there a way around this problem? It's a pretty debilitating limitation on Amazon's part.
I spun up a new swarm with EFS disabled. The cloudstor plugin now shows up as enabled. However, one of my three nodes did not start with swarm mode enabled:
OK hostname=ip-10-2-1-232-bridgeinternationalacademies-com session=1502794527-2zG0bx67BXeyRfYi7P4rgLiBQ7OmxNSj
@ddebroy Can you check my session ID? 2/3 nodes started correctly, which suggests this is not a DNS issue.
@RehanSaeed It's great that you managed to root-cause the EFS DNS issue! I was under the impression that with the DNS options enabled in the VPC, name resolution would be performed by EC2's DNS, and I was not aware of the custom DHCP Options Set config through which you configured a custom DNS to take precedence. Now that the root cause for the cloudstor issue in private VPCs with a custom DHCP Options Set is understood, I think we should close this issue and track the swarm-join debugging (below) in a separate issue.
Going through your last log and delving more into your previous logs (involving 10.2.1.117 as a manager node trying to join 10.2.0.245 as leader):
We see the request 10.2.1.117 -> 10.2.0.245:9024 reaching and being serviced by 10.2.0.245:9024:
2017-08-14T07:41:54.425644320Z Path:[GET] /token/manager/
2017-08-14T07:41:54.425669072Z User: Wget [10.2.1.117:38460]
2017-08-14T07:41:54.425672867Z userIP: 10.2.1.117 on port 38460
Yet 10.2.1.117 sees wget errors according to its init logs:
2017-08-14T07:40:52.325087784Z wget: can't connect to remote host (10.2.0.245): Connection refused
2017-08-14T07:42:06.401646114Z wget: error getting response
Your latest logs for 10.2.1.232 also show a similar pattern when trying to get the tokens from 10.2.0.209:9024:
2017-08-15T10:34:30.876765250Z wget: can't connect to remote host (10.2.0.209): Connection refused
2017-08-15T10:35:41.501217581Z wget: error getting response
From the init logs of 10.2.0.245, I found that the initial connection-refused errors are due to the meta-aws container not yet being up and listening on 9024 (based on timestamps from the log entries) when the other nodes begin to query. However, the real culprit is the final wget attempt that fails with wget: error getting response in both cases above. Can you do a curl on the /token/manager/ endpoint of the leader node (i.e. on 10.2.0.245:9024) from a manager node like 10.2.1.117 and see what errors you get? Typically if meta-aws runs into errors it logs them in the meta-aws logs, but those logs appear to be quite clean:
2017-08-14T07:41:00.402580760Z AWS service
2017-08-14T07:41:54.425644320Z Path:[GET] /token/manager/
2017-08-14T07:41:54.425669072Z User: Wget [10.2.1.117:38460]
2017-08-14T07:41:54.425672867Z userIP: 10.2.1.117 on port 38460
2017-08-14T07:41:54.425675760Z
2017-08-14T07:41:54.429460770Z Path:[GET] /token/manager/
2017-08-14T07:41:54.429473919Z User: Wget [10.2.2.101:57376]
2017-08-14T07:41:54.429477509Z userIP: 10.2.2.101 on port 57376
2017-08-14T07:41:54.429480302Z
2017-08-14T08:13:33.404339542Z Path:[GET] /instances/all/
2017-08-14T08:13:33.404366402Z User: curl/7.52.1 [172.17.0.1:35634]
2017-08-14T08:13:33.404370712Z userIP: 172.17.0.1 on port 35634
2017-08-14T08:13:33.404373600Z
2017-08-14T08:14:10.207420023Z Path:[GET] /instances/all/
2017-08-14T08:14:10.207459795Z User: curl/7.52.1 [172.17.0.1:35662]
2017-08-14T08:14:10.207463869Z userIP: 172.17.0.1 on port 35662
2017-08-14T08:14:10.207466865Z
2017-08-14T08:14:36.557445524Z Path:[GET] /instances/all/
2017-08-14T08:14:36.557480106Z User: curl/7.52.1 [172.17.0.1:35680]
2017-08-14T08:14:36.557483938Z userIP: 172.17.0.1 on port 35680
2017-08-14T08:14:36.557487020Z
EFS Issue
I think the EFS issue can be worked around by using the EFS IP address for the relevant availability zone instead of the DNS name, as sketched below. This would require a change in the CloudFormation template. Is this a common scenario? I would have thought that a lot of users have a domain controller, so it would be.
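A minimal sketch of that workaround, assuming the mount-target IP for the node's availability zone is looked up first (fs-179748de is the EFS ID from the logs above; 10.2.0.50 is a placeholder IP):

# List the mount targets (one per subnet/AZ) and note the IpAddress for this node's AZ
aws efs describe-mount-targets --file-system-id fs-179748de --region eu-west-1

# Mount by IP instead of the DNS name
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 -o nfsvers=4.1 10.2.0.50:/ /mnt/efs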
Swarm Initialization Failure
I confirmed that the meta-aws container is running on all three of my nodes. When I curl {IP Address}, I do get a response outputting /token. When I curl {IP Address}:9024/token/manager/, I get an error:
~ $ curl 10.2.0.209:9024/token/manager/
curl: (52) Empty reply from server
~ $ curl 10.2.1.232:9024/token/manager/
curl: (52) Empty reply from server
~ $ curl 10.2.2.187:9024/token/manager/
curl: (52) Empty reply from server
@RehanSaeed You are correct about using the EFS IP for mounting as a workaround. However, the mount IPs are different for each subnet (corresponding to each AZ), and there is no CFN template mechanism I am aware of that allows passing the correct mount-target IP in the (common) customdata based on an ASG VM's availability zone. I will have to research this a bit and potentially move the mount command line within our agent.
Regarding the swarm issue: it seems like the meta server in 10.2.0.209 is not working correctly. If it were, it should have returned Access Denied when the curl was initiated from a manager node that is already part of the swarm. Can you push up the diag logs from 10.2.0.209 just in case they have any further details about the processing of those requests?
It would be nice to get an EFS fix, so I think this issue should remain open. I have opened https://github.com/docker/for-aws/issues/92 for the swarm initialization failure.
Since this hasn't been documented anywhere, here is how to install the plugin without the CloudFormation template (I will cross-post here: https://forums.docker.com/t/enable-efs-for-cloudstor-aws-plugin/37447). The plugin requires two EFS file systems, and you must note their IDs because they need to be passed to the install command as EFS_ID_REGULAR and EFS_ID_MAXIO. See @ddebroy's comment below.
- All of the DNS concerns are handled by not changing any of the Amazon defaults. Simply launch your instances and EFS in the same VPC. Configure the security group to allow NFS traffic from itself and assign the security group to both the EFS and your EC2 instances (see the sketch after this list). Note the DNS name for your EFS. The EFS ID will look something like fs-abcd0123
- Test mounting EFS from one of your instances
$ mkdir -p /mnt/reg/efs
$ mount fs-abcd0123.efs.us-east-2.amazonaws.com:/ /mnt/reg/efs
$ df -T
If the file system mounts properly, then any issues past this point are solely related to the docker plugin. Remember, EFS is just NFS.
$ umount /mnt/reg/efs
- Remove the plugin if it's already installed
$ docker plugin ls
$ docker plugin rm cloudstor:aws
- The crucial bit when installing the plugin is setting EFS_ID_REGULAR and EFS_ID_MAXIO to the EFS shares you created earlier.
$ docker plugin install --alias cloudstor:aws --grant-all-permissions docker4x/cloudstor:17.06.0-ce-aws2 CLOUD_PLATFORM=AWS EFS_ID_REGULAR=fs-abcd0123 AWS_REGION=us-east-2 EFS_SUPPORTED=1 DEBUG=1 AWS_STACK_ID=nostack EFS_ID_MAXIO=fs-abcd2222
- $ docker plugin ls should show cloudstor:aws as enabled. Rerun $ df -T and verify the NFS shares are mounted. You'll see two, at /mnt/efs/reg and /mnt/efs/max. You will not be charged for the second one unless you store data in it.
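For the security-group step in the first item, a hedged CLI sketch (sg-0123abcd is a placeholder security group ID; the rule allows NFS from members of the same group):

# Allow inbound NFS (TCP 2049) from instances in the same security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123abcd \
  --protocol tcp \
  --port 2049 \
  --source-group sg-0123abcd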
@bennnjamin Thanks for documenting the steps. I would recommend updating the steps above to not use the same EFS for EFS_ID_REGULAR and EFS_ID_MAXIO, as that may lead to duplicate enumeration of cloudstor-backed volumes (which map to directories in the EFS). Keeping the two separate is safer and allows you to use the maxio option for volumes if desired later. Note that the separate EFS won't lead to additional charges from AWS unless any data is kept there.
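If I remember the cloudstor options correctly (worth verifying against the Docker for AWS persistent-storage docs, so treat the option names here as my assumption), the MaxIO file system is selected per volume with a perfmode driver option, for example:

# Assumed option names: backing=shared selects EFS, perfmode=maxio selects the MaxIO EFS
docker volume create -d "cloudstor:aws" \
  --opt backing=shared \
  --opt perfmode=maxio \
  sharedvol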
@ddebroy I will update the steps to clarify that cloudstor:aws requires two EFS file systems. Since I only have one EFS, I set both variables to the same ID; without both set, you will experience this error: Error response from daemon: dial unix /run/docker/plugins/c3427e5f18c08d25845c03be2da134546047639f6e54a56d945005d0a873c7d4/cloudstor.sock: connect: no such file or directory
It would be nice if the plugin worked without having to create two EFS file systems.
You can also start CloudStor without EFS support, so it only uses EBS volumes:
docker plugin install --alias cloudstor:aws --grant-all-permissions docker4x/cloudstor:17.06.0-ce-aws2 CLOUD_PLATFORM=AWS DEBUG=1 AWS_REGION=us-east-1 EFS_SUPPORTED=0 AWS_STACK_ID=nostack
That way you don't have to set up two EFS file systems.
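Once the plugin is installed this way, volumes are EBS-backed; for example, reusing the driver options from the compose file earlier in this thread:

# Create a relocatable (EBS-backed) gp2 volume of 25 GB
docker volume create -d "cloudstor:aws" \
  --opt backing=relocatable \
  --opt ebstype=gp2 \
  --opt size=25 \
  mydata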
Could this be a timing issue where the EFS resource/mount point is not yet available, so that when the plugin is installed it cannot mount the EFS volume?
In the Docker for AWS template I do not see a dependency between the ManagerAsg, for instance, and the MountTargetX (e.g. MountTargetGP1, etc.) objects. Does this mean that a manager instance could start before the EFS mount target is available? If the EFS mount target is not available, will the effect be an installed but disabled cloudstor plugin? If so, then I believe the defect is in the CF template.
I'm new to CloudFormation and to Docker swarm/cloudstor, so I may be misinterpreting things, but my own experimentation leads me to believe that the error and the disabled cloudstor plugin are related to when the EFS mount targets are completed. When the stack is being created, sometimes the EFS mount targets are completed before the cloudstor plugin is installed and sometimes they are not ready.
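One way to test the timing theory (a hedged sketch; fs-abcd0123 is a placeholder EFS ID) is to check the state of the mount targets around the time the plugin tries to enable:

# Each mount target should report "available"; "creating" would support the race theory
aws efs describe-mount-targets --file-system-id fs-abcd0123 \
  --query 'MountTargets[].{Subnet:SubnetId,State:LifeCycleState,IP:IpAddress}'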
YES! Thank you so much for all the info y'all have put in here. I'm also defining a new AWS docker swarm through terraform, and this stuff was invaluable. I don't think I would've been able to get cloudstor working without this info.
One other thing for any other confused nerd that might stumble upon this: your EFSs must have mount targets that point to your VPC in order for your EC2 instances to find them. https://www.terraform.io/docs/providers/aws/r/efs_mount_target.html <- that must exist & point your EFSs to your VPC.
Also! The security group that you have associated with your EC2 instances & EFSs must have the port 2049 open for TCP traffic.
Thanks again!
Similar results with the Cloudstor plugin on Azure.
@bitsofinfo let's not pollute a thread about Docker for AWS with Azure concerns, but I had the same issue with cloudstor.sock in a Docker Swarm and resolved it by turning off "Secure Transfer Enabled" for the storage account. Not ideal, but I can't find anywhere to report an issue with the plugin...