for-aws
Cloudstor Plugin Disabled in Fresh Stable Install
Expected behavior
Using Docker for AWS 17.06 (stable channel) to do a fresh install in the Ireland AWS region, I expect to see the AWS Cloudstor plugin installed and enabled.
Actual behavior
Plugin is installed but disabled. Enabling it causes the following error:
Error response from daemon: dial unix /run/docker/plugins/038e08d3f72ba67503691db2521ba669711f930df5d4a3afd9664b915038f2de/cloudstor.sock: connect: no such file or directory
Information
- Full output of the diagnostics from "docker-diagnose" run from one of the instances:
docker-diagnose produces no output. Does it log to a file?
Steps to reproduce the behavior
- Use Docker for AWS 17.06 stable channel to do a fresh install in the Ireland AWS region
- Run docker plugin ls
Side Question
I just want to confirm that the syntax used below to reference an EBS/EFS volume is correct, and that it means Docker has permission to create or delete EBS/EFS volumes for me.
version: '3.3'

services:
  artefacts:
    image: sonatype/nexus3
    ports:
      - "8081:8081"
    volumes:
      - artefacts:/nexus-data

volumes:
  artefacts:
    driver: "cloudstor:aws"
    driver_opts:
      backing: relocatable
      ebstype: gp2
      size: 100
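For what it's worth, a quick way to sanity-check this (a hedged sketch, assuming the stack is named "nexus"; the volume name is prefixed with the stack name, and the volume only shows up on the node running the task):

# Deploy the stack, then confirm the volume was created by the cloudstor driver
docker stack deploy -c docker-compose.yml nexus
docker volume ls
docker volume inspect nexus_artefacts    # "Driver" should read cloudstor:aws

If the inspect output shows the cloudstor:aws driver and the service starts, the plugin was able to create the backing EBS volume on your behalf.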
@RehanSaeed did you use the template that uses an existing VPC, or did you deploy a fresh VPC?
I would like to see the docker logs from the VMs where enablement is failing. When you run docker-diagnose, it should generate output similar to the below:
OK hostname=... session=1502125537-PMccxiQyCPy4PvEF3P2FXDiy4hMHqNM7
OK hostname=... session=1502125537-PMccxiQyCPy4PvEF3P2FXDiy4hMHqNM7
.
.
.
Done requesting diagnostics.
Your diagnostics session ID is 1502125537-PMccxiQyCPy4PvEF3P2FXDiy4hMHqNM7
Please provide this session ID to the maintainer debugging your issue.
What do you see when you run docker-diagnose? If docker-diagnose is not working for some reason, can you post the /var/log/docker.log file here from a node where the cloudstor enablement is failing, please?
I'm using an existing VPC. docker-diagnose produces no output on the command line:
~ $ docker-diagnose
~ $
Is there a reason for this? Is this another bug? I've attached my log file, docker.txt, instead.
@RehanSaeed both docker-diagnose not working and the cloudstor issue could be a result of DNS not being properly configured in your existing VPC. Your pre-existing VPC needs to have the enableDNSSupport flag set, as mentioned in http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-dns.html. In the VPC UI, you need to select the Enable DNS hostnames option. Without this, the EFS IP is not reachable by cloudstor and it most likely bails right away. docker-diagnose also depends on DNS to reach the VMs, and if that is not configured it will not work. Various other aspects like logging, load balancer integration, etc. are probably also going to fail without DNS being configured in the VPC.
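A quick way to check both attributes from the CLI (a hedged sketch; vpc-0abc123 is a placeholder for your VPC ID):

# Check the two VPC DNS attributes
aws ec2 describe-vpc-attribute --vpc-id vpc-0abc123 --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id vpc-0abc123 --attribute enableDnsHostnames

# Enable whichever one comes back false (one attribute per call)
aws ec2 modify-vpc-attribute --vpc-id vpc-0abc123 --enable-dns-support '{"Value":true}'
aws ec2 modify-vpc-attribute --vpc-id vpc-0abc123 --enable-dns-hostnames '{"Value":true}'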
I ran a quick test with the 17.06 template to deploy a fresh VPC in Ireland region and everything came up fine including cloudstor in enabled state.
@ddebroy I checked my VPC and that option is already enabled:
- DNS resolution: yes
- DNS hostnames: yes
I did notice that I could not use the machine names of my EC2 instances from inside my containers; I had to refer to them by internal or external IP address. When I'm logged into the Docker host, I don't have this problem. Not sure if this problem is related.
Is there anything else I can look for? Is there a way I can troubleshoot this problem? I'm also going to get my network team to look at this.
I've been debugging your docker-diagnose script. The following curl returns a curl: (52) Empty reply from server response:
curl -s http://10.0.0.100:9024/instances/all/
Calling the following manually:
SESSION="$(date +'%s')-$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64 | tr -d "=+/" | dd bs=32 count=1 2>/dev/null)"
curl -X POST -F "session=${SESSION}" -s http://10.0.0.100:44554/diagnose
Returns the following response from one of the three nodes:
OK hostname=ip-10-0-0-100-bridgeinternationalacademies-com session=1502276255-3tN67Y8V9oy4G3HnYfQny6Ua9uFWgrjk
I've now set up a brand new VPC, verified that it can talk to the DNS server, and created a new Docker Swarm using the template with the new VPC. One of the nodes starts up in swarm mode while the other two have swarm mode disabled. One thing we do that might be different is that we use domain controllers in the DHCP Options Set settings. Once again, docker-diagnose does not work, so I managed to get the session ID manually instead:
OK hostname=ip-10-2-1-117-bridgeinternationalacademies-com session=1502698488-8ACHicg0bu6InKG8wjpf4i4FrB1h7F47
I also still get the error from the cloudstor plugin. @ddebroy Have you tried deploying the template into an existing VPC? Does this even work?
@RehanSaeed thanks for manually posting the logs! I am now able to access them. Looking through the cloudstor bits, I see the following error when cloudstor tries to initialize and mount the EFS:
Aug 14 07:43:54 moby root: time="2017-08-14T07:43:54Z" level=info msg="time=\"2017-08-14T07:43:54Z\" level=fatal msg=\"Could not mount: mount failed: exit status 1" plugin=759f1fdb8e7661747cedcce30827b0362eee48d3a9955ddd9e89972b35186d34
Aug 14 07:43:54 moby root: time="2017-08-14T07:43:54Z" level=info msg="output=\"mount: bad address 'fs-179748de.efs.eu-west-1.amazonaws.com'\\n\"\" " plugin=759f1fdb8e7661747cedcce30827b0362eee48d3a9955ddd9e89972b35186d34
It appears the hosts are not able to resolve the EFS mount target fs-179748de.efs.eu-west-1.amazonaws.com, and that is typically due to some form of network/DNS configuration issue in the existing VPC. Once that step fails, cloudstor init exits with a fatal error, leaving the plugin status disabled.
Given the above scenario, there are two things you can try:
- Figure out why the EFS mount target is not resolving in your deployments (see the resolution check sketched after this list). I am assuming the EFS and the mount targets have been set up by the CFN template, but is there something in the VPC that is blocking access?
- Try disabling the EFS feature when initializing in a region with EFS support (such as Ireland) by setting the "Create EFS prerequisites for CloudStor?" option to No. This allows you to use EBS volumes only.
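For the first point, a minimal resolution check from one of the nodes (a sketch using standard tools; nslookup may not be present in the minimal host image, in which case getent is an alternative):

# Should print the mount target IP for the node's subnet/AZ
nslookup fs-179748de.efs.eu-west-1.amazonaws.com
# Alternative if nslookup is not available
getent hosts fs-179748de.efs.eu-west-1.amazonaws.com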
@RehanSaeed another thing to check is whether you have a restrictive security rule or firewall running within your VPC that blocks various ports by default (including the NFS/RPC ones required above). During initial bringup of the swarm, one of the manager swarm nodes becomes the leader while the other nodes query the leader IP for the token on port 9024. In your logs, it appears the other swarm nodes are unable to query the leader on 9024, most likely due to some sort of firewall refusing connections:
wget: can't connect to remote host (10.2.0.245): Connection refused
So note that you are not even ending up with all the nodes joined into a single swarm cluster.
Checking Security Groups
I checked my VPC's default security group and network ACLs, and all inbound and outbound ports are wide open. The Route Tables for my VPC's three subnets all have a local route mapped by default and another for the internet gateway at 0.0.0.0/0. All other security groups have been applied to the EC2 instances by the CloudFormation template.
Checking EFS Mount Target
When I ping fs-179748de.efs.eu-west-1.amazonaws.com from a node, I get:
ping: bad address 'fs-179748de.efs.eu-west-1.amazonaws.com'
When I do it from my local machine I get:
ping: unknown host fs-179748de.efs.eu-west-1.amazonaws.com
Why does this host name not exist? I am using a local domain controller in my DHCP Options Set instead of the default AmazonProvidedDNS DHCP Options Set. Does the Amazon DNS provider do something special to resolve fs-179748de.efs.eu-west-1.amazonaws.com?
Checking Port 9024
I also logged into each swarm manager node and curl'ed the other nodes on port 9024. I got a response:
curl 10.2.0.245:9024
token/
I can see no reason why two of my three nodes fail to initialize swarm mode. They can talk to each other over port 9024.
Ok, I now know why the EFS DNS name cannot be resolved. Buried in the AWS docs "Mounting on Amazon EC2 with a DNS Name" it says:
The connecting EC2 instance must be inside a VPC and must be configured to use the DNS server provided by Amazon.
@ddebroy Is there a way around this problem? It's a pretty debilitating limitation on Amazon's part.
I spun up a new swarm with EFS disabled. The cloudstor plugin now shows up as enabled. However, one of my three nodes did not start with swarm mode enabled:
OK hostname=ip-10-2-1-232-bridgeinternationalacademies-com session=1502794527-2zG0bx67BXeyRfYi7P4rgLiBQ7OmxNSj
@ddebroy Can you check my session ID? 2/3 nodes started correctly, which suggests this is not a DNS issue.
@RehanSaeed It's great that you managed to root-cause the EFS DNS issue! I was under the impression that with the DNS options enabled in the VPC, name resolution would be performed by EC2's DNS, and I was not aware of the custom DHCP Options Set config through which you configured a custom DNS to take precedence. Now that the root cause for the cloudstor issue in private VPCs with a custom DHCP Options Set is understood, I think we should close this issue and track the swarm-join debugging (below) in a separate issue.
Going through your last log and delving more into your previous logs (involving 10.2.1.117 as a manager node trying to join 10.2.0.245 as leader):
We see the request 10.2.1.117 -> 10.2.0.245:9024 reaching and being serviced by 10.2.0.245:9024:
2017-08-14T07:41:54.425644320Z Path:[GET] /token/manager/
2017-08-14T07:41:54.425669072Z User: Wget [10.2.1.117:38460]
2017-08-14T07:41:54.425672867Z userIP: 10.2.1.117 on port 38460
Yet 10.2.1.117 sees wget errors according to its init logs:
2017-08-14T07:40:52.325087784Z wget: can't connect to remote host (10.2.0.245): Connection refused
2017-08-14T07:42:06.401646114Z wget: error getting response
Your latest logs for 10.2.1.232 also show a similar pattern when trying to get the tokens from 10.2.0.209:9024:
2017-08-15T10:34:30.876765250Z wget: can't connect to remote host (10.2.0.209): Connection refused
2017-08-15T10:35:41.501217581Z wget: error getting response
From the init logs of 10.2.0.245, I found that the initial connection-refused errors are due to the meta-aws container not yet being up and listening on 9024 (based on timestamps from the log entries) when the other nodes begin to query. However, the real culprit is the final wget attempt that fails with wget: error getting response in both cases above. Can you do a curl on the /token/manager/ endpoint of the leader node (i.e. on 10.2.0.245:9024) from a manager node like 10.2.1.117 and see what errors you get? Typically if meta-aws runs into errors it logs them in the meta-aws logs, but those logs appear to be quite clean:
2017-08-14T07:41:00.402580760Z AWS service
2017-08-14T07:41:54.425644320Z Path:[GET] /token/manager/
2017-08-14T07:41:54.425669072Z User: Wget [10.2.1.117:38460]
2017-08-14T07:41:54.425672867Z userIP: 10.2.1.117 on port 38460
2017-08-14T07:41:54.425675760Z
2017-08-14T07:41:54.429460770Z Path:[GET] /token/manager/
2017-08-14T07:41:54.429473919Z User: Wget [10.2.2.101:57376]
2017-08-14T07:41:54.429477509Z userIP: 10.2.2.101 on port 57376
2017-08-14T07:41:54.429480302Z
2017-08-14T08:13:33.404339542Z Path:[GET] /instances/all/
2017-08-14T08:13:33.404366402Z User: curl/7.52.1 [172.17.0.1:35634]
2017-08-14T08:13:33.404370712Z userIP: 172.17.0.1 on port 35634
2017-08-14T08:13:33.404373600Z
2017-08-14T08:14:10.207420023Z Path:[GET] /instances/all/
2017-08-14T08:14:10.207459795Z User: curl/7.52.1 [172.17.0.1:35662]
2017-08-14T08:14:10.207463869Z userIP: 172.17.0.1 on port 35662
2017-08-14T08:14:10.207466865Z
2017-08-14T08:14:36.557445524Z Path:[GET] /instances/all/
2017-08-14T08:14:36.557480106Z User: curl/7.52.1 [172.17.0.1:35680]
2017-08-14T08:14:36.557483938Z userIP: 172.17.0.1 on port 35680
2017-08-14T08:14:36.557487020Z
EFS Issue
I think the EFS issue can be worked around by using the EFS IP address for the relevant availability zone instead of the DNS name, as sketched below. This would require a change in the CloudFormation template. Is this a common scenario? I would have thought that a lot of users have a domain controller, so it would be.
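A minimal sketch of that workaround, assuming the mount-target IP for the node's availability zone is looked up first (fs-179748de is the EFS ID from the logs above; 10.2.0.50 is a placeholder IP):

# List the mount targets (one per subnet/AZ) and note the IpAddress for this node's AZ
aws efs describe-mount-targets --file-system-id fs-179748de --region eu-west-1

# Mount by IP instead of the DNS name
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 -o nfsvers=4.1 10.2.0.50:/ /mnt/efs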
Swarm Initialization Failure
I confirmed that the meta-aws container is running on all three of my nodes. When I curl {IP Address}, I do get a response outputting /token. When I curl {IP Address}:9024/token/manager/, I get an error:
~ $ curl 10.2.0.209:9024/token/manager/
curl: (52) Empty reply from server
~ $ curl 10.2.1.232:9024/token/manager/
curl: (52) Empty reply from server
~ $ curl 10.2.2.187:9024/token/manager/
curl: (52) Empty reply from server
@RehanSaeed You are correct about using the EFS IP for mounting as a workaround. However, the mount IPs are different for each subnet (corresponding to each AZ), and there is no CFN template mechanism I am aware of that allows passing the correct mount-target IP in the (common) customdata based on an ASG VM's availability zone. I will have to research this a bit and potentially move the mount command line within our agent.
Regarding the swarm issue: it seems like the meta server in 10.2.0.209 is not working correctly. If it were, it should have returned Access Denied when the curl was initiated from a manager node that is already part of the swarm. Can you push up the diag logs from 10.2.0.209 just in case they have any further details about the processing of those requests?
It would be nice to get an EFS fix, so I think this issue should remain open. I have opened https://github.com/docker/for-aws/issues/92 for the swarm initialization failure.
Since this hasn't been documented anywhere, here is how to install the plugin without the CloudFormation template (I will cross-post here: https://forums.docker.com/t/enable-efs-for-cloudstor-aws-plugin/37447). The plugin requires two EFS file systems, and you must note their IDs because they need to be passed to the install command as EFS_ID_REGULAR and EFS_ID_MAXIO. See @ddebroy's comment below.
- All of the DNS concerns are handled by not changing any of the Amazon defaults. Simply launch your instances and EFS in the same VPC. Configure the security group to allow NFS traffic from itself and assign the security group to both the EFS and your EC2 instances (see the sketch after this list). Note the DNS name for your EFS. The EFS ID will look something like fs-abcd0123
- Test mounting EFS from one of your instances
$ mkdir -p /mnt/reg/efs
$ mount fs-abcd0123.efs.us-east-2.amazonaws.com:/ /mnt/reg/efs
$ df -T
If the file system mounts properly, then any issues past this point are solely related to the docker plugin. Remember, EFS is just NFS.
$ umount /mnt/reg/efs
- Remove the plugin if it's already installed
$ docker plugin ls
$ docker plugin rm cloudstor:aws
- The crucial bit when installing the plugin is setting EFS_ID_REGULAR and EFS_ID_MAXIO to the EFS shares you created earlier.
$ docker plugin install --alias cloudstor:aws --grant-all-permissions docker4x/cloudstor:17.06.0-ce-aws2 CLOUD_PLATFORM=AWS EFS_ID_REGULAR=fs-abcd0123 AWS_REGION=us-east-2 EFS_SUPPORTED=1 DEBUG=1 AWS_STACK_ID=nostack EFS_ID_MAXIO=fs-abcd2222
- $ docker plugin ls should show cloudstor:aws as enabled. Rerun $ df -T and verify the NFS shares are mounted. You'll see two, at /mnt/efs/reg and /mnt/efs/max. You will not be charged for the second one unless you store data in it.
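For the security-group step in the first item, a hedged CLI sketch (sg-0123abcd is a placeholder security group ID; the rule allows NFS from members of the same group):

# Allow inbound NFS (TCP 2049) from instances in the same security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123abcd \
  --protocol tcp \
  --port 2049 \
  --source-group sg-0123abcd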
@bennnjamin Thanks for documenting the steps. I would recommend updating the steps above to not use the same EFS for EFS_ID_REGULAR and EFS_ID_MAXIO, as that may lead to duplicate enumeration of cloudstor-backed volumes (which map to directories in the EFS). Keeping the two separate is safer and allows you to use the maxio option for volumes if desired later. Note that the separate EFS won't lead to additional charges from AWS unless any data is kept there.
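If I remember the cloudstor options correctly (worth verifying against the Docker for AWS persistent-storage docs, so treat the option names here as my assumption), the MaxIO file system is selected per volume with a perfmode driver option, for example:

# Assumed option names: backing=shared selects EFS, perfmode=maxio selects the MaxIO EFS
docker volume create -d "cloudstor:aws" \
  --opt backing=shared \
  --opt perfmode=maxio \
  sharedvol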
@ddebroy I will update the steps to clarify that cloudstor:aws requires two EFS file systems. Since I only have one EFS, I set both variables to the same ID; without both set, you will experience this error: Error response from daemon: dial unix /run/docker/plugins/c3427e5f18c08d25845c03be2da134546047639f6e54a56d945005d0a873c7d4/cloudstor.sock: connect: no such file or directory
It would be nice if the plugin worked without having to create two EFS file systems.
You can also start CloudStor without EFS support, so it only uses EBS volumes:
docker plugin install --alias cloudstor:aws --grant-all-permissions docker4x/cloudstor:17.06.0-ce-aws2 CLOUD_PLATFORM=AWS DEBUG=1 AWS_REGION=us-east-1 EFS_SUPPORTED=0 AWS_STACK_ID=nostack
That way you don't have to set up two EFS file systems.
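Once the plugin is installed this way, volumes are EBS-backed; for example, reusing the driver options from the compose file earlier in this thread:

# Create a relocatable (EBS-backed) gp2 volume of 25 GB
docker volume create -d "cloudstor:aws" \
  --opt backing=relocatable \
  --opt ebstype=gp2 \
  --opt size=25 \
  mydata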
Could this be a timing issue where the EFS resource/mount point is not yet available, so that when the plugin is installed it cannot mount the EFS volume?
In the Docker for AWS template I do not see a dependency between the ManagerAsg, for instance, and the MountTargetX (e.g. MountTargetGP1, etc.) objects. Does this mean that a manager instance could start before the EFS mount target is available? If the EFS mount target is not available, will the effect be an installed but disabled cloudstor plugin? If so, then I believe the defect is in the CF template.
I'm new to CloudFormation and to Docker swarm/cloudstor, so I may be misinterpreting things, but my own experimentation leads me to believe that the error and the disabled cloudstor plugin are related to when the EFS mount targets are completed. When the stack is being created, sometimes the EFS mount targets are completed before the cloudstor plugin is installed and sometimes they are not ready.
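One way to test the timing theory (a hedged sketch; fs-abcd0123 is a placeholder EFS ID) is to check the state of the mount targets around the time the plugin tries to enable:

# Each mount target should report "available"; "creating" would support the race theory
aws efs describe-mount-targets --file-system-id fs-abcd0123 \
  --query 'MountTargets[].{Subnet:SubnetId,State:LifeCycleState,IP:IpAddress}'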
YES! Thank you so much for all the info y'all have put in here. I'm also defining a new AWS docker swarm through terraform, and this stuff was invaluable. I don't think I would've been able to get cloudstor working without this info.
One other thing for any other confused nerd that might stumble upon this: your EFSs must have mount targets that point to your VPC in order for your EC2 instances to find them. https://www.terraform.io/docs/providers/aws/r/efs_mount_target.html <- that must exist & point your EFSs to your VPC.
Also! The security group that you have associated with your EC2 instances & EFSs must have the port 2049 open for TCP traffic.
Thanks again!
Similar results with the Cloudstor plugin on Azure.
@bitsofinfo let's not pollute a thread about Docker for AWS with Azure concerns, but I had the same issue with cloudstor.sock in a Docker Swarm and resolved it by turning off "Secure Transfer Enabled" for the storage account. Not ideal, but I can't find anywhere to report an issue with the plugin...