Docker for AWS doesn't appear to benefit from ECR Policy during Stack / Service Creation
Expected behavior
Adding the Amazon managed policy "AmazonEC2ContainerRegistryReadOnly" to the CloudFormation stack created by Docker for AWS should allow Docker to find images without having to use aws ecr get-login. Future updates of the service via the scheduler should also be able to access ECR without needing credentials cached via the --with-registry-auth command line argument.
This would give Docker swarm parity with Amazon's ECS when using ECR as a private registry, and would avoid the confusion of Docker for AWS honoring some policies but ignoring others.
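For illustration, attaching that managed policy to the role generated by the stack is a single AWS CLI call; this is only a sketch, and the role name below is a placeholder for whatever name your CloudFormation stack actually generated:
# Sketch: attach the read-only ECR managed policy to the role created by the
# Docker for AWS CloudFormation stack. The role name is a placeholder; use the
# actual ProxyRole name from your stack's Resources tab.
aws iam attach-role-policy \
  --role-name Docker-MyStack-ProxyRole-EXAMPLE \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly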
Actual behavior
Using docker stack deploy -c <composefile.yaml> <stack name> with the Amazon policy "AmazonEC2ContainerRegistryReadOnly" added results in the image not being found:
$ docker-swarm service ps 3sgzt87ip19t
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
x6l8jtn9hnw0 esports_wallets.1 ###.dkr.ecr.us-east-1.amazonaws.com/path/to/image:tag ip-172-31-20-65.ec2.internal Ready Rejected 2 seconds ago "No such image: ###.d…"
Using aws ecr get-login works during creation with the following form of the command: docker stack deploy -c <composefile.yaml> --with-registry-auth <stack name>. However, after the Amazon password expires (12 hours), any managers or nodes not already primed will not be able to find the image, as shown above.
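For reference, the flow that does work at creation time looks roughly like this; the region, compose file, and stack name are examples:
# get-login prints a docker login command; the $(...) substitution executes it.
$(aws ecr get-login --region us-east-1)
# Deploy while the freshly cached credentials are still valid.
docker stack deploy -c composefile.yaml --with-registry-auth mystack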
Information
I'm using Docker for AWS, just released with Docker version 1.13.0. I'm able to use docker stack deploy -c <composefile.yaml> --with-registry-auth <stack name> without issue after having done an aws ecr get-login... first.
However, because AWS ECR docker login passwords are only good for 12 hours, if the scheduler rebalances the deployed services after 12 hours and the node a task lands on doesn't have the existing authentication, it fails to authenticate and cannot find the image.
I've read through the issues and found this one talking about enhancements to how --with-registry-auth works to allow it to use refresh tokens or the like: https://github.com/docker/docker/issues/24940
I chose to create a new ticket because https://github.com/docker/docker/issues/24940 appears to be more geared towards Docker Hub and less about ECR. I also opened this at https://github.com/docker/docker/issues/30713, but it was requested that I reopen the issue here.
I've also looked at several posts on the Docker for AWS forums. The one that pinpoints the problem with no real solution is this:
https://forums.docker.com/t/possible-to-use-aws-ecr-for-image-registry/22295/3
Considering this is Docker for AWS, and you have developed AWS-specific Docker containers to run on the EC2 instances, why can't we just include the ECR ReadOnly policy on the ProxyRole created during CloudFormation, so the swarm has access to ECR without having to worry about authentication at all?
I have attempted to add the policy (AmazonEC2ContainerRegistryReadOnly) to the ProxyRole used by the InstanceProfile for all Docker nodes in the CloudFormation stack; however, this has zero effect. The result from a service ps is as follows:
$ docker-swarm service ps 3sgzt87ip19t
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
x6l8jtn9hnw0 esports_wallets.1 ###.dkr.ecr.us-east-1.amazonaws.com/path/to/image:tag ip-172-31-20-65.ec2.internal Ready Rejected 2 seconds ago "No such image: ###.d…"
However, if I remove the service and recreate it after running $(aws ecr get-login --region us-east-1), it works fine, finds the image, and runs. But 12 hours later I'm likely back in the state referenced in the forum link above or in issue #24940.
We've been using ECS for over a year, and there, making sure the instance profile's role has the AmazonEC2ContainerRegistryReadOnly policy is all that's needed.
I dug a little into the containers deployed on an EC2 instance; there's cool stuff in there, such as how SSH is actually a container, as is the metadata service. My assumption is that we're not trying to assume the instance's role during a registry fetch. Or I could be talking out my butt there.
I've even gone as far as modifying the CloudFormation template to include the ECR policies at creation time, but it made no difference.
Steps to reproduce the behavior
- Use the AWS Console to apply the AmazonEC2ContainerRegistryReadOnly policy to the CloudFormation-generated ProxyRole used by the InstanceProfile of Docker for AWS
- Create a service or stack with an image referenced in ECR
- Run service ps or container inspect to see that the image is not found, i.e. authentication wasn't present.
Output of docker version:
Docker version 1.13.0, build 49bf474
Output of docker info:
Containers: 4
 Running: 4
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 1.13.0
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: awslogs
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
Swarm: active
 NodeID: 6kup6o2m8rmgrydrj9k3p1089
 Is Manager: true
 ClusterID: sgq7930ifiv21xlc94jci3tsd
 Managers: 1
 Nodes: 4
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 172.31.40.242
 Manager Addresses:
  172.31.40.242:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e
runc version: 2f7393a47307a16f8cee44a37b262e8b81021e3e
init version: 949e6fa
Security Options: seccomp
 Profile: default
Kernel Version: 4.9.4-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.785 GiB
Name: ip-172-31-40-242.ec2.internal
ID: REJ6:Q2GX:RYQD:KC3A:2ZK2:5JZJ:PT42:BT6B:64TN:XQED:WKBH:H7PC
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 77
 Goroutines: 190
 System Time: 2017-02-03T14:48:08.410775982Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Additional environment details (AWS, VirtualBox, physical, etc.):
AWS / Docker for AWS 1 manager and 3 workers
@ambrons so to sum up the issue, it looks like there are 2 problems:
- Need the IAM permissions to get the ecr tokens (http://docs.aws.amazon.com/AmazonECR/latest/userguide/ecr_managed_policies.html#AmazonEC2ContainerRegistryReadOnly)
- The ecr tokens expire every 12 hours, so we need a way to update these tokens before they expire.
Did I miss anything?
@kencochrane
Sorry for my long-winded report; I can see how it could be a little confusing.
The primary concern is your number one. I'm not familiar with the internals of how you are handling EC2 instance profiles => AWS roles, so perhaps it is indeed to get the ECR tokens; however, in my AWS experience the EC2 instance's InstanceProfile is assigned a Role which has policies attached. The EC2 instance can then assume the role for the purpose of obtaining access to AWS resources.
So in summary, the ideal would be to allow the user of Docker for AWS to opt in to (or get by default) inclusion of the ECR policy in CloudFormation. This would give Docker managers and nodes access to communicate with ECR without the Docker user having to use docker login at all. From the Docker user's perspective it would behave as if it were a public repo. I know ECS does this seamlessly, but I'm not sure if there's some magic they do under the covers to assume the role on the EC2 instance during a docker pull.
In either case, if the above is done, token expiry is moot because all instances in the Docker for AWS swarm would have access to ECR all the time based on the above role / policies.
@ambrons thank you for the details. This is what I'm thinking, let me know if that works for you.
We add a new template question that asks if you want to enable ECR support (yes or no, default to no)
If they say 'no':
- do nothing, same as today.
If they say 'yes':
- we add the read-only policy detailed here (http://docs.aws.amazon.com/AmazonECR/latest/userguide/ecr_managed_policies.html#AmazonEC2ContainerRegistryReadOnly) to the IAM role we create. This will allow us to get the ECR tokens.
- we add a background process that will run the docker login command with the ECR tokens every X hours, to make sure the tokens are never expired.
For 2, I'm not sure if we need to run that on every host, or if we just need to do it on one leader and swarm will automatically replicate it across; I'll need to do some research first.
Questions:
- Would this work for you?
- Would read-only be fine? If someone needs more than read-only, they will need to manually update the policy/role, which might not be ideal.
@kencochrane
I agree with the CloudFormation question for ECR support as well as adding the AmazonEC2ContainerRegistryReadOnly policy. Let that be a parameter of the template. I don't think users would need read/write access unless a container running on the swarm, say as part of a CI tool using Docker-in-Docker, built a Docker image and wanted to push it to ECR after it was created. I'm not sure how often that would be the case.
I know for our immediate needs ReadOnly would be fine.
As for the part about ECR tokens every X hours, I'm still not sure that's necessary. I've included the following documentation, which shows the interaction between an EC2 instance, InstanceProfile, Role, Policy, and an application running on EC2:
http://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html
I know that for the applications we've written to date that run in Docker on top of an EC2 instance, we do not have to handle tokens or the like, as long as the EC2 instance's InstanceProfile + Role includes the proper permissions in a policy attached to the role. Which is why I was surprised that when I added the ECR policy manually to the role generated by the Docker for AWS CloudFormation, it didn't just work.
My assumption is that the internal Docker containers running for AWS metadata and console access might be interfering with the natural behavior of the EC2 instance resource access model.
All of that said, regardless of the underpinnings, I believe the net result would be the same: the user of Docker swarm would not have to take any additional steps to pull from ECR; it would just work, assuming the above AmazonEC2ContainerRegistryReadOnly policy was associated with the Docker for AWS role.
> I agree with the CloudFormation question for ECR support as well as adding the AmazonEC2ContainerRegistryReadOnly policy. Let that be a parameter of the template. I don't think users would need read/write access unless a container running on the swarm, say as part of a CI tool using Docker-in-Docker, built a Docker image and wanted to push it to ECR after it was created. I'm not sure how often that would be the case.
> I know for our immediate needs read-only would be fine.
OK, we can start with read only for now, and see where that gets us.
> As for the part about ECR tokens every X hours, I'm still not sure that's necessary. I've included the following documentation, which shows the interaction between an EC2 instance, InstanceProfile, Role, Policy, and an application running on EC2:
> http://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html
> I know that for the applications we've written to date that run in Docker on top of an EC2 instance, we do not have to handle tokens or the like, as long as the EC2 instance's InstanceProfile + Role includes the proper permissions in a policy attached to the role. Which is why I was surprised that when I added the ECR policy manually to the role generated by the Docker for AWS CloudFormation, it didn't just work.
The IAM role will allow you to run the aws ecr commands to get the token, but you will still need to run aws ecr get-login every 12 hours, or else the token will expire. It doesn't automatically update your token for you. See http://docs.aws.amazon.com/cli/latest/reference/ecr/get-login.html for more details.
We could use something like this: https://github.com/awslabs/amazon-ecr-credential-helper, which is a Docker credential store that will automatically handle collecting the ECR tokens for you, without the need for running docker login, etc.
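For anyone unfamiliar with the helper, it plugs into the Docker client through the credHelpers section of the client config. A minimal sketch of the wiring, assuming the docker-credential-ecr-login binary is installed on the PATH; the account ID and region are placeholders, and this overwrites rather than merges any existing client config:
# Sketch: point the Docker client at the ECR credential helper for one registry.
cat > ~/.docker/config.json <<'EOF'
{
  "credHelpers": {
    "123456789012.dkr.ecr.us-east-1.amazonaws.com": "ecr-login"
  }
}
EOF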
@kencochrane
> The IAM role will allow you to run the aws ecr commands to get the token, but you will still need to run aws ecr get-login every 12 hours, or else the token will expire. It doesn't automatically update your token for you. See http://docs.aws.amazon.com/cli/latest/reference/ecr/get-login.html for more details. We could use something like this: https://github.com/awslabs/amazon-ecr-credential-helper, which is a Docker credential store that will automatically handle collecting the ECR tokens for you, without the need for running docker login, etc.
I guess amazon-ecr-credential-helper is similar to this, which runs on each EC2 container instance for ECS: https://github.com/aws/amazon-ecs-agent
Okay, so now that my magic has been debunked: is the plan to have the CloudFormation template for the swarm set all of this up for ECR users, such that if they choose to enable the ECR role or Docker repository in their CloudFormation stack, the created EC2 instances will keep the tokens refreshed?
So to recap, now that I'm on the same page with you: I don't have an issue using aws ecr get-login on my local system over an SSH tunnel to the Docker swarm manager to initially create deployments and stacks. That works as advertised. What isn't solved is that the token used during the install expires after 12 hours, so the scheduler gets borked if the service is updated or rebalanced after that point.
So, ignoring the IAM role for a minute, I guess the real issue is keeping the authentication refreshed after service or stack create / deployment.
Is there a way to auto-configure / deploy https://github.com/awslabs/amazon-ecr-credential-helper as a global container alongside the other global Docker containers on each CloudFormation instance, such that this all just works?
Having never used the Docker credential helper before, it appears you might not need an AWS key / secret if you use AWS_SDK_LOAD_CONFIG=true and rely on the instance's role. Or perhaps that's in addition to the AWS key and secret.
Short answer: yes, the goal is to allow someone to pick "enable ECR support" and have it deal with the IAM profile and keeping the token up to date.
Hi,
first of all, thanks for this CloudFormation template. It is really awesome.
We are also very interested in this feature. Have you already experimented with this, or is there any schedule for when ECR integration will be implemented?
Many thanks, Daniel
The lack of ECR support is a really big problem for us. We want to move from ECS to Docker for AWS, but would like all our build scripts to remain mostly the same...
@jeffhuys it is on the roadmap; the current estimate is for it to arrive with 17.06-ce.
It has been a while since I have commented on this issue, so I think it is time for an update to let everyone know where we stand.
Before I get into the current status, I'll give a little background.
The way ECR handles auth is by using IAM roles for your servers, which allow you to get short-lived auth tokens that can be used to docker push/pull your images.
There are two common ways of doing this: you can run aws ecr get-login, or use the amazon-ecr-credential-helper. Either will get your auth tokens and put them in the right location so that Docker can use them, just like any other credential.
Docker for AWS uses SwarmKit for orchestration, and when you deploy your service you can use the --with-registry-auth flag, which will take the ECR credentials you have locally and pass them to the workers that are going to run your service containers, so they have access to those credentials.
This all works as it should, and if you never needed the ECR auth tokens again there would be no problem. Unfortunately, you will most likely need the token in the future. Even if you are not going to do a redeploy, you would need it if a host goes down and your containers get rescheduled to a new node. If this new node doesn't have the required images already loaded in its image cache, it will need to pull those images from ECR. If this happens after the ECR auth token has expired, the image pulls from ECR will fail with invalid credentials.
Another important thing to consider is that there are two different parts to Docker, the server and the client. The server is the Docker daemon that runs on the host and manages the containers, etc. The client is a CLI binary which interacts with a given Docker daemon; it might be local, or it could be on a remote server. Docker auth credentials are handled by the Docker client, and the Docker server doesn't have access to them. Because of this, we can't use aws ecr get-login or amazon-ecr-credential-helper on swarm workers to get new auth tokens when the current ones expire.
One workaround would be to get a new auth token and then do a docker service update every X hours before the token expires, but this would cause a rolling update of the service just to update the registry creds on all worker nodes, which isn't ideal. We have some other ideas as well, but they are also not very good. Whatever solution we finally choose, we want to make sure it is something that works well for everyone and won't make things worse in the long run.
The only viable solution is to make changes to the way Docker swarm handles registry auth credentials. We are currently working with that team to come up with a solution that works not just for ECR but for all registries. If you want to track the progress, feel free to follow this issue: https://github.com/moby/moby/issues/24940
Until those changes are added to swarm, we are blocked. I'll be sure to update this issue with our progress, to let everyone know where we stand.
As a workaround for this (I didn't want to install the aws cli on the manager node), I was able to utilize the aws cli in the "guide-aws" container that is started on my manager node. So by running the following, I can capture the login command:
docker exec -it guide-aws sh -c 'aws ecr get-login --region us-east-1'
Then, I've set up a cron to run on the manager every 6 hours to re-login the manager.
Would love some thoughts on this...
@blaketastic2 did this approach work for you?
At first glance, I don't think it will work, because running aws ecr get-login will not update the password that is tied to the service. Swarm takes the password on service create, encrypts it, and stores it in the service config. If that password changes, the service will need to be updated in order for the new password to be set.
As far as I can tell, this cron works:
$ docker exec -it guide-aws sh -c 'aws ecr get-login --region us-east-1' | tr -d '\r' > login && ./login
Followed by this service update:
$ docker service update --with-registry-auth serviceA
@blaketastic2 yes, if you run the service update after the aws ecr get-login, that should work. It isn't ideal since it will cause a rolling update of the service. You will also need to remember to update every service that needs the new credentials.
If it works for you, then 👍
From talking to @diogomonica, this is how progress will be made on this problem:
- Docker will get pluggable external secret stores
- --with-registry-auth transitions to using Docker secrets for storing registry creds (right now I think it's just embedded in the service def)
- We implement an EC2 instance metadata pseudo-secret store
- That gets wired up so that swarm dynamically reads ECR registry creds as a secret, with that secret provided by the pseudo-secret EC2 instance metadata secrets plugin.
Any guesstimates, rough or otherwise, on when we might be able to see this option in the CloudFormation template?
@friism @kencochrane this looks good to me, thanks for the update, much appreciated!
Hi @friism, do you know when the solution you explained will be available?
@kencochrane, is there any way to update the service without doing a rolling upgrade? I mean, only update the credentials every X hours, but without causing a rolling update of the service. This would be a good workaround while the final solution is developed.
Thanks!
As of right now, I do not know of a way to update the credentials without causing a rolling update of the service.
OK, @kencochrane, thanks! So I suppose we have to wait until the solution explained by @friism is available in EC2 instances and Docker swarm. Meanwhile, we have to do a rolling update after logging in to ECR, because that is the only way to be sure the credentials are spread and new EC2 instances run the service successfully.
@blaketastic2 I also don't want to install aws cli, but when doing
docker exec -it guide-aws sh -c 'aws ecr get-login --region eu-west-1'
I get the following error:
An error occurred (AccessDeniedException) when calling the GetAuthorizationToken operation: User: arn:aws:sts::xxxxxxxxxxxxx:assumed-role/Docker-xxxxx-swarm-ProxyRole-xxxxxxx/i-xxxxxxxxxxxxxxxxxxx is not authorized to perform: ecr:GetAuthorizationToken on resource: *
Did you also do something else to get this working?
thx
@blaketastic2 I already found out; I also had to add the AmazonEC2ContainerRegistryReadOnly policy to the role.
Thanks for the elegant solution. I slightly modified the one-liner to avoid writing a file...
so my cron looks like:
0 */6 * * * eval "$(docker exec -it guide-aws sh -c 'aws ecr get-login --region eu-west-1 --no-include-email'| tr -d '\r')" && docker service update --with-registry-auth main_xxxx
Would that crontab run on moby (as root or docker?), in the shell-aws container (presumably as docker), or elsewhere?
Also, are the credentials tied to the specific service, or would updating a single token-holding service (one that wouldn't be affected by restarts) be sufficient to distribute the credentials?
Very much looking forward to a more elegant and permanent solution to this problem...
@tomalok the guide-aws container is already running; the cronjob specified simply runs the command in that container. Since the container is available to your current user, you can set it up without being root.
@FrenchBen yes, I understand that the aws ecr get-login command runs in the guide-aws container. What I'm after is where crond is running. It's not set up by default in the manager node's shell-aws container. (FWIW, it appears that crond runs in the guide-aws container for doing cleanup, etc.)
If I patch the CloudFormation template (around where the manager and worker instances are having their /etc/docker/daemon.json set up and /home/docker permissions set) to add a crontab entry, presumably it would be the non-containerized crond on the moby instance itself that would be running the ECR auth cron job.
Probably wouldn't be a bad idea to have the CloudFormation template run aws ecr get-login (via guide-aws) to get the credentials in place from the get-go.
Another question is whether or not the credentials would need to be for the docker user, the root user, or both -- for the case where a new worker node comes up and needs to pull an ECR image for a global service.
@tomalok if you're going to modify the template, why not add your own container that gets called in the cronjob? You can bind mount the proper locations for the credential token to be updated. For example, something like: https://hub.docker.com/r/behance/ecr-login/
At the moment, the only alteration I'm making to the CloudFormation template is adding one line to the end of each of the UserData scripts for the manager and worker nodes, which pretty much just pulls an install script down from my S3 bucket and executes it.
wget -qO- http://s3-us-west-2.amazonaws.com/my-bucket/install-refresh-ecr-auth.sh | sh -ex
The install script does the rest of the heavy lifting (a sketch follows this list):
- writes the refresh-ecr-auth.sh script
- executes it once to set initial ECR credentials
- updates the crontab to call the refresh-ecr-auth.sh script once every 8 hours
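Roughly, the install script looks like this (paths and schedule simplified; not the exact script):
#!/bin/sh
# install-refresh-ecr-auth.sh (sketch) -- paths and schedule are simplified.
set -ex
# 1. Write the refresh script (its body is sketched further below).
cat > /home/docker/refresh-ecr-auth.sh <<'EOF'
# ...see the refresh-ecr-auth.sh sketch below...
EOF
chmod +x /home/docker/refresh-ecr-auth.sh
# 2. Run it once so credentials are in place before any service needs them.
/home/docker/refresh-ecr-auth.sh
# 3. Refresh every 8 hours via the instance's crontab.
( crontab -l 2>/dev/null; echo '0 */8 * * * /home/docker/refresh-ecr-auth.sh' ) | crontab -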
The refresh-ecr-auth.sh script (sketched just below) merely:
- gets the docker login command via docker exec guide-aws aws ecr get-login --region $AWS_REGION --no-include-email
- evals the command twice, once for root and once for the docker user: eval $ECR_LOGIN; su docker sh -c "eval $ECR_LOGIN"
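Roughly, refresh-ecr-auth.sh looks like this (again simplified; the region and paths may differ):
#!/bin/sh
# refresh-ecr-auth.sh (sketch) -- region and paths are simplified.
set -e
AWS_REGION=us-west-2
# Ask the already-running guide-aws container for a docker login command.
ECR_LOGIN=$(docker exec guide-aws aws ecr get-login --region $AWS_REGION --no-include-email | tr -d '\r')
# Run it once for root...
eval $ECR_LOGIN
# ...and once for the docker user, whose home is bind mounted into shell-aws.
su docker sh -c "eval $ECR_LOGIN"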
Because the moby instance's /home/docker is already bind mounted inside the shell-aws container, the appropriate credentials are already populated in ~/.docker/config.json when you SSH in to the node.
I'm still hand-adding the AmazonEC2ContainerRegistryReadOnly policy to the Worker and Proxy roles, but I'm assuming that should be relatively straightforward to add to the CloudFormation template, too.
@tomalok Sounds like a good approach - any chance you'd consider sharing the (edited) script via a gist?
Adding the AmazonEC2ContainerRegistryReadOnly policy to the Worker and Proxy roles was rather straightforward, and works nicely.
I'll probably do a little cleanup and make the template/patch, and install script available via S3 somewhere.