docker-plugin
docker-plugin copied to clipboard
Pull from ECR never completes (no errors)
Summary:
I am unable to pull from ECR - the pull is initiated but never completes and I can't see any errors.
Versions & Error Messages:
- docker-plugin version: 1.2.3 (from pull/827)
- jenkins version: 2.263.3
- docker engine version: 20.10.2
- java version: OpenJDK 1.11.0 (Java 11)
- OS: ubuntu-sever-20.04
- Images: jenkins/inbound-agent with internal CA added to truststore
FROM jenkins/inbound-agent
ENV USR_HOME=/home/jenkins
USER root
RUN mkdir -p "$USR_HOME/.cacerts"
COPY root_ca.crt "$USR_HOME/.cacerts/root_ca.crt"
RUN $JAVA_HOME/bin/keytool -keystore $JAVA_HOME/jre/lib/security/cacerts -storepass changeit \
-import -noprompt -alias internal-ca -file "$USR_HOME/.cacerts/root_ca.crt"
USER ${user}
ENTRYPOINT ["/usr/local/bin/jenkins-agent"]
Stacktraces - there aren't any error messages, just the following:
2021-02-02 13:56:50.732+0000 [id=233] INFO c.n.j.plugins.docker.DockerCloud#canAddProvisionedAgent: Provisioning 'my-aws-account.dkr.ecr.eu-west-2.amazonaws.com/my-org/my-image:latest' on 'my-cloud'; Total containers: 0 (of 100)
2021-02-02 13:56:50.732+0000 [id=233] INFO c.n.j.plugins.docker.DockerCloud#provision: Will provision 'my-aws-account.dkr.ecr.eu-west-2.amazonaws.com/my-org/my-image:latest', for label: 'test-ecr', in cloud: 'my-cloud'
2021-02-02 13:56:50.734+0000 [id=233] INFO h.s.NodeProvisioner$StandardStrategyImpl#apply: Started provisioning Image of my-aws-account.dkr.ecr.eu-west-2.amazonaws.com/my-org/my-image:latest from my-cloud with 1 executors. Remaining excess workload: 0
2021-02-02 13:56:50.735+0000 [id=234] INFO c.n.j.p.docker.DockerTemplate#pullImage: Pulling image 'my-aws-account.dkr.ecr.eu-west-2.amazonaws.com/my-org/my-image:latest'. This may take awhile...
2021-02-02 13:56:50.742+0000 [id=234] INFO i.j.docker.client.DockerAPI#getOrMakeClient: Cached connection io.jenkins.docker.client.DockerAPI$SharableDockerClient@8860c1b to DockerClientParameters{dockerUri=tcp://my-cloud:2376, credentialsId=my-creds, readTimeoutInMsOrNull=300000, connectTimeoutInMsOrNull=60000}
2021-02-02 13:59:21.149+0000 [id=236] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$0: Started DockerContainerWatchdog Asynchronous Periodic Work
2021-02-02 13:59:21.150+0000 [id=236] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
2021-02-02 13:59:21.151+0000 [id=236] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 3, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 1, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 55 ms, Average runtime of container retrieval: 53 ms
2021-02-02 13:59:21.151+0000 [id=236] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 0 nodes assigned to this Jenkins instance, which we will check
2021-02-02 13:59:21.151+0000 [id=236] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Checking Docker Cloud my-cloud at tcp://my-creds:2376
my-cloud-02-02 13:59:21.198+0000 [id=236] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
2021-02-02 13:59:21.199+0000 [id=236] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$0: Finished DockerContainerWatchdog Asynchronous Periodic Work. 49 ms
2021-02-02 14:02:20.843+0000 [id=237] INFO hudson.model.AsyncAperiodicWork#lambda$doAperiodicRun$0: Started Update IdP Metadata from URL PeriodicWork
2021-02-02 14:02:20.846+0000 [id=237] INFO hudson.model.AsyncAperiodicWork#lambda$doAperiodicRun$0: Finished Update IdP Metadata from URL PeriodicWork. 1 ms
2021-02-02 14:04:21.150+0000 [id=238] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$0: Started DockerContainerWatchdog Asynchronous Periodic Work
2021-02-02 14:04:21.151+0000 [id=238] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
2021-02-02 14:04:21.152+0000 [id=238] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 4, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 1, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 53 ms, Average runtime of container retrieval: 51 ms
2021-02-02 14:04:21.152+0000 [id=238] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 0 nodes assigned to this Jenkins instance, which we will check
2021-02-02 14:04:21.152+0000 [id=238] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Checking Docker Cloud my-cloud at tcp://my-cloud:2376
2021-02-02 14:04:21.153+0000 [id=238] INFO i.j.docker.client.DockerAPI$1#entryDroppedFromCache: Dropped connection io.jenkins.docker.client.DockerAPI$SharableDockerClient@8860c1b to DockerClientParameters{dockerUri=tcp://my-cloud:2376, credentialsId=my-creds readTimeoutInMsOrNull=300000, connectTimeoutInMsOrNull=60000}
2021-02-02 14:04:21.200+0000 [id=238] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
As you can see, the pull took over 20 minutes without completing. Pulling the image with the docker cli on my-cloud completed in under 20 seconds. Pulls of images on public DockerHub (no credentials) do succeed.
I'm using the Amazon ECR Credentials plugin (1.6).
Request
It would be great if I could debug this in some way - it's frustrating to not know whats going wrong. I understand that I'm using an un-released/un-approved build but it does seem ulikely that anything in that PR caused this issue. So I guess What I'd like to understand:
- Is this a known issue with a known fix?
- Where can I get more information on what's going wrong to give me a fighting chance of fixing it?
As always, thanks for your work on this plugin - it's much appreciated.
I'm fairly sure someone else raised an issue w.r.t. Amazon before. I've not used it myself, but I think Amaxon doesn't support any standard authentication methods and instead you have to have some extra (non standard) code added into everything you want to talk to it. e.g. there's some extra hook code required to allow the docker cli to talk to Amazon. Thing is, the docker-plugin doesn't use the docker cli so it can't use that Amazon hook code thing. The docker-plugin uses docker-java as its docker client so, unless there's a way of getting docker-java to do the non-standard Amazon-specific authentication stuff, it's not going to work. ...and that Amazon plugin looks unmaintained, so you're likely to be "on your own" debugging that (on the upside, there's a Jenkins process where you can get control of an abandoned plugin, so if you can't get anyone else to merge a PR into it for you, you can take it over and DIY)
I'm not familiar with what the Amazon ecr plugin does, but I do know that this plugin doesn't know about it, so, unless it's doing some extension magic and pushing itself into places that weren't expecting it, it's presence is probably not having much effect at all. My guess is that it'll only affect the docker workflow plugin (which does use the docker cli) and everything else ignores it.
w.r.t. debugging Unfortunately, errors coming back from pull requests seem to disappear. https://github.com/jenkinsci/docker-plugin/blob/7d4658ebe85c95da17a0f642041c2e673db84f24/src/main/java/com/nirima/jenkins/plugins/docker/DockerTemplate.java#L648 is where it's called. With the code as-is, a failed pull request is only detected when the code goes on to attempt to start a container based on the image it just failed to pull. A (small) PR that makes those errors visible in the docker logs would be very welcome and could be merged quickly so you'd get a new build of #827 containing both.
Thanks @pjdarton. I couldn't find anything about ECR in the docs and you're absolutely right - you have to generate a registry token via an AWS API and that token only lives for 12 hours. I did look through the old issues in this repository and found this which seemed to suggest it should work as it is included in the docker-commons-plugin. It seems as though this is the case in the UI, as Jenkins suggests AWS ECR credentials and not others.
I also took a look at trying to get better logging for pull failures. Based on code I lifted from Yet Another Docker, I added this:
diff --git a/src/main/java/com/nirima/jenkins/plugins/docker/DockerTemplate.java b/src/main/java/com/nirima/jenkins/plugins/docker/DockerTemplate.java
index 539e35e..2f5d250 100644
--- a/src/main/java/com/nirima/jenkins/plugins/docker/DockerTemplate.java
+++ b/src/main/java/com/nirima/jenkins/plugins/docker/DockerTemplate.java
@@ -9,7 +9,7 @@ import com.github.dockerjava.api.exception.NotFoundException;
import com.github.dockerjava.api.model.ContainerConfig;
import com.github.dockerjava.api.model.PortBinding;
import com.github.dockerjava.api.model.PullResponseItem;
-import com.github.dockerjava.core.command.PullImageResultCallback;
+import com.github.dockerjava.api.command.PullImageResultCallback;
import com.google.common.base.Strings;
import com.nirima.jenkins.plugins.docker.launcher.DockerComputerLauncher;
import com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy;
@@ -46,6 +46,7 @@ import org.slf4j.LoggerFactory;
import javax.annotation.CheckForNull;
import javax.annotation.Nonnull;
import java.io.IOException;
+import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
@@ -645,13 +646,35 @@ public class DockerTemplate implements Describable<DockerTemplate> {
final PullImageCmd cmd = client.pullImageCmd(image);
final DockerRegistryEndpoint registry = getRegistry();
DockerCloud.setRegistryAuthentication(cmd, registry, Jenkins.getInstance());
- cmd.exec(new PullImageResultCallback() {
- @Override
- public void onNext(PullResponseItem item) {
- super.onNext(item);
- listener.getLogger().println(item.getStatus());
+ try {
+ cmd.exec(new PullImageResultCallback() {
+ @Override
+ public void onNext(PullResponseItem item) {
+ super.onNext(item);
+ PrintStream ps = listener.getLogger();
+ if (item.getId() != null) {
+ ps.print(item.getId());
+ ps.print(": ");
+ }
+ ps.println(item.getStatus());
+ if (item.getErrorDetail() != null) {
+ ps.println(item.getErrorDetail().toString());
+ }
+ }
+ }).awaitCompletion();
+ } catch(DockerClientException dce) {
+ String exMsg = dce.getMessage();
+ if (exMsg.contains("Could not pull image: Digest:") ||
+ exMsg.contains(": downloaded")) {
+ try {
+ client.inspectImageCmd(image).exec();
+ } catch (NotFoundException ignored) {
+ throw dce;
+ }
+ } else {
+ throw dce;
+ }
}
- }).awaitCompletion();
}
long pullTime = System.currentTimeMillis() - startTime;
However I still see nothing in the jenkins logs. Once the plugin has started a pull that never completes for a given node tag it also never seems to try again - i.e. although the pull timeout is set to 300s (5m), it never actually times out, even if you wait for 10 minutes.
In fact, going crazy (and very messy) with the attempts to catch any error at all, I tried this:
diff --git a/src/main/java/com/nirima/jenkins/plugins/docker/DockerTemplate.java b/src/main/java/com/nirima/jenkins/plugins/docker/DockerTemplate.java
index 539e35e..0557168 100644
--- a/src/main/java/com/nirima/jenkins/plugins/docker/DockerTemplate.java
+++ b/src/main/java/com/nirima/jenkins/plugins/docker/DockerTemplate.java
@@ -9,7 +9,7 @@ import com.github.dockerjava.api.exception.NotFoundException;
import com.github.dockerjava.api.model.ContainerConfig;
import com.github.dockerjava.api.model.PortBinding;
import com.github.dockerjava.api.model.PullResponseItem;
-import com.github.dockerjava.core.command.PullImageResultCallback;
+import com.github.dockerjava.api.command.PullImageResultCallback;
import com.google.common.base.Strings;
import com.nirima.jenkins.plugins.docker.launcher.DockerComputerLauncher;
import com.nirima.jenkins.plugins.docker.strategy.DockerOnceRetentionStrategy;
@@ -46,6 +46,7 @@ import org.slf4j.LoggerFactory;
import javax.annotation.CheckForNull;
import javax.annotation.Nonnull;
import java.io.IOException;
+import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
@@ -55,6 +56,7 @@ import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
+import java.util.concurrent.TimeUnit;
import static com.nirima.jenkins.plugins.docker.utils.JenkinsUtils.bldToString;
import static com.nirima.jenkins.plugins.docker.utils.JenkinsUtils.endToString;
@@ -643,15 +645,50 @@ public class DockerTemplate implements Describable<DockerTemplate> {
try(final DockerClient client = api.getClient(pullTimeout)) {
final PullImageCmd cmd = client.pullImageCmd(image);
+ LOGGER.info("Acquiring registry to execute pull command for {}", image);
final DockerRegistryEndpoint registry = getRegistry();
DockerCloud.setRegistryAuthentication(cmd, registry, Jenkins.getInstance());
- cmd.exec(new PullImageResultCallback() {
- @Override
- public void onNext(PullResponseItem item) {
- super.onNext(item);
- listener.getLogger().println(item.getStatus());
+ LOGGER.info("Authentication set, executing pull command for {} with timeout {}", image, pullTimeout);
+ try {
+ boolean completed = cmd.exec(new PullImageResultCallback() {
+ @Override
+ public void onNext(PullResponseItem item) {
+ try {
+ super.onNext(item);
+ PrintStream ps = listener.getLogger();
+ if (item.getId() != null) {
+ ps.print(item.getId());
+ ps.print(": ");
+ LOGGER.info(item.getId() + ": " + item.getStatus());
+ }
+ ps.println(item.getStatus());
+ LOGGER.info(item.getStatus());
+ if (item.getErrorDetail() != null) {
+ ps.println(item.getErrorDetail().toString());
+ LOGGER.error(item.getErrorDetail().toString());
+ }
+ } catch(Throwable t) {
+ t.printStackTrace();
+ LOGGER.error("pull error", t);
+ }
+ }
+ }).awaitCompletion(pullTimeout + 10, TimeUnit.SECONDS);
+ LOGGER.info("Stopped pulling {}, completed: {}", image, completed);
+ if(!completed) {
+ throw new DockerClientException("pull of " + image + "failed to complete");
+ }
+ } catch(DockerClientException dce) {
+ String exMsg = dce.getMessage();
+ if (exMsg.contains("Could not pull image: Digest:") || exMsg.contains(": downloaded")) {
+ try {
+ client.inspectImageCmd(image).exec();
+ } catch (NotFoundException ignored) {
+ throw dce;
+ }
+ } else {
+ throw dce;
+ }
}
- }).awaitCompletion();
}
long pullTime = System.currentTimeMillis() - startTime;
And see the following in the logs:
2021-02-03 12:34:34.689+0000 [id=76] INFO c.n.j.plugins.docker.DockerCloud#canAddProvisionedAgent: Provisioning 'my-aws-account.dkr.ecr.eu-west-2.amazonaws.com/my-org/my-repo:latest' on 'my-cloud'; Total containers: 0 (of 100)
2021-02-03 12:34:34.690+0000 [id=76] INFO c.n.j.plugins.docker.DockerCloud#provision: Will provision 'my-aws-account.dkr.ecr.eu-west-2.amazonaws.com/my-org/my-repo:latest', for label: 'test-ecr', in cloud: 'my-cloud'
2021-02-03 12:34:34.690+0000 [id=76] INFO h.s.NodeProvisioner$StandardStrategyImpl#apply: Started provisioning Image of my-aws-account.dkr.ecr.eu-west-2.amazonaws.com/my-org/my-repo:latest from my-cloud with 1 executors. Remaining excess workload: 0
2021-02-03 12:34:34.691+0000 [id=74] INFO c.n.j.p.docker.DockerTemplate#pullImage: Pulling image 'my-aws-account.dkr.ecr.eu-west-2.amazonaws.com/my-org/my-repo:latest'. This may take awhile...
2021-02-03 12:34:34.692+0000 [id=74] INFO c.n.j.p.docker.DockerTemplate#pullImage: Acquiring registry to execute pull command for my-aws-account.dkr.ecr.eu-west-2.amazonaws.com/my-org/my-repo:latest
2021-02-03 12:34:35.591+0000 [id=74] INFO c.n.j.p.docker.DockerTemplate#pullImage: Authentication set, executing pull command for my-aws-account.dkr.ecr.eu-west-2.amazonaws.com/my-org/my-repo:latest with timeout 300
2021-02-03 12:36:02.142+0000 [id=82] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$0: Started DockerContainerWatchdog Asynchronous Periodic Work
2021-02-03 12:36:02.143+0000 [id=82] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
2021-02-03 12:36:02.144+0000 [id=82] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 0, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
2021-02-03 12:36:02.146+0000 [id=82] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 0 nodes assigned to this Jenkins instance, which we will check
2021-02-03 12:36:02.146+0000 [id=82] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Checking Docker Cloud my-cloud at tcp://my-cloud:2376
2021-02-03 12:36:02.232+0000 [id=82] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
2021-02-03 12:36:02.233+0000 [id=82] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$0: Finished DockerContainerWatchdog Asynchronous Periodic Work. 90 ms
2021-02-03 12:40:20.958+0000 [id=84] INFO hudson.model.AsyncAperiodicWork#lambda$doAperiodicRun$0: Started Update IdP Metadata from URL PeriodicWork
2021-02-03 12:40:20.960+0000 [id=84] INFO hudson.model.AsyncAperiodicWork#lambda$doAperiodicRun$0: Finished Update IdP Metadata from URL PeriodicWork. 0 ms
2021-02-03 12:41:02.142+0000 [id=85] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$0: Started DockerContainerWatchdog Asynchronous Periodic Work
2021-02-03 12:41:02.144+0000 [id=85] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
As you can see, the last message from this pull attempt is at 12:34:35, and states it will timeout after 300s (5m). But by 12:41:02 it still hasn't timed out and I see nothing in the logs - nothing from inside the callback, nothing from after the callback, and no stack traces. I even tried using both the logger inside the callback and the static logger in the DockerTemplate class just in case one wasn't actually being written to the logs. I also added a time out to awaitCompletion
, but this never seems to get triggered. Am I going crazy/doing something completely wrong?
More interesting info - I've added a log line to print out the actual authentication (just before cmd.exec
):
String authCfg = "";
if(cmd.getAuthConfig() != null) {
authCfg = cmd.getAuthConfig().toString();
}
LOGGER.info("Authentication set to {}, executing pull command for {} with timeout {}", authCfg, image, pullTimeout);
And with this I see that the AWS ECR plugin really does seem to be working - the credentials are correctly set (username=AWS, password=ecr-jwt). I copied out the JWT and used it to succesfully login:
$ docker pull my-account.dkr.ecr.eu-west-2.amazonaws.com/my-org/my-repo:latest
Error response from daemon: Head https://my-account.dkr.ecr.eu-west-2.amazonaws.com/v2/my-org/my-repo/manifests/latest: no basic auth credentials
$ echo 'my-jwt' | docker login --username AWS --password-stdin my-account.dkr.ecr.eu-west-2.amazonaws.com`
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
$ docker pull my-account.dkr.ecr.eu-west-2.amazonaws.com/my-org/my-repo:latest
latest: Pulling from my-org/my-repo
...
Status: Downloaded newer image for my-account.dkr.ecr.eu-west-2.amazonaws.com/my-org/my-repo:latest
At this stage I am starting to wonder if something more subtle is broken in my previous patch to upgrade to docker-java 3.2.2? It definitely seems like functionality that should work.
Ok, I disabled the TLS verification that caused the issue in #825 and then installed the plugins fresh from jenkins ci and this all works. So, somehow, #827 has broken this. I just wish I knew how and what the right way of juggling these dependencies is.
Probably i have a similar issue described here.