yet-another-docker-plugin
Jobs accumulate and get stuck in build-queue
Hi, I have run into an intermittent issue with this plugin. Often our Jenkins jobs will accumulate in the build queue and not do anything. There seems to be nothing in the Jenkins system log at all; nothing gets reported to it. The jobs stay stuck in the build queue, and no containers are provisioned until I manually reboot the master node. On reboot, the plugin reports to the system log that it is provisioning boxes, and then our jobs run. In the meantime the slave nodes are completely empty and ready to receive jobs. The images are already on the boxes and normally do provision, run, and tear down without issue.
What kind of reporting can I set up to better troubleshoot this issue? Having to reboot our Jenkins box every couple of hours is becoming disruptive. I am not sure what could be causing it, because everything else with the plugin works fine.
What version of the plugin are you using?
You can enable verbose logging in Jenkins for specific classes. The Jenkins core class NodeProvisioner handles node provisioning.
Visit https://<your jenkins>/log and configure a logger for it.
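For reference, the /log UI boils down to raising the java.util.logging level for the named classes. A minimal, self-contained Java sketch of the equivalent programmatic configuration (the `ProvisioningLoggers` class name is just illustration; the logger names are the classes discussed in this thread):

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ProvisioningLoggers {
    public static void main(String[] args) {
        String[] names = {
            "hudson.slaves.NodeProvisioner",
            "com.github.kostyasha.yad.DockerProvisioningStrategy",
            "com.github.kostyasha.yad.other.cloudorder.RandomLeastLoadedDockerCloudOrder",
        };
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.ALL); // the handler must also pass FINE/FINER records
        for (String name : names) {
            Logger logger = Logger.getLogger(name);
            logger.setLevel(Level.FINER); // scoped to these classes only, never the root logger
            logger.addHandler(handler);
        }
        System.out.println(Logger.getLogger(names[0]).getLevel());
    }
}
```

Scoping the level to individual loggers is exactly why the warning below matters: FINER on the root logger floods every subsystem, while these three loggers only emit during provisioning decisions.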
WARNING: do not configure verbose logging for all Jenkins classes, or you will crash Jenkins under the processing load and have to manually modify the config to restore functionality.
See also loggers for:
- https://github.com/KostyaSha/yet-another-docker-plugin/blob/master/yet-another-docker-plugin/src/main/java/com/github/kostyasha/yad/DockerProvisioningStrategy.java
- https://github.com/KostyaSha/yet-another-docker-plugin/blob/master/yet-another-docker-plugin/src/main/java/com/github/kostyasha/yad/other/cloudorder/RandomLeastLoadedDockerCloudOrder.java
Do you have more than one docker cloud configured? If so, you should be using the random least loaded provisioning strategy, set in the global setting Docker Cloud Provisioning Strategy Order (the default is "As is").
See also the script console script for detecting stuck label provisioners, for when your instance gets stuck again.
We seem to have a similar issue. The plugin version is 0.1.0-rc48, and the groovy script from the last message doesn't list any stuck labels.
I've enabled the loggers mentioned and the point after which no new nodes get provisioned is:
Queue length 0 is less than the available capacity 0. No provisioning strategy required
Nov 05, 2018 1:39:59 PM FINER hudson.slaves.NodeProvisioner
Queue length 0 is less than the available capacity 0. No provisioning strategy required
Nov 05, 2018 1:39:59 PM FINER hudson.slaves.NodeProvisioner
Consulting hudson.plugins.ec2.NoDelayProvisionerStrategy@1908037b provisioning strategy with state StrategyState{label=tcwg-d05_64-build, snapshot=LoadStatisticsSnapshot{definedExecutors=3, onlineExecutors=3, connectingExecutors=0, busyExecutors=3, idleExecutors=0, availableExecutors=0, queueLength=1}, plannedCapacitySnapshot=0, additionalPlannedCapacity=0}
Nov 05, 2018 1:39:59 PM FINER hudson.slaves.NodeProvisioner
Consulting com.github.kostyasha.yad.DockerProvisioningStrategy@2adf1953 provisioning strategy with state StrategyState{label=tcwg-d05_64-build, snapshot=LoadStatisticsSnapshot{definedExecutors=3, onlineExecutors=3, connectingExecutors=0, busyExecutors=3, idleExecutors=0, availableExecutors=0, queueLength=1}, plannedCapacitySnapshot=0, additionalPlannedCapacity=0}
Nov 05, 2018 1:39:59 PM FINE com.github.kostyasha.yad.DockerProvisioningStrategy apply
Applying provisioning.
Nov 05, 2018 1:39:59 PM FINE com.github.kostyasha.yad.other.cloudorder.RandomLeastLoadedDockerCloudOrder getDockerClouds
Falling back to getting all clouds regardless of availability.
Nov 05, 2018 1:44:51 PM FINE com.github.kostyasha.yad.DockerComputer taskCompleted
Computer DockerComputer{name=aarch64-08-3a6fc6d246a0, slave=DockerSlave{name=aarch64-08-3a6fc6d246a0, containerId=3a6fc6d246a0a8135fe36c0b3314927a8670b9b5a98e0f76b458b6e0bd90fbf9, template=DockerSlaveTemplate[configVersion=1,maxCapacity=1,id=12dade01-4328-4254-87cd-ea4ca1001f1b,labelString=docker-xenial-arm64,launcher=com.github.kostyasha.yad.launcher.DockerComputerSSHLauncher@66e18d50,remoteFs=/home/buildslave,mode=EXCLUSIVE,retentionStrategy=com.github.kostyasha.yad.strategy.DockerOnceRetentionStrategy@27f,numExecutors=1,dockerContainerLifecycle=com.github.kostyasha.yad.DockerContainerLifecycle@24cedcf4[image=linaro/ci-arm64-ubuntu:xenial,pullImage=com.github.kostyasha.yad.commons.DockerPullImage@4fd0fc5c,createContainer=DockerCreateContainer[command=,commands=<null>,entrypoint=<null>,hostname=,dnsHosts=[8.8.8.8],volumes=[/home/buildslave/srv:/home/buildslave/srv:rw , /srv/docker/ssh/buildslave.config:/home/buildslave/.ssh/config:rw , /srv/docker/ssh/buildslave.known_hosts:/home/buildslave/.ssh/known_hosts:rw , /srv/docker/ssh/id_rsa_buildslave.pub:/home/buildslave/.ssh/authorized_keys:ro , /var/lib/libvirt:/var/lib/libvirt:rw , /dev/bus/usb:/dev/bus/usb],volumesFrom=[],environment=[],bindPorts=,bindAllPorts=false,memoryLimit=<null>,cpuShares=<null>,privileged=true,tty=false,macAddress=<null>,extraHosts=[],networkMode=<null>,devices=[],cpusetCpus=,cpusetMems=,links=[],dockerLabels=<null>,shmSize=<null>,restartPolicy=DockerContainerRestartPolicy[policyName=NO,maximumRetryCount=0],workdir=<null>,user=<null>],stopContainer=com.github.kostyasha.yad.commons.DockerStopContainer@27f,removeContainer=com.github.kostyasha.yad.commons.DockerRemoveContainer@5b0f],nodeProperties=[]]}} taskCompleted
Nov 05, 2018 1:44:51 PM FINE com.github.kostyasha.yad.strategy.DockerOnceRetentionStrategy done
terminating aarch64-08-3a6fc6d246a0 since linaro-art-vixlpresubmit-native-armv8 #568 seems to be finished
Nov 05, 2018 1:44:51 PM INFO com.github.kostyasha.yad.DockerSlave _terminate
Requesting disconnect for computer: 'aarch64-08-3a6fc6d246a0'
The end of the log is just completions of jobs that were still running.
In all other cases in the log, "Falling back to getting all clouds regardless of availability." is immediately followed by a "Least loaded randomized DockerCloud:" message. Looking at:
https://github.com/KostyaSha/yet-another-docker-plugin/blob/master/yet-another-docker-plugin/src/main/java/com/github/kostyasha/yad/other/cloudorder/RandomLeastLoadedDockerCloudOrder.java#L38
It appears we may be stuck in getAllDockerClouds(), since I don't see a code path that could skip the "Least loaded randomized DockerCloud:" message.
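The thread dump below points at why that sort can wedge: DockerCloudLoadComparator.compare() calls countCurrentDockerSlaves, a blocking remote Docker call, once per comparison, so a single unresponsive daemon parks the whole sort. A hedged sketch of one mitigation, precomputing each cloud's load once with a bounded wait and sorting on the cached values; `Cloud`, `countSlaves`, and `leastLoadedFirst` are stand-ins for illustration, not the plugin's real API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class CloudSortSketch {
    /** Stand-in for a configured Docker cloud; not the plugin's real type. */
    public static final class Cloud {
        public final String name;
        public Cloud(String name) { this.name = name; }
    }

    /** Stand-in for the blocking ListContainers round-trip; here just a stub. */
    static int countSlaves(Cloud c) {
        return c.name.length(); // real code would query the Docker daemon
    }

    /**
     * Sort clouds least-loaded first WITHOUT issuing remote calls inside
     * Comparator.compare: fetch each load once, with a bounded wait, so one
     * hung connection cannot park the provisioning thread forever.
     */
    public static List<Cloud> leastLoadedFirst(List<Cloud> clouds, ExecutorService pool) {
        final Map<Cloud, Integer> load = new HashMap<>();
        for (Cloud c : clouds) {
            Future<Integer> f = pool.submit(() -> countSlaves(c));
            try {
                load.put(c, f.get(5, TimeUnit.SECONDS)); // bounded wait, not an untimed await
            } catch (Exception e) { // timeout or failure: treat the cloud as full
                f.cancel(true);
                load.put(c, Integer.MAX_VALUE);
            }
        }
        List<Cloud> sorted = new ArrayList<>(clouds);
        sorted.sort(Comparator.comparingInt(load::get)); // pure in-memory comparison
        return sorted;
    }
}
```

With loads cached up front, the comparator itself is trivial and cannot block inside TimSort the way the frames at DockerCloudLoadComparator.compare in the dump show.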
From the Java thread dump, it looks like switching from netty to jersey might help.
"jenkins.util.Timer [#5]" #61 daemon prio=5 os_prio=0 tid=0x00007fc5040df000 nid=0xf89 waiting on condition [0x00007fc4c895b000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000005f5703360> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at com.github.kostyasha.yad_docker_java.com.github.dockerjava.core.async.ResultCallbackTemplate.awaitCompletion(ResultCallbackTemplate.java:92)
at com.github.kostyasha.yad_docker_java.com.github.dockerjava.netty.InvocationBuilder$ResponseCallback.awaitResult(InvocationBuilder.java:60)
at com.github.kostyasha.yad_docker_java.com.github.dockerjava.netty.InvocationBuilder.get(InvocationBuilder.java:189)
at com.github.kostyasha.yad_docker_java.com.github.dockerjava.netty.exec.ListContainersCmdExec.execute(ListContainersCmdExec.java:44)
at com.github.kostyasha.yad_docker_java.com.github.dockerjava.netty.exec.ListContainersCmdExec.execute(ListContainersCmdExec.java:15)
at com.github.kostyasha.yad_docker_java.com.github.dockerjava.netty.exec.AbstrSyncDockerCmdExec.exec(AbstrSyncDockerCmdExec.java:21)
at com.github.kostyasha.yad_docker_java.com.github.dockerjava.core.command.AbstrDockerCmd.exec(AbstrDockerCmd.java:35)
at com.github.kostyasha.yad.DockerCloud.countCurrentDockerSlaves(DockerCloud.java:293)
at com.github.kostyasha.yad.utils.DockerFunctions.countCurrentDockerSlaves(DockerFunctions.java:104)
at com.github.kostyasha.yad.utils.DockerCloudLoadComparator.compare(DockerCloudLoadComparator.java:21)
at com.github.kostyasha.yad.utils.DockerCloudLoadComparator.compare(DockerCloudLoadComparator.java:16)
at java.util.TimSort.binarySort(TimSort.java:296)
at java.util.TimSort.sort(TimSort.java:221)
at java.util.Arrays.sort(Arrays.java:1512)
at java.util.ArrayList.sort(ArrayList.java:1462)
at com.github.kostyasha.yad.other.cloudorder.RandomLeastLoadedDockerCloudOrder.getDockerClouds(RandomLeastLoadedDockerCloudOrder.java:45)
at com.github.kostyasha.yad.DockerProvisioningStrategy.apply(DockerProvisioningStrategy.java:71)
at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- <0x00000005c871d820> (a java.util.concurrent.ThreadPoolExecutor$Worker)
- <0x00000005cdf35040> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
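The parked frame is ResultCallbackTemplate.awaitCompletion(), which blocks on a CountDownLatch with no timeout, so one wedged connection to a Docker daemon pins the jenkins.util.Timer thread and provisioning stops cluster-wide. Independent of a netty-to-jersey switch, a bounded wait would let the strategy give up on an unresponsive cloud. A minimal sketch of the timed-latch pattern (the `TimedAwait` class is illustrative; docker-java also exposes a timed `awaitCompletion(long, TimeUnit)` overload for the same purpose):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class TimedAwait {
    /**
     * Wait for a response latch, but give up after the timeout instead of
     * parking forever the way the untimed await() in the dump does.
     * Returns true if the latch was released in time, false on timeout.
     */
    public static boolean waitForResponse(CountDownLatch completion,
                                          long timeout, TimeUnit unit) {
        try {
            return completion.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }
}
```

A false return gives the caller a chance to skip the unreachable cloud and keep the provisioning loop alive, rather than leaving the Timer thread in WAITING (parking) indefinitely.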