docker stack is stuck in `NEW` current state
Expected behavior
docker stack deploy --compose-file files/docker-compose-traefik.yml --with-registry-auth traefik
Updating service traefik_network_private (id: XXX)
docker service ps traefik_network_private
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
SOMEID traefik_network_private.SOMEID traefik:v1.3.5 ip-172-28-64-168 Running Running
Actual behavior
docker service ps traefik_network_private
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
SOMEID traefik_network_private.SOMEID traefik:v1.3.5 ip-172-28-64-168 Running New 17 minutes ago
Some services have been in this state for 6 days already.
Information
- Full output of the diagnostics from "docker-diagnose" run from one of the instances
Your diagnostics session ID is 1506018294-iQQFhMcHc2lOMwqwabb9IkfEUVOjNW5P
Please provide this session ID to the maintainer debugging your issue.
I faced this issue during a routing stack upgrade. It turned out that it affects 3 different clusters.
The one I'm sending the diagnostics ID from hasn't been touched for at least 10 days.
The other two clusters:
Your diagnostics session ID is 1506018755-6hdDn56RfdZQCZNw5bKplDtiJK4O8IGm
Your diagnostics session ID is 1506018835-aModhOmVGcplyi00JAqI0phOrXlhS3VE
At least one weird thing was spotted, common to 2 out of 3 clusters: some worker nodes were terminated by the ASG/ELB, so new nodes appeared at some point. However,
~ $ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
e1jgf187kja7f6ksux2bdv72z * ip-172-28-144-81 Ready Active Reachable
io608qnh758fcfug8al7l157d ip-172-28-17-174 Ready Active
jxpg1pf6pv2omulnv7u2u8fh7 ip-172-28-8-233 Down Active
k19z4ogy74ce7eqyyw7tfofs8 ip-172-28-6-227 Down Active
obyelsom9w6sjb0rojbz4clgp ip-172-28-6-176 Ready Active Reachable
rqoziumjk5ci2yfo89wx51cei ip-172-28-21-131 Down Active
sn5bvlq3inogjkm6prasx5tz9 ip-172-28-153-102 Down Active
w171y8od74jor4cahsm1agnyq ip-172-28-66-9 Ready Active
yuzesd5gxvzny0bs14fcbxf16 ip-172-28-64-168 Ready Active Leader
still shows some nodes with Down status. I can confirm that these nodes no longer exist.
I had already cleaned up those Down nodes on one of the clusters; it didn't help.
How can I resolve this?
@netflash what version are you on?
~ $ docker version
Client:
Version: 17.06.0-ce
API version: 1.30
Go version: go1.8.3
Git commit: 02c1d87
Built: Fri Jun 23 21:15:15 2017
OS/Arch: linux/amd64
Server:
Version: 17.06.0-ce
API version: 1.30 (minimum version 1.12)
Go version: go1.8.3
Git commit: 02c1d87
Built: Fri Jun 23 21:51:55 2017
OS/Arch: linux/amd64
Experimental: true
@netflash to get rid of the nodes marked Down, you can issue a docker node rm from the manager to remove them from the swarm. We will enhance things a bit in the near future to clean up Down worker nodes automatically.
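For example, a possible cleanup (a sketch, using the Down node IDs from the docker node ls output above, run from a manager) would be something like:
docker node rm jxpg1pf6pv2omulnv7u2u8fh7 k19z4ogy74ce7eqyyw7tfofs8 rqoziumjk5ci2yfo89wx51cei sn5bvlq3inogjkm6prasx5tz9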
@ddebroy I know how to get rid of Down nodes.
What I don't know is how to get my stacks, which are stuck in the NEW state, working again.
@netflash does the above compose file start up fine in other stacks? Can you share the docker-compose-traefik.yml file? I see that docker did not create any replicas at all, and there are messages like: Could not find network sandbox for container traefik_military_com_private.1.tdgdxlioqwjowrsbsj2m7xj1i on service binding deactivation request
@ddebroy
Here it is - https://gist.github.com/netflash/a81d1deadfa6aa2467e9d03090cd401e.
However, I'm not sure if it helps: I have spotted heaps of other services in the same NEW state, and some in the Pending state. Some of them had already been deployed and were working. I assume they got into this state after the host node was terminated.
The network was created with this command:
docker network create --driver overlay military_private
By the way, where did you get these messages from? I'm asking because I haven't seen them anywhere.
@netflash I was just going through the docker engine logs from your diagnostics uploads earlier.
From the logs for traefik_military_com_private I found the service ID is tozjaj6xh6j3.
docker service ls.stdout:tozjaj6xh6j3 traefik_military_com_private global 0/3 traefik:v1.3.5 *:80->80/tcp,*:8080->8080/tcp
Next I found these errors when swarm tried to schedule it:
Sep 21 18:04:48 moby root: time="2017-09-21T18:04:48.942840700Z" level=error msg="Failed allocation for service tozjaj6xh6j3l7ih0yt30ide1" error="could not find an available IP while allocating VIP" module=node node.id=yuzesd5gxvzny0bs14fcbxf16
Sep 21 18:04:48 moby root: time="2017-09-21T18:04:48.947751593Z" level=error msg="task allocation failure" error="service tozjaj6xh6j3l7ih0yt30ide1 to which this task 9jwkusl2h3icsiud66egs403w belongs has pending allocations" module=node node.id=yuzesd5gxvzny0bs14fcbxf16
Sep 21 18:04:48 moby root: time="2017-09-21T18:04:48.947787725Z" level=error msg="task allocation failure" error="service tozjaj6xh6j3l7ih0yt30ide1 to which this task o9lydox2cmmmxf930ppwmxer9 belongs has pending allocations" module=node node.id=yuzesd5gxvzny0bs14fcbxf16
Sep 21 18:04:48 moby root: time="2017-09-21T18:04:48.947805148Z" level=error msg="task allocation failure" error="service tozjaj6xh6j3l7ih0yt30ide1 to which this task eud3ifr8flr1nhb8rxbo8radp belongs has pending allocations" module=node node.id=yuzesd5gxvzny0bs14fcbxf16
Sep 21 18:21:16 moby root: time="2017-09-21T18:21:16.262808272Z" level=error msg="Failed allocation during update of service tozjaj6xh6j3l7ih0yt30ide1" error="could not find an available IP while allocating VIP" module=node node.id=yuzesd5gxvzny0bs14fcbxf16
Sep 21 18:21:16 moby root: time="2017-09-21T18:21:16.262870132Z" level=debug msg="Failed allocation of unallocated service tozjaj6xh6j3l7ih0yt30ide1" error="could not find an available IP while allocating VIP" module=node node.id=yuzesd5gxvzny0bs14fcbxf16
Sep 21 18:21:16 moby root: time="2017-09-21T18:21:16.266973064Z" level=debug msg="task allocation failure" error="service tozjaj6xh6j3l7ih0yt30ide1 to which this task o9lydox2cmmmxf930ppwmxer9 belongs has pending allocations" module=node node.id=yuzesd5gxvzny0bs14fcbxf16
I see similar issues with VIP allocation for your other services too.
Sep 21 17:52:19 moby root: time="2017-09-21T17:52:19.108685643Z" level=error msg="Failed allocation for service dmm9978glgz6vx2uamzmw8nbo" error="could not find an available IP while allocating VIP" module=node node.id=yuzesd5gxvzny0bs14fcbxf16
I am not sure exactly how to debug this scenario. @mavenugo should be able to point to the next steps/right person to debug this further and check whether some cleanup is not happening when nodes go down.
@ddebroy thank you. I want to mention that if it would help to find the root cause and fix it, I can leave one cluster in this state for about a week. As for the other two, I will probably re-deploy them on Monday.
@ddebroy Thank you very much
You have most likely run out of IP addresses on that network. Swarm networks default to /24. Each task gets an IP and each service gets a VIP. Do you have more than 255 ($number_tasks + $number_services) deployed?
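For reference, one way around the /24 default (a sketch only; the 10.20.0.0/16 subnet is just an example, and any services attached to the network would need to be removed or updated first) is to recreate the overlay network with an explicit, larger subnet:
docker network rm military_private
docker network create --driver overlay --subnet 10.20.0.0/16 military_private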
No, I have fewer than that.
docker network inspect
shows just 3 IPv4 addresses in use.
Actually, the number of peers in this network is less than the number of nodes. And when I executed the same command on the "absent" node, it showed 0 peers and 0 addresses occupied.
Also, this node has been up for only about 5 hours:
~ $ uptime
23:32:19 up 5:40, load average: 0.08, 0.06, 0.00
And these lines look weird
Sep 22 17:51:30 moby root: time="2017-09-22T17:51:30.865339927Z" level=warning msg="Running modprobe nf_nat failed with message: `modprobe: module nf_nat not found in modules.dep`, error: exit status 1"
Sep 22 17:51:30 moby root: time="2017-09-22T17:51:30.866045330Z" level=warning msg="Running modprobe xt_conntrack failed with message: `modprobe: module xt_conntrack not found in modules.dep`, error: exit status 1"
Sep 22 17:51:30 moby root: time="2017-09-22T17:51:30.866111230Z" level=debug msg="Fail to initialize firewalld: Failed to connect to D-Bus system bus: dial unix /var/run/dbus/system_bus_socket: connect: no such file or directory, using raw iptables instead"
docker network inspect unfortunately only shows endpoints running on the local host. Using the network control plane (inspect --verbose) as a guide, the number of IPs allocated on the network can be counted as follows from any host with a container attached to the affected network:
docker network inspect --verbose --format '{{range .Services}}{{printf "%s\n" .VIP}}{{range .Tasks}}{{printf "%s\n" .EndpointIP}}{{end}}{{end}}' $NETWORK_NAME |grep -v '^$' |wc -l
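To put that count in context, the configured subnet (and hence the size of the address pool) can be read from the same network; this format string is only a sketch:
docker network inspect --format '{{range .IPAM.Config}}{{.Subnet}} {{end}}' $NETWORK_NAME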
I have run that and got 8, which totally makes sense, as almost every service is stuck in the "NEW" state. Plus, I have fewer than 10 stacks deployed. @trapier
@ddebroy sorry for poking an old corpse, but could you tell me more about that auto-cleanup of Down nodes? Was it implemented? Is there any documentation about it?
Closing the issue, as I never got back to the broken cluster and have recreated the other two. Going to redeploy the broken cluster too.
Is there any way to clean up allocated IPs, or remove these?
@m
Is there any way to clean up allocated IPs, or remove these?
I ran into the same issue. Can I clean up the old addresses?
Still waiting for an update on this.
@trapier Thank you very much! I have run out of IP addresses on my network, and solved the problem by setting the network to /16.
Thank you very much! I have run out of IP addresses on my network, and solved the problem by setting the network to /16.
@trapier Did it really solve the problem or does it just "postpone" the problem so that it will occur after a longer time?
@straurob It has to be a postponement of the problem, but it might work for him if he doesn't run so many containers that they use up 65,536 IPs.
I hit this issue again today. Can anyone please share how the old IPs can be cleaned up?
@netflash Does it make sense to re-open this issue as seems like a lot of people are stuck on this?
@akki I'm not with Docker Inc, so I can't say if they want to reopen it or not. Personally, I moved away from Docker Swarm ages ago.
@netflash Ohh sorry, I thought the author of the issue had permission to re-open it.
You were right.
Thanks.