
init-aws failing in some commands

Open mwaaas opened this issue 8 years ago • 44 comments

The init-aws command is failing in the following cases:

  1. It can't connect to the manager but goes ahead and tries to join the swarm anyway:

NODE: Can't connect to primary manager, sleep and try again
SWARM_STATE=inactive
Get Primary Manager IP
MANAGER_IP=172.31.1.98
"docker swarm join" requires exactly 1 argument(s).
See 'docker swarm join --help'.

  2. Error response from daemon: Timeout was reached before node was joined. The attempt to join the swarm will continue in the background. Use the "docker info" command to see the current swarm status of your node.

init-aws assumes the node will be added in the background, but in some cases it is never added.

It would be really great if we could get access to the code so we can help contribute.

mwaaas avatar Apr 08 '17 14:04 mwaaas

Thanks for reporting the problem. Can you please give us the following information:

  • which version of Docker for AWS are you using?
  • which region?
  • what is your configuration (# of managers, workers, etc.)?
  • are you adding to your own VPC, or is Docker for AWS building it for you?
  • have you changed the CloudFormation template at all? If so, what did you change?

kencochrane avatar Apr 09 '17 11:04 kencochrane

@kencochrane

  • Version: aws-v17.03.1-ce-aws1
  • Region: Ireland
  • Configuration: 3 managers and 5 workers
  • VPC: Docker for AWS is building it
  • CloudFormation template: haven't changed anything

Experiencing the following errors:

  1. The docker swarm join command times out and the node is unable to join the swarm cluster. Here are the logs for this case: https://gist.github.com/mwaaas/a19251af8ef0c99b80c4e90c98e77cdf

  2. The docker swarm join command fails because it's being called without a manager token. Here are the logs for this case: https://gist.github.com/mwaaas/a2277a98944df09f22096925f67ee170

  3. When an instance is terminated due to the health check, it seems it does not leave the swarm cluster. How do you handle an instance leaving the swarm cluster? I can't find the code doing that in the CloudFormation template.

NB: The issue happens when one instance is terminated and the autoscaling group tries to maintain the desired state by launching a new instance; that instance then fails to join the swarm.

Quick questions:

  1. I have noticed DynamoDB is being used for storing the primary manager IP. What happens if the manager (leader) goes down? I haven't seen any logic that updates the primary manager IP in DynamoDB.
  2. If one node goes down, it will be terminated, but I can't find the logic that removes the node from the cluster. And from what I have seen, the node won't be removed from the cluster (it will just be marked as "Down" or "Unreachable").
  3. I am seeing the Elastic Load Balancer checking the health of instances on port 44554. I have done a curl on that port and it returns "LGTM". What application is running on that port, and how is it started on the instances?
  4. The Docker containers that orchestrate the swarm cluster (e.g. init-aws) are public images, so one can easily get the source code. The question is: why has Docker not yet open-sourced the code?

mwaaas avatar Apr 10 '17 06:04 mwaaas

@mwaaas thank you for all of the information and questions.

NB: The issue happens when one instance is terminated and the autoscaling group tries to maintain the desired state by launching a new instance; that instance then fails to join the swarm.

OK, I think I know what is going on. What is happening is that the primary goes away, and the IP address in dynamodb hasn't been updated yet, so the new node that is coming online isn't able to connect. It should eventually be able to connect once the primary IP is updated when a new leader is elected. The job that updates the IP address currently runs every 4 minutes.
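
To make that concrete, the new node's side of this boils down to something like the sketch below. This is only an illustration of the flow, not the actual init-aws code; the table name, key layout, token endpoint, and timings are placeholders I made up.

#!/bin/sh
# Illustrative sketch of the join retry loop (not the shipped init-aws script).
# TABLE_NAME, the key shape, and the token endpoint are assumptions.
TABLE_NAME=swarm-cluster-info
REGION=eu-west-1

while true; do
    # read the primary manager IP that the leader last wrote to DynamoDB
    MANAGER_IP=$(aws dynamodb get-item \
        --region "$REGION" \
        --table-name "$TABLE_NAME" \
        --key '{"node_type": {"S": "primary_manager"}}' \
        --query 'Item.ip.S' --output text)

    # ask the primary manager for a join token (endpoint is a placeholder)
    TOKEN=$(wget -qO- "http://$MANAGER_IP:9024/token/worker/" 2>/dev/null || true)

    if [ -n "$TOKEN" ]; then
        docker swarm join --token "$TOKEN" "$MANAGER_IP:2377" && break
    fi

    echo "Primary manager not ready yet, sleeping before retry"
    sleep 60
done

Until the refresh job writes the new leader's IP, a loop like that keeps hitting the stale address, which is why the join can't succeed right away.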

Does the autoscaling group ever get back to the desired number of managers, or is it always stuck with too few?

How are the instances terminated? Are you doing it manually, or are they crashing?

I have noticed DynamoDB is being used for storing the primary manager IP. What happens if the manager (leader) goes down? I haven't seen any logic that updates the primary manager IP in DynamoDB.

There is a process that works behind the scenes that will update the primary manager IP in DynamoDB when a new leader is elected.

If one node goes down, it will be terminated, but I can't find the logic that removes the node from the cluster. And from what I have seen, the node won't be removed from the cluster (it will just be marked as "Down" or "Unreachable").

There currently isn't an automatic removal of this node from the swarm, and it is a manual step.

I have recently made a fix, which should be available in 17.05-ce edge, that automatically cleans up nodes that are marked as down or unreachable and have in fact been terminated in EC2. If the fix looks good, it will be included in 17.06 CE stable.
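
Conceptually the cleanup is along these lines; this is only a rough sketch of the idea, not the actual fix, and the EC2 lookup by private DNS name is my own assumption.

#!/bin/sh
# Rough sketch: remove swarm nodes whose EC2 instances no longer exist.
# Not the shipped implementation; matching by private DNS name is an assumption.
for NODE_ID in $(docker node ls --format '{{.ID}}'); do
    STATE=$(docker node inspect "$NODE_ID" --format '{{.Status.State}}')
    [ "$STATE" = "down" ] || continue

    HOSTNAME=$(docker node inspect "$NODE_ID" --format '{{.Description.Hostname}}')

    # if no pending/running instance matches this hostname, assume it was terminated
    COUNT=$(aws ec2 describe-instances \
        --filters "Name=private-dns-name,Values=$HOSTNAME" \
                  "Name=instance-state-name,Values=pending,running" \
        --query 'length(Reservations[].Instances[])' --output text)

    if [ "$COUNT" = "0" ]; then
        docker node demote "$NODE_ID" 2>/dev/null || true   # managers must be demoted first
        docker node rm "$NODE_ID"
    fi
done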

I am seeing the Elastic Load Balancer checking the health of instances on port 44554. I have done a curl on that port and it returns "LGTM". What application is running on that port, and how is it started on the instances?

There is a diagnostic service running on the node that responds to this request. It is built into our underlying operating system.
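
You can check it yourself from any node if you are curious (as you already did with curl):

~ $ curl http://<node-private-ip>:44554/
LGTM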

The Docker containers that orchestrate the swarm cluster (e.g. init-aws) are public images, so one can easily get the source code. The question is: why has Docker not yet open-sourced the code?

There are a lot of reasons, but the biggest one is that we aren't ready yet.

kencochrane avatar Apr 10 '17 18:04 kencochrane

I have a similar issue. After destroying the master/leader node and waiting for about 15 minutes, I checked the DynamoDB table - nothing changed there; it still has the IP of the destroyed leader. I SSH'd to another master and docker node ls shows me this:

~ $ docker node ls
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded

The newly created master node shows this:

#================
Get Primary Manager IP
MANAGER_IP=10.10.130.150
 It's a Manager, run setup
MANAGER_TOKEN=
Setup Manager
wget: can't connect to remote host (10.10.130.150): Host is unreachable
   PRIVATE_IP=10.10.143.143
   PRIMARY_MANAGER_IP=10.10.130.150
   join as Secondary Manager
   Secondary Manager
Get Primary Manager IP
MANAGER_IP=10.10.130.150
PRIMARY_MANAGER_IP=10.10.130.150
wget: can't connect to remote host (10.10.130.150): Host is unreachable
MANAGER_TOKEN=
...
Manager: Primary manager Not ready yet, sleep for 60 seconds.
wget: can't connect to remote host (10.10.130.150): Host is unreachable
   MANAGER_IP=10.10.130.150
   MANAGER_TOKEN=
"docker swarm join" requires exactly 1 argument(s).
See 'docker swarm join --help'.

Usage:  docker swarm join [OPTIONS] HOST:PORT
...
#================ docker node ls ===
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
#===================================
Notify AWS that server is ready
ValidationError: Stack arn:aws:cloudformation:us-west-2:700154504226:stack/teststack-vpc-c2/291fe340-1ee1-11e7-9f71-503acbd4dc8d is in UPDATE_COMPLETE state and cannot be signaled
Complete Swarm setup

Eventually it gets destroyed, a new one comes up, and this turns into an endless cycle :-(

My version is docker4x/init-aws:17.04.0-ce-aws1, in an existing VPC in private subnets with public IPs attached.

netflash avatar Apr 11 '17 19:04 netflash

@netflash How many managers did you start with? When you destroyed the leader, did it promote a secondary manager to a leader? Looking at your output from the other manager, it looks like it did not. Probably because the swarm lost quorum.

kencochrane avatar Apr 11 '17 19:04 kencochrane

3 managers. To clarify - I destroyed the leader instance by simply 'terminating' it, in AWS terms.

Also, I'm not sure where to look regarding promotion. TBH I expected to see some changes in docker node ls, but as I showed earlier, that command just failed.

Could you please point me to where to look so I can provide you more info on this?

netflash avatar Apr 11 '17 19:04 netflash

@netflash thanks for the clarification. You would be able to see the promotion in docker node ls: it would show the original manager as unavailable or unreachable (I forget which), and one of the other managers would now be the primary.

I think the issue is that with a 3-manager swarm, when we lose the primary, we are losing quorum with the other managers, and that is causing docker node ls to time out (causing your error). If you do a docker info from a secondary manager, it might be able to give more information, if it doesn't also give you an error.

kencochrane avatar Apr 11 '17 19:04 kencochrane

I'll test that in a few. In the meantime, here's what I observed when I terminated a master/reachable node:

~ $ docker node ls

ID                           HOSTNAME                                     STATUS  AVAILABILITY  MANAGER STATUS
4znyct0pbzck302fiv2xyl3za *  ip-10-10-4-53.vpc-C.strataconsulting.com     Ready   Active        Reachable
ixs7v3133uxjk2qc2ic1nqaal    ip-10-10-21-67.vpc-C.strataconsulting.com    Ready   Active
iy0d3j6cxa44e5yxdgjw05bfe    ip-10-10-139-94.vpc-C.strataconsulting.com   Ready   Active
mhtga6k5n2sg9m2s1qhwoof5s    ip-10-10-130-131.vpc-C.strataconsulting.com  Ready   Active        Leader
pcd1alvre9drv4gg1ialcr7qx    ip-10-10-83-226.vpc-C.strataconsulting.com   Ready   Active
v3upyl2qb7p9hbc4rw6ru8d0h    ip-10-10-93-3.vpc-C.strataconsulting.com     Down    Active        Unreachable

The ASG eventually gets a new node spun up, and after that I see this:

~ $ docker node ls
ID                           HOSTNAME                                     STATUS  AVAILABILITY  MANAGER STATUS
4znyct0pbzck302fiv2xyl3za *  ip-10-10-4-53.vpc-C.strataconsulting.com     Ready   Active        Reachable
ixs7v3133uxjk2qc2ic1nqaal    ip-10-10-21-67.vpc-C.strataconsulting.com    Ready   Active
iy0d3j6cxa44e5yxdgjw05bfe    ip-10-10-139-94.vpc-C.strataconsulting.com   Ready   Active
mhtga6k5n2sg9m2s1qhwoof5s    ip-10-10-130-131.vpc-C.strataconsulting.com  Ready   Active        Leader
ornlqt5np8jti6kw2k8kyk8eh    ip-10-10-86-72.vpc-C.strataconsulting.com    Ready   Active        Reachable
pcd1alvre9drv4gg1ialcr7qx    ip-10-10-83-226.vpc-C.strataconsulting.com   Ready   Active
v3upyl2qb7p9hbc4rw6ru8d0h    ip-10-10-93-3.vpc-C.strataconsulting.com     Down    Active        Unreachable

The new master node got added to the cluster as expected.

netflash avatar Apr 11 '17 20:04 netflash

Now I terminated the master/leader.

Same story here. Well, the circumstances are a bit different, because this cluster already has one unreachable instance; not sure if that has any impact. Nevertheless, I'm going to try with a 5-master swarm in a few to see what happens.

Output of docker node ls and docker info from a master/reachable instance:

~ $ docker node ls
Error response from daemon: rpc error: code = 4 desc = context deadline exceeded
~ $
~ $ docker info
Containers: 5
 Running: 4
 Paused: 0
 Stopped: 1
Images: 5
Server Version: 17.04.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: awslogs
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
Swarm: active
 NodeID: 4znyct0pbzck302fiv2xyl3za
 Error: rpc error: code = 4 desc = context deadline exceeded
 Is Manager: true
 Node Address: 10.10.4.53
 Manager Addresses:
  10.10.130.131:2377
  10.10.4.53:2377
  10.10.86.72:2377
  10.10.93.3:2377
Runtimes: runc
Default Runtime: runc
Init Binary:
containerd version: 422e31ce907fd9c3833a38d7b8fdd023e5a76e73
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.19-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.951GiB
Name: ip-10-10-4-53.vpc-C.strataconsulting.com
ID: USDF:HPMT:T2VA:O7MX:WUBS:6TBR:EQT5:OIRS:XV6H:GTXH:SCCR:2CIM
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 75
 Goroutines: 134
 System Time: 2017-04-11T20:04:31.409536117Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
 os=linux
 region=us-west-2
 availability_zone=us-west-2a
 instance_type=t2.small
 node_type=manager
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

netflash avatar Apr 11 '17 20:04 netflash

5-master swarm (and 1 worker)

Before termination

~ $ docker node ls
ID                           HOSTNAME                                     STATUS  AVAILABILITY  MANAGER STATUS
fv654rsbzvzdpf4nw981ove5q    ip-10-10-91-249.vpc-C.strataconsulting.com   Ready   Active        Reachable
iai0u08idjgemw3uaf7dmnpl2    ip-10-10-133-103.vpc-C.strataconsulting.com  Ready   Active        Reachable
k8fodadmqwx4of2mbd37huvzn    ip-10-10-136-162.vpc-C.strataconsulting.com  Ready   Active        Reachable
qver0jgbohkxu8ep5m2jreqhf    ip-10-10-18-130.vpc-C.strataconsulting.com   Ready   Active        Reachable
r8k8ddzy1dbvm6v2jokzbsgcq *  ip-10-10-22-111.vpc-C.strataconsulting.com   Ready   Active        Leader
xisi4hcquojaylxnixra5tjvz    ip-10-10-146-77.vpc-C.strataconsulting.com   Ready   Active

~ $ docker info
Containers: 5
 Running: 4
 Paused: 0
 Stopped: 1
Images: 5
Server Version: 17.04.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: awslogs
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
Swarm: active
 NodeID: qver0jgbohkxu8ep5m2jreqhf
 Is Manager: true
 ClusterID: z2zvk375c489qelz3qpjobxb5
 Managers: 5
 Nodes: 6
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.10.18.130
 Manager Addresses:
  10.10.133.103:2377
  10.10.136.162:2377
  10.10.18.130:2377
  10.10.22.111:2377
  10.10.91.249:2377
Runtimes: runc
Default Runtime: runc
Init Binary:
containerd version: 422e31ce907fd9c3833a38d7b8fdd023e5a76e73
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.19-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.951GiB
Name: ip-10-10-18-130.vpc-C.strataconsulting.com
ID: 7GKD:YU6X:XQIL:CTAI:BIOY:EP7E:XNPP:VDZ7:YEN5:GSN5:BM4N:X4CP
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 298
 Goroutines: 367
 System Time: 2017-04-11T20:40:28.471369645Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
 os=linux
 region=us-west-2
 availability_zone=us-west-2a
 instance_type=t2.small
 node_type=manager
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Straight after termination of master/leader.

Output from another master

~ $ docker node ls
ID                           HOSTNAME                                     STATUS   AVAILABILITY  MANAGER STATUS
fv654rsbzvzdpf4nw981ove5q    ip-10-10-91-249.vpc-C.strataconsulting.com   Ready    Active        Reachable
iai0u08idjgemw3uaf7dmnpl2    ip-10-10-133-103.vpc-C.strataconsulting.com  Ready    Active        Leader
k8fodadmqwx4of2mbd37huvzn    ip-10-10-136-162.vpc-C.strataconsulting.com  Ready    Active        Reachable
qver0jgbohkxu8ep5m2jreqhf *  ip-10-10-18-130.vpc-C.strataconsulting.com   Ready    Active        Reachable
r8k8ddzy1dbvm6v2jokzbsgcq    ip-10-10-22-111.vpc-C.strataconsulting.com   Unknown  Active        Unreachable
xisi4hcquojaylxnixra5tjvz    ip-10-10-146-77.vpc-C.strataconsulting.com   Ready    Active

So re-election happened. The DynamoDB table is not updated at this point.

Two minutes later the DynamoDB table got updated, and a new node spun up.

And finally the node got added to the swarm:

~ $ docker node ls
ID                           HOSTNAME                                     STATUS  AVAILABILITY  MANAGER STATUS
fv654rsbzvzdpf4nw981ove5q    ip-10-10-91-249.vpc-C.strataconsulting.com   Ready   Active        Reachable
iai0u08idjgemw3uaf7dmnpl2    ip-10-10-133-103.vpc-C.strataconsulting.com  Ready   Active        Leader
k8fodadmqwx4of2mbd37huvzn    ip-10-10-136-162.vpc-C.strataconsulting.com  Ready   Active        Reachable
n0khn2laj3q0lymfs9t12k5h5    ip-10-10-67-164.vpc-C.strataconsulting.com   Ready   Active        Reachable
qver0jgbohkxu8ep5m2jreqhf *  ip-10-10-18-130.vpc-C.strataconsulting.com   Ready   Active        Reachable
r8k8ddzy1dbvm6v2jokzbsgcq    ip-10-10-22-111.vpc-C.strataconsulting.com   Down    Active        Unreachable
xisi4hcquojaylxnixra5tjvz    ip-10-10-146-77.vpc-C.strataconsulting.com   Ready   Active

Outcome - it works as desired with a 5-master swarm.

The question is - why doesn't this work with a 3-master swarm? Or, if this isn't supposed to work, why does the documentation allow creating a 3-node cluster?

netflash avatar Apr 11 '17 20:04 netflash

Thanks for all of the information. It looks like our current recommended setup (3 manager nodes) doesn't handle recovery from a leader node failure correctly. I'm looking at this now to see what our options are. Things get complicated when we lose manager quorum.

kencochrane avatar Apr 11 '17 21:04 kencochrane

@mwaaas Can you execute docker-diagnose and provide us with the ID?
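
(You run it from one of the nodes over SSH, roughly like this; it prints a session ID at the end, which is the ID we need. The key name and address below are placeholders.)

$ ssh -i <your-key>.pem docker@<manager-public-ip>
~ $ docker-diagnose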

nathanleclaire avatar Apr 11 '17 21:04 nathanleclaire

I tried to recreate this, and I'm unable to get it to fail like you described. Here is what I did.

  1. I started a 3-manager stack using https://editions-us-east-1.s3.amazonaws.com/aws/edge/Docker.tmpl
  2. Once the stack was up and running, I logged into each manager and made sure everything was working fine, running docker version, docker node ls, and docker info.
  3. I went into the EC2 console, manually terminated the leader node, and waited until it was terminated.
  4. On the other manager nodes (the ones still running) I ran docker node ls and docker info, and they worked fine. I noticed that the previous manager was still listed, in an 'Unreachable' state.
  5. After a few minutes I saw the new manager join the swarm, and now we have 3 managers again.

This all works as expected. Are you having this problem with your private subnet setup using the customized CloudFormation template, or is this with the out-of-the-box CloudFormation template, with no changes?

If it is with your customized template, then there is something in your setup that is causing the problem, and you will need to run the docker-diagnose tool that @nathanleclaire recommended so we can see what it might be.

kencochrane avatar Apr 11 '17 22:04 kencochrane

@nathanleclaire I destroyed that stack since I couldn't use it.

@kencochrane I will try to reproduce the issue later in the day, but I noticed it was happening if you select t1.micro as your instance type; try using that.

And thanks for answering the questions I asked; I have a better understanding now. I have another question though: the script that updates DynamoDB with the primary node IP - I can't find it in the template. Is it part of the OS?

mwaaas avatar Apr 12 '17 03:04 mwaaas

@mwaaas I used t2.micros when I tested it out; I haven't used a t1.micro since those aren't really supported anymore. I don't expect that to be much different, but you never know.

The script is inside of the guide system container that is running on the host.
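
If you want to read it, something along these lines should work from the host; the container name and script path vary, so adjust to whatever docker ps and crontab -l show:

~ $ docker ps --format '{{.Names}} {{.Image}}' | grep guide
~ $ docker exec -it <guide-container-name> sh
/ # crontab -l                 # lists the cron jobs and the scripts they run
/ # cat <path-from-crontab>    # the script that refreshes the manager IP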

kencochrane avatar Apr 12 '17 11:04 kencochrane

@kencochrane thanks, I haven't found time to reproduce the issue yet; once I do, I will post the steps to reproduce it.

For now, let me read the script in the guide image.

Thanks

mwaaas avatar Apr 12 '17 11:04 mwaaas

@kencochrane hi there,

I wasn't able to reproduce this today either.

My approach was to reproduce this issue in 3 different configurations:

  1. Create a stack in NEW VPC using docker's template - https://editions-us-east-1.s3.amazonaws.com/aws/edge/Docker.tmpl

  2. Create a stack in existing VPC using docker's template - https://console.aws.amazon.com/cloudformation/home#/stacks/new?stackName=Docker&templateURL=https://editions-us-east-1.s3.amazonaws.com/aws/edge/Docker-no-vpc.tmpl

  3. Create a stack in existing VPC using custom template

For each configuration: wait till it deployed, SSH into a master/reachable node, terminate the master/leader, and verify docker node ls from a master/reachable node.

And no issues today at all. I also did a diff of the CF template - no changes since yesterday.

netflash avatar Apr 12 '17 16:04 netflash

@netflash thank you for the update.

kencochrane avatar Apr 12 '17 18:04 kencochrane

Hi @kencochrane

Quick question: I noticed that during autoscaling, when a node fails to join the swarm cluster, the init-aws script fails gracefully. I expected it to fail, mark the instance creation as failed, and terminate the instance.

But what is happening is that the instance remains running.

mwaaas avatar Apr 13 '17 18:04 mwaaas

@mwaaas I'm not sure I understand the question. Are you asking what happens if the init-aws script fails with an error and the instance stays running? When you say it fails gracefully, do you mean it exits with an error (exit 1), or exits without an error (exit 0)?

Ideally, if there is an error with the init script, the instance shouldn't stay running. The way the autoscaling is set up, it will wait for the signal from the init script saying it is up and running. If it doesn't get that signal before the timeout, it will assume the instance failed and terminate it.

Is this not happening for you?

kencochrane avatar Apr 13 '17 18:04 kencochrane

@kencochrane So a node executes the init-aws script during cloud-init. My expectation is that if the init-aws script fails, the instance initialization should fail and the instance should be removed from the autoscaling group. But the script seems to notify AWS that the instance is ready regardless of whether the node joined the swarm or not.

E.g.:

#================ docker node ls ===
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
#===================================
Notify AWS that server is ready
ValidationError: Stack arn:aws:cloudformation:eu-west-1:309952364818:stack/HermesStagingDockerSwarm/377ad4b0-19f8-11e7-aaf0-503ac9e74cfd is in CREATE_COMPLETE state and cannot be signaled
Complete Swarm setup

So the node wasn't able to join the swarm, but the script goes ahead and notifies AWS that the server is ready. Expected behavior: notify AWS that the server is not ready.

mwaaas avatar Apr 13 '17 18:04 mwaaas

@mwaaas ok, I understand what you are saying now. Thanks for the follow up :).

If it is doing what you are saying it is doing, then yes, we shouldn't return success unconditionally; we should first check if the node successfully joined the swarm, and only then return success.

I think it was left the way it was so that the person could investigate what was going on and then manually add the node to the swarm themselves. That assumption probably doesn't make sense anymore; if a node can't join the swarm, we should fail it and try again with a new one.
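
Roughly, the change would be to gate the signal on the swarm state. A minimal sketch (reusing the variable names from the existing init script; the check itself is just an illustration, not the actual fix):

#!/bin/sh
# Sketch only: signal CloudFormation success only if the node actually joined the swarm.
SWARM_STATE=$(docker info --format '{{.Swarm.LocalNodeState}}')

if [ "$SWARM_STATE" = "active" ]; then
    echo "Notify AWS that server is ready"
    cfn-signal --success true  --stack "$STACK_NAME" --resource "$INSTANCE_NAME" --region "$REGION"
else
    echo "Node failed to join the swarm, signal failure so the instance is replaced"
    cfn-signal --success false --stack "$STACK_NAME" --resource "$INSTANCE_NAME" --region "$REGION"
fi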

kencochrane avatar Apr 13 '17 19:04 kencochrane

Could you guys clarify this a little bit more? Pardon my ignorance, but it looks like this script updates the CF stack, not the ASG. In that case it works only while the CF template is being deployed, right? How does that affect the ASG in an already-created stack?

netflash avatar Apr 13 '17 20:04 netflash

From the init script it looks like it's updating CF:

if [[ "$HAS_DDC" == "no" ]] ; then
    echo "Notify AWS that server is ready"
    cfn-signal --stack $STACK_NAME --resource $INSTANCE_NAME --region $REGION
else
    echo "DDC is installed, it will let AWS know that the server is ready, when it's ready."
fi

But even that stack-update section still fails sometimes, trying to signal a stack whose status is "CREATE_COMPLETE".

For instance, in my case this is how the stack update was failing:

#===================================
Notify AWS that server is ready
ValidationError: Stack arn:aws:cloudformation:eu-west-1:309952364818:stack/HermesStagingDockerSwarm/377ad4b0-19f8-11e7-aaf0-503ac9e74cfd is in CREATE_COMPLETE state and cannot be signaled
Complete Swarm setup

@kencochrane you had told me that the code that updates DynamoDB is in the guide-aws script, but the only script I am seeing in the guide-aws container is:

#!/bin/sh

# set system wide env variables, so they are available to ssh connections
/usr/bin/env > /etc/environment

echo "Initialize logging for guide daemons"
# setup symlink to output logs from relevant scripts to container logs
ln -s /proc/1/fd/1 /var/log/docker/refresh.log
ln -s /proc/1/fd/1 /var/log/docker/watcher.log
ln -s /proc/1/fd/1 /var/log/docker/cleanup.log

# start cron
/usr/sbin/crond -f -l 9 -L /var/log/cron.log

NB

  • The reason I am interested in that script is that I can change it to take more than 10 minutes before updating DynamoDB with the primary manager IP. That way I can try to replicate the scenario where, after terminating the master node, the autoscaling group creates a new node, the node is unable to join the swarm cluster, but it still remains running according to AWS.

The desired behavior is for it to fail and for the autoscaling group to create a new instance and try again.

Or is the script in the meta-server container, which seems to be a binary?

mwaaas avatar Apr 14 '17 03:04 mwaaas

@netflash

Could you guys clarify this a little bit more? Pardon my ignorance, but it looks like this script updates the CF stack, not the ASG. In that case it works only while the CF template is being deployed, right? How does that affect the ASG in an already-created stack?

You are correct; I was mistaken, the signal is only for CloudFormation. The ASG will use the health checks to decide whether the node needs to be removed from the group or not.

kencochrane avatar Apr 14 '17 15:04 kencochrane

@mwaaas it is a script run by cron; it is called refresh.sh.
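
In broad strokes it does something like the sketch below. This is a reconstruction of the idea only, not the shipped refresh.sh; the table name and item layout are the same placeholders as in my earlier sketch.

#!/bin/sh
# Rough sketch of a refresh.sh-style cron job (not the shipped script).
# TABLE_NAME and the item layout are assumptions.
TABLE_NAME=swarm-cluster-info
REGION=us-west-2

# ask the local engine which manager is currently the leader
LEADER_ID=$(docker node ls --filter "role=manager" \
    --format '{{.ID}} {{.ManagerStatus}}' | awk '$2 == "Leader" {print $1}')
LEADER_IP=$(docker node inspect "$LEADER_ID" --format '{{.Status.Addr}}')

# write it back so nodes coming up via init-aws pick up the current leader
if [ -n "$LEADER_IP" ]; then
    aws dynamodb put-item \
        --region "$REGION" \
        --table-name "$TABLE_NAME" \
        --item "{\"node_type\": {\"S\": \"primary_manager\"}, \"ip\": {\"S\": \"$LEADER_IP\"}}"
fi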

kencochrane avatar Apr 14 '17 15:04 kencochrane

@kencochrane and how does this health check work? I mean, if the response on port 44554 is generated by the underlying OS process, does this process know something about its own role in the cluster?

I am seeing the Elastic Load Balancer checking the health of instances on port 44554. I have done a curl on that port and it returns "LGTM". What application is running on that port, and how is it started on the instances?

There is a diagnostic service running on the node that responds to this request. It is built into our underlying operating system.

netflash avatar Apr 14 '17 15:04 netflash

@netflash the diagnostic service will check if Docker is up and running, and a few other things; if not, it will return an error, which will trigger the ASG to remove the instance.
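
In other words, the check behind that port is conceptually something like this; an illustration only, the real service checks more than just the engine:

#!/bin/sh
# Illustrative sketch of the kind of check the diagnostic endpoint performs;
# the real service on port 44554 checks additional things.
if docker info >/dev/null 2>&1; then
    echo "LGTM"                                 # healthy: the ELB health check passes
    exit 0
else
    echo "docker daemon not responding" >&2     # unhealthy: the ASG replaces the instance
    exit 1
fi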

kencochrane avatar Apr 14 '17 15:04 kencochrane

@kencochrane thanks, I have seen the script.

mwaaas avatar Apr 14 '17 16:04 mwaaas

@kencochrane thanks!

netflash avatar Apr 14 '17 18:04 netflash