etcd-aws-cluster
etcd-aws-cluster copied to clipboard
curl -f -s http://ASG_SERVER_IP:2379/v2/members hangs
Hello Monsanto, Great process and flow. I was able to bootstrap a cluster pretty quickly.
I've read ETCd clustering post a couple of times and couldn't find a solution/reason to the following:
When the ASG bootstraps the ETCd instances, sporadically, the etc-aws-cluster script would hang when executing the /members cURL command: curl -f -s http://ASG_SERVER_IP:2379/v2/members
.
Lets take hosts A and B. Host A is running etcd-aws-cluster script as part of the cloud-init phase. tried to execute the /members API call on host B. the cURL command hangs. Only when I restart the ETCd systemd.unit on host B, host A can continue the script and the cluster gets bootstrapped.
Hope the description is clear enough for reproduce.
I'm available if more info is required.
Thank you in advance, Yarden
Hi @tj-corrigan, sorry for pinging you directly, can you please take a look?
Hi again, I've added the following random sleep to handle the race condition:
# Sleep randomly to remove the bootstrap race condition
SLEEP_DURATION=$[ ( $RANDOM % 100 ) + 1 ]
echo Going to sleep for $SLEEP_DURATION
sleep $SLEEP_DURATION
This allowed the cluster to boot properly.
Mistakenly closed, still waiting for some input/comment
Thanks for the ping, I hadn't seen the issue. I'll try and take a look to see if I can reproduce sometime today.
Sorry, I don't think I completely understand your issue. Are you seeing this issue during bootstrapping of the original cluster or when adding a new host to an existing cluster? In the initial cluster situation we would expect that the curl command on https://github.com/MonsantoCo/etcd-aws-cluster/blob/master/etcd-aws-cluster#L85 (assuming that is what you are referring to) to fail since no members of the cluster are available yet. In this case the script should just move on and hit the new cluster code.
I suppose we could potentially add a --max-time flag to make sure we never get stuck here?
Hi @tj-corrigan, short: yes for --max-time
I am bootstrapping a 5 node cluster, consistently getting the cURL command to hang when one of the nodes ETCd process is up and another nodes queries its /members API.
Only after adding a random sleep and the --max-time cURL flag the issue was solved.
@ayashjorden do you need both random sleep code AND --max-time cURL? I ran into the sam problem today as well.
I added --max-time
and the random sleep to our fork of the script as well and found it to work. Before we were hanging at line 85 as well, almost every single time (but not 100%). @ayashjorden @tj-corrigan fyi.