etcd-aws-cluster icon indicating copy to clipboard operation
etcd-aws-cluster copied to clipboard

curl -f -s http://ASG_SERVER_IP:2379/v2/members hangs

Open ayashjorden opened this issue 7 years ago • 8 comments

Hello Monsanto, Great process and flow. I was able to bootstrap a cluster pretty quickly.

I've read ETCd clustering post a couple of times and couldn't find a solution/reason to the following: When the ASG bootstraps the ETCd instances, sporadically, the etc-aws-cluster script would hang when executing the /members cURL command: curl -f -s http://ASG_SERVER_IP:2379/v2/members.

Lets take hosts A and B. Host A is running etcd-aws-cluster script as part of the cloud-init phase. tried to execute the /members API call on host B. the cURL command hangs. Only when I restart the ETCd systemd.unit on host B, host A can continue the script and the cluster gets bootstrapped.

Hope the description is clear enough for reproduce.

I'm available if more info is required.

Thank you in advance, Yarden

ayashjorden avatar Sep 10 '17 12:09 ayashjorden

Hi @tj-corrigan, sorry for pinging you directly, can you please take a look?

ayashjorden avatar Sep 13 '17 07:09 ayashjorden

Hi again, I've added the following random sleep to handle the race condition:

# Sleep randomly to remove the bootstrap race condition
SLEEP_DURATION=$[ ( $RANDOM % 100 )  + 1 ]
echo Going to sleep for $SLEEP_DURATION
sleep $SLEEP_DURATION

This allowed the cluster to boot properly.

ayashjorden avatar Sep 13 '17 11:09 ayashjorden

Mistakenly closed, still waiting for some input/comment

ayashjorden avatar Sep 13 '17 11:09 ayashjorden

Thanks for the ping, I hadn't seen the issue. I'll try and take a look to see if I can reproduce sometime today.

tjcorr avatar Sep 13 '17 13:09 tjcorr

Sorry, I don't think I completely understand your issue. Are you seeing this issue during bootstrapping of the original cluster or when adding a new host to an existing cluster? In the initial cluster situation we would expect that the curl command on https://github.com/MonsantoCo/etcd-aws-cluster/blob/master/etcd-aws-cluster#L85 (assuming that is what you are referring to) to fail since no members of the cluster are available yet. In this case the script should just move on and hit the new cluster code.

I suppose we could potentially add a --max-time flag to make sure we never get stuck here?

tjcorr avatar Sep 13 '17 14:09 tjcorr

Hi @tj-corrigan, short: yes for --max-time

I am bootstrapping a 5 node cluster, consistently getting the cURL command to hang when one of the nodes ETCd process is up and another nodes queries its /members API.

Only after adding a random sleep and the --max-time cURL flag the issue was solved.

ayashjorden avatar Sep 17 '17 08:09 ayashjorden

@ayashjorden do you need both random sleep code AND --max-time cURL? I ran into the sam problem today as well.

xueshanf avatar Nov 23 '17 06:11 xueshanf

I added --max-time and the random sleep to our fork of the script as well and found it to work. Before we were hanging at line 85 as well, almost every single time (but not 100%). @ayashjorden @tj-corrigan fyi.

bloo avatar Feb 02 '18 19:02 bloo