
gproc_dist hangs on leader election for new nodes

Open · ybogdanov opened this issue Aug 19 '14 · 9 comments

Hi,

I see strange behavior when adding new nodes to a cluster, and it's easy to reproduce:

%% ~/.hosts.erlang
'127.0.0.1'.

Open two terminals (term1, term2).

# term1
git clone https://github.com/uwiger/gproc gproc1
cd gproc1
GPROC_DIST=true make
alias start='erl -pa ebin deps/*/ebin -name test1@127.0.0.1 -eval "application:start(gproc), net_adm:world(), gproc_dist:start_link()."'
start
%% term1 (erl)
Erlang/OTP 17 [erts-6.1] [source] [64-bit] [smp:4:4] [async-threads:10] [kernel-poll:false]

Eshell V6.1  (abort with ^G)
(test1@127.0.0.1)1> nodes().
[]
(test1@127.0.0.1)2> gproc_dist:get_leader().
'test1@127.0.0.1'
(test1@127.0.0.1)3>
# term2
git clone https://github.com/uwiger/gproc gproc2
cd gproc2
GPROC_DIST=true make
alias start='erl -pa ebin deps/*/ebin -name test2@127.0.0.1 -eval "application:start(gproc), net_adm:world(), gproc_dist:start_link()."'
start
%% term2 (erl)
Erlang/OTP 17 [erts-6.1] [source] [64-bit] [smp:4:4] [async-threads:10] [kernel-poll:false]

Eshell V6.1  (abort with ^G)
(test2@127.0.0.1)1> nodes().
['test1@127.0.0.1']
(test2@127.0.0.1)2> gproc_dist:get_leader().
** exception exit: {timeout,{gen_leader,local_call,[gproc_dist,get_leader]}}
     in function  gen_leader:call/2 (src/gen_leader.erl, line 326)
(test2@127.0.0.1)3>

**Shut down the test1 node**

%% term2 (erl)
(test2@127.0.0.1)3> gproc_dist:get_leader().
'test2@127.0.0.1'
(test2@127.0.0.1)4>

**Start the test1 node again**

%% term1 (erl)
Erlang/OTP 17 [erts-6.1] [source] [64-bit] [smp:4:4] [async-threads:10] [kernel-poll:false]

Eshell V6.1  (abort with ^G)
(test1@127.0.0.1)1> nodes().
['test2@127.0.0.1']
(test1@127.0.0.1)2> gproc_dist:get_leader().
'test2@127.0.0.1'
(test1@127.0.0.1)3>
%% term2 (erl)
(test2@127.0.0.1)4> gproc_dist:get_leader().
'test2@127.0.0.1'
(test2@127.0.0.1)5>

I also tried this with Erlang R16B03.

ybogdanov · Aug 19 '14 09:08

It seems the problem is somewhere in the known-nodes list. If I start gproc like this (specifying the list of nodes):

erl -pa ebin deps/*/ebin -name test1@127.0.0.1 -eval "net_adm:world(), application:set_env(gproc, gproc_dist, ['test1@127.0.0.1', 'test2@127.0.0.1']), application:start(gproc)"

The example above works. So the problem is that test2 starts, goes into safe_loop(Server, candidate, NewE, {init}) and waits for a message from... the leader, I guess? But the leader (the test1 node) doesn't know anything about it.
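For reference, the same workaround can be written as static config instead of a set_env/3 call; a minimal sys.config sketch, equivalent to the command above and loaded with "erl -config sys ..." (node names as in this example):

%% sys.config (sketch; equivalent to the application:set_env/3 call above)
[
 {gproc, [
   {gproc_dist, ['test1@127.0.0.1', 'test2@127.0.0.1']}
 ]}
].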

Is there any way to tell test1's gproc_dist to add the second node?

ybogdanov · Aug 27 '14 11:08

I haven't had time to look at this yet. I'm sorry about that. But your analysis seems correct. I don't recall having written code in gproc to prepare for the addition of new nodes.

Have you tried using the locks_leader version of gproc?

uwiger · Aug 30 '14 08:08

After reading this issue I tried the uw-locks_leader branch as well, and I keep seeing crashes at locks_agent.erl line 1195, where it is called with an empty list []. With two nodes it normally works fine, but when I add more nodes it breaks, though not in a deterministic way, i.e. not always on all nodes, etc.

I have 1 server node that calls: gproc_dist:get_leader(), gproc:reg({p, g, server}, test)

and multiple client nodes that call: gproc_dist:get_leader(), gproc:reg({p, g, client}, test)

I do this each time a server and a client connect with each other, so it happens multiple times on the server and once on each client.
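A minimal sketch of what that amounts to on each connect (handle_connect/1 and the test value are just illustrative names, not my actual code):

%% Role is 'server' or 'client'
handle_connect(Role) ->
    _Leader = gproc_dist:get_leader(),        %% ask which node is the leader
    true = gproc:reg({p, g, Role}, test).     %% register the global property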

[error] locks_agent: aborted, reason: {badmatch,[]}, trace:
[{locks_agent,get_locks,2,[{file,"src/locks_agent.erl"},{line,1195}]},
 {locks_agent,get_locks,2,[{file,"src/locks_agent.erl"},{line,1196}]},
 {locks_agent,analyse,2,[{file,"src/locks_agent.erl"},{line,1210}]},
 {locks_agent,handle_locks,1,[{file,"src/locks_agent.erl"},{line,843}]},
 {locks_agent,handle_info,2,[{file,"src/locks_agent.erl"},{line,608}]},
 {locks_agent,handle_msg,2,[{file,"src/locks_agent.erl"},{line,256}]},
 {locks_agent,loop,1,[{file,"src/locks_agent.erl"},{line,231}]},
 {locks_agent,agent_init,3,[{file,"src/locks_agent.erl"},{line,198}]}]

Is there something basic wrong with my setup, or should I try to debug this issue?

MarkNijhof · Dec 15 '14 18:12

Looks like it is similar or identical to: https://github.com/uwiger/locks/issues/7

MarkNijhof · Dec 15 '14 19:12

Yes, probably. I've started working on some things in the 'uw-leader-hanging' branch.

Among other things, there is a test in test/locks_leader_tests.erl, where the new locks_ttb.erl is used (see the code). It saves multi-node traces to a list of files, which can be processed using locks_ttb:format(Dir, Outfile).
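For example, something along these lines (the paths here are just placeholders for wherever the test wrote its per-node trace files):

%% Sketch: process the saved multi-node traces into one readable output file
locks_ttb:format("path/to/trace_dir", "leader_trace.txt").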

In the test, 5 nodes are used and a split-brain scenario is forced. Usually, it ends with all nodes but one agreeing on which one is the leader. Going through the trace is painful and slow, and I haven't yet found the bug.

uwiger · Dec 15 '14 21:12

I just tried the 'uw-leader-hanging' branch, and leader selection seems to go fine when I kill the leader and put that node back in. It won't be the same leader afterwards, but they all seem to agree; I test this by running gproc_dist:get_leader(). However, the registered properties are not distributed across the nodes: some nodes have a few, some have none.
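One way to see this per node from the shell (a rough sketch; it just uses gproc:lookup_pids/1 with the keys from my earlier comment):

%% Ask every connected node which pids it sees under the global property.
[{N, rpc:call(N, gproc, lookup_pids, [{p, g, client}])}
 || N <- [node() | nodes()]].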

MarkNijhof · Dec 15 '14 23:12

I am still having this issue after a netsplit.

unbalancedparentheses · Jul 20 '16 03:07

I'll look into this. I'm leaving for a trip to the US today, which will either mean that I'll find some time, or have less time - not sure yet.

uwiger · Jul 20 '16 06:07

Let me know if you need any help on how to reproduce this. @manuelolmos also had this issue, so the two of us can help you reproduce this.

unbalancedparentheses · Jul 21 '16 21:07