Khepri: timeouts when one of the nodes stops responding

Open mkuratczyk opened this issue 5 months ago • 0 comments

Describe the bug

During chaos tests where one of the VMs/nodes is suddenly restarted, timeouts like these occur:

   crasher:
     initial call: rabbit_prequeue:init/1
     pid: <0.1007.0>
     registered_name: []
     exception exit: {{badrecord,
                          {error,
                              {timeout,
                                  {rabbitmq_metadata,
                                      '[email protected]'}}}},
                      [{dict,map_dict,2,[{file,"dict.erl"},{line,467}]},
                       {rabbit_amqqueue,internal_delete,3,
                           [{file,"rabbit_amqqueue.erl"},{line,1805}]},
                       {rabbit_amqqueue_process,'-terminate_delete/3-fun-1-',
                           7,
                           [{file,"rabbit_amqqueue_process.erl"},{line,332}]},
                       {rabbit_amqqueue_process,terminate_shutdown,2,
                           [{file,"rabbit_amqqueue_process.erl"},{line,362}]},
                       {gen_server2,terminate,3,
                           [{file,"gen_server2.erl"},{line,1158}]},
                       {gen_server2,handle_msg,2,
                           [{file,"gen_server2.erl"},{line,1048}]},
                       {proc_lib,wake_up,3,
                           [{file,"proc_lib.erl"},{line,251}]}]}
   crasher:
     initial call: rabbit_channel:init/1
     pid: <0.90831.0>
     registered_name: []
     exception exit: {{case_clause,
                          {error,
                              {timeout,
                                  {rabbitmq_metadata,
                                      '[email protected]'}}}},
                      [{rabbit_channel,binding_action,10,
                           [{file,"rabbit_channel.erl"},{line,1825}]},
                       {rabbit_channel,handle_method,3,
                           [{file,"rabbit_channel.erl"},{line,1614}]},
                       {rabbit_channel,handle_cast,2,
                           [{file,"rabbit_channel.erl"},{line,631}]},
                       {gen_server2,handle_msg,2,
                           [{file,"gen_server2.erl"},{line,1056}]},
                       {proc_lib,init_p_do_apply,3,
                           [{file,"proc_lib.erl"},{line,241}]}]}
       in function  gen_server2:terminate/3 (gen_server2.erl, line 1172)

Timeouts are of course expected when machines disappear, but we need to think through these scenarios and decide how to handle them. Either way, we probably should not log stack traces like these.
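
To illustrate one possible direction, here is a minimal sketch of matching the timeout tuple explicitly, so that the caller logs a single warning and gets a plain error instead of crashing with `badrecord` or `case_clause`. The module and function names are hypothetical and not part of RabbitMQ; only the `{error, {timeout, {StoreId, Node}}}` shape is taken from the crash reports above. Whether the caller should then retry, reply with an error to the client, or give up is exactly the open question.

```erlang
%% Hypothetical sketch, not RabbitMQ code: normalise the result of a
%% metadata store operation so that a timeout does not crash the caller.
-module(metadata_timeout_sketch).
-export([handle_result/1]).

%% Success passes through unchanged.
handle_result(ok) ->
    ok;
%% Shape taken from the crash reports: {error, {timeout, {StoreId, Node}}}.
%% Log one warning line instead of letting the process crash.
handle_result({error, {timeout, {StoreId, Node}}}) ->
    logger:warning("Metadata store ~tp on node ~tp timed out",
                   [StoreId, Node]),
    {error, timeout};
%% Any other error is returned to the caller as-is.
handle_result({error, _} = Error) ->
    Error.
```

A call site such as the one in `rabbit_channel:binding_action` could then match on `{error, timeout}` explicitly rather than falling through to a `case_clause`.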

Reproduction steps

It was a chaos test with a workload, including queue deletions and random restarts.

Expected behavior

?

Additional context

No response

mkuratczyk, Mar 15 '24 10:03