rabbitmq-server Khepri: timeouts when one of the nodes stops responding

Open mkuratczyk opened this issue 5 months ago • 0 comments

Describe the bug

During chaos tests where one of the VMs/nodes is suddenly restarted, processes crash with metadata store timeouts like these:

   crasher:
     initial call: rabbit_prequeue:init/1
     pid: <0.1007.0>
     registered_name: []
     exception exit: {{badrecord,
                          {error,
                              {timeout,
                                  {rabbitmq_metadata,
                                      '[email protected]'}}}},
                      [{dict,map_dict,2,[{file,"dict.erl"},{line,467}]},
                       {rabbit_amqqueue,internal_delete,3,
                           [{file,"rabbit_amqqueue.erl"},{line,1805}]},
                       {rabbit_amqqueue_process,'-terminate_delete/3-fun-1-',
                           7,
                           [{file,"rabbit_amqqueue_process.erl"},{line,332}]},
                       {rabbit_amqqueue_process,terminate_shutdown,2,
                           [{file,"rabbit_amqqueue_process.erl"},{line,362}]},
                       {gen_server2,terminate,3,
                           [{file,"gen_server2.erl"},{line,1158}]},
                       {gen_server2,handle_msg,2,
                           [{file,"gen_server2.erl"},{line,1048}]},
                       {proc_lib,wake_up,3,
                           [{file,"proc_lib.erl"},{line,251}]}]}

   crasher:
     initial call: rabbit_channel:init/1
     pid: <0.90831.0>
     registered_name: []
     exception exit: {{case_clause,
                          {error,
                              {timeout,
                                  {rabbitmq_metadata,
                                      '[email protected]'}}}},
                      [{rabbit_channel,binding_action,10,
                           [{file,"rabbit_channel.erl"},{line,1825}]},
                       {rabbit_channel,handle_method,3,
                           [{file,"rabbit_channel.erl"},{line,1614}]},
                       {rabbit_channel,handle_cast,2,
                           [{file,"rabbit_channel.erl"},{line,631}]},
                       {gen_server2,handle_msg,2,
                           [{file,"gen_server2.erl"},{line,1056}]},
                       {proc_lib,init_p_do_apply,3,
                           [{file,"proc_lib.erl"},{line,241}]}]}
       in function  gen_server2:terminate/3 (gen_server2.erl, line 1172)

Of course, timeouts are not unexpected when machines disappear, but we need to think through these scenarios and decide what to do. Either way, we probably should not log stack traces like these.
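Both crashes above carry the same value, `{error, {timeout, {rabbitmq_metadata, Node}}}`, reaching code that does not expect it: the `case_clause` exit means a `case` expression had no clause for the error tuple, and the `badrecord` exit suggests the tuple was passed on as if it were a `dict`. One possible direction, shown here only as a minimal sketch, is to match the timeout explicitly and log a single line. The module `timeout_sketch` and the function `handle_result/1` are hypothetical and are not part of RabbitMQ or Khepri. Only the shape of the error tuple is taken from the traces.

```erlang
-module(timeout_sketch).
-export([handle_result/1]).

%% Hypothetical helper, not a RabbitMQ API. It normalises the result of a
%% metadata store operation so the caller never sees an unmatched value.
handle_result(ok) ->
    ok;
handle_result({ok, _} = Ok) ->
    Ok;
handle_result({error, {timeout, {StoreId, Node}}}) ->
    %% Log one concise warning instead of crashing with a stack trace.
    logger:warning("metadata store ~tp timed out on node ~tp",
                   [StoreId, Node]),
    {error, timeout};
handle_result({error, _} = Err) ->
    %% Any other error is passed through unchanged.
    Err.
```

A caller would then handle `{error, timeout}` as an ordinary outcome, for example by retrying or by returning an error to the client. Which of those is appropriate is exactly the open question in this issue.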

Reproduction steps

It was a chaos test with a workload, including queue deletions and random restarts.

Expected behavior

Additional context

No response

Mar 15 '24 10:03 mkuratczyk
