
Prevent excessive logging in certain failure scenarios

Open • mkuratczyk opened this issue 1 year ago • 9 comments

Describe the bug

This is an umbrella issue for a bunch of small issues.

Some RabbitMQ components can log excessively when they crash or when some other relatively unusual event happens (although such events can in fact be quite common: a cluster node failure, for example).

The goal is to avoid output like this in the logs: Screenshot 2024-03-12 at 12 53 41

or this:

 ** Last message in was {'$gen_cast',terminate}
 ** When Server state == {ch,
                          {conf,running,rabbit_framing_amqp_0_9_1,1,
                           <0.7674.0>,<0.7681.0>,<0.7674.0>,
                           <<"10.52.4.3:36244 -> 10.52.3.58:5672">>,
                           {user,<<"default_user_KVbiqyQs7TiNc32SbiD">>,
                            [administrator],
                            [{rabbit_auth_backend_internal,
                              #Fun<rabbit_auth_backend_internal.3.111050101>}]},
                           <<"/">>,<<>>,<0.7675.0>,
                           [{<<"exchange_exchange_bindings">>,bool,true},
                            {<<"connection.blocked">>,bool,true},
                            {<<"authentication_failure_close">>,bool,true},
                            {<<"basic.nack">>,bool,true},
                            {<<"publisher_confirms">>,bool,true},
                            {<<"consumer_cancel_notify">>,bool,true}],
                           none,0,1800000,#{},1000000000},
                          {lstate,<0.7682.0>,false},
                          none,82301,
                          {263,
                           [{pending_ack,82300,
                             <<"amq.ctag-MvTp7g5Pl6VBW1aGl2R1oQ">>,
                             1710247548856,
                             {resource,<<"/">>,queue,<<"one-3">>},
                             82299},
                            {pending_ack,82299,
                             <<"amq.ctag-MvTp7g5Pl6VBW1aGl2R1oQ">>,
                             1710247548852,
                             {resource,<<"/">>,queue,<<"one-3">>},
                             82298},
                            {pending_ack,82298,
                             <<"amq.ctag-MvTp7g5Pl6VBW1aGl2R1oQ">>,
                             1710247548852,
                             {resource,<<"/">>,queue,<<"one-3">>},
                             82297},
... (many more pending_ack)

Reproduction steps

I'm running chaos tests to see what kinds of situations like this I can trigger. So far there are two:

  1. osiris_replica can fail and its entire mailbox gets logged. This is potentially a huge amount of binary data coming from a TCP socket. For example, after running out of disk space I get this:
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>     reason: {{badmatch,{error,enospc}},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>              [{osiris_log,write_chunk,6,[{file,"osiris_log.erl"},{line,2348}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {osiris_replica,'-handle_incoming_data/3-fun-0-',2,
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                               [{file,"osiris_replica.erl"},{line,537}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {lists,foldl,3,[{file,"lists.erl"},{line,1594}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {osiris_replica,handle_incoming_data,3,
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                               [{file,"osiris_replica.erl"},{line,536}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {gen_server,try_handle_info,3,
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                           [{file,"gen_server.erl"},{line,1095}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>               {proc_lib,init_p_do_apply,3,
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                         [{file,"proc_lib.erl"},{line,241}]}]}
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>     offender: [{pid,<0.15412.0>},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {id,"__stream-1_1710241152660281401"},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {mfargs,{osiris_replica,start_link,undefined}},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {restart_type,temporary},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {significant,false},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {shutdown,5000},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>                {child_type,worker}]
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> 
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0> ** Generic server <0.15439.0> terminating
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0> ** Last message in was {tcp,#Port<0.5937>,
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0>                             <<0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0>                               136,0,0,1,142,50,127,114,211,0,0,0,0,0,0,0,0,0,
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0>                               117,176,0,0,19,136,0,0,1,142,50,127,114,211,0,0,

followed by lots and lots more data. Moreover, this process is continuously restarted, so the data is printed over and over as the stream leader keeps trying to send it to this replica.

  2. A channel can have a large state that gets printed in some cases (see the pending_ack entries above; a sketch of state scrubbing follows this list)
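
For point 2, here is a minimal sketch of state scrubbing via the optional format_status/2 callback. This is illustrative only, not RabbitMQ's actual code; the module, record and field names are made up:

-module(quiet_server).
-behaviour(gen_server).

-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         format_status/2]).

-record(state, {name, pending_acks = [], buffer = <<>>}).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    {ok, #state{name = example}}.

handle_call(_Req, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(_Info, State) ->
    {noreply, State}.

%% Called by OTP when it formats the process state for sys:get_status/1
%% (Opt = normal) and for the termination error report (Opt = terminate).
format_status(terminate, [_PDict, State]) ->
    scrub(State);
format_status(normal, [_PDict, State]) ->
    [{data, [{"State", scrub(State)}]}].

%% Replace unbounded fields with short summaries before they are logged.
scrub(#state{pending_acks = Acks, buffer = Buf} = State) ->
    State#state{pending_acks = {pending_ack_count, length(Acks)},
                buffer = {buffer_bytes, byte_size(Buf)}}.

As far as I can tell, format_status only affects how the state (and, with format_status/1 on newer OTP releases, the last message) is reported; the full mailbox in a proc_lib crash report is bounded only by the formatter limits sketched under "Expected behavior" below.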

Expected behavior

General recommendations for logging:

  1. ~p should not be used for log formatting
  2. processes should implement format_status/2
  3. a maximum size for log output should be configured (a sketch covering points 1 and 3 follows this list)
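
A sketch of points 1 and 3, assuming the default logger handler (the module name and data below are made up for illustration): ~P takes a depth argument and truncates with '...', unlike ~p, and the handler's formatter accepts depth and chars_limit options so that even OTP-generated crash reports are bounded.

-module(log_limits_example).
-export([demo/0]).

demo() ->
    %% A deeply nested term, standing in for a large channel state.
    BigState = [{pending_ack, N, <<"amq.ctag-example">>,
                 {resource, <<"/">>, queue, <<"one-3">>}}
                || N <- lists:seq(1, 100000)],

    %% ~p would print all 100000 tuples; ~P stops at the given depth
    %% and replaces the rest with '...'.
    io:format("~P~n", [BigState, 10]),

    %% Formatter-level limits bound everything the handler emits,
    %% including crash reports whose terms the code never formats itself.
    ok = logger:update_formatter_config(default,
                                        #{depth => 20,
                                          chars_limit => 8192}).

In a RabbitMQ deployment these formatter options would presumably be set via the node's logging configuration rather than at runtime; the call above just shows which knobs exist.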

Additional context

No response

mkuratczyk • Mar 12 '24 14:03

Another one from the channel process; this time it's the mailbox, not the state:

crasher:
  initial call: rabbit_channel:init/1
  pid: <0.11850.0>
  registered_name: []
  exception exit: noproc
    in function  gen_server2:terminate/3 (gen_server2.erl, line 1172)
  ancestors: [<0.11847.0>,<0.11845.0>,<0.11839.0>,<0.11838.0>,<0.966.0>,
                <0.965.0>,<0.964.0>,<0.962.0>,<0.961.0>,rabbit_sup,
                <0.262.0>]
  message_queue_len: 151
  messages: [{'EXIT',<0.11847.0>,shutdown},
                {'$gen_cast',
                 {queue_event,
                  {resource,<<"/">>,queue,<<"q-3">>},
                  {deliver,<<"amq.ctag-hrgXqfnbKzvHGRhnaoLYog">>,true,
                   [{{resource,<<"/">>,queue,<<"q-3">>},
                     <15143.922.0>,2461847,false,
                     {mc,mc_amqpl,
                      {content,60,none,
                       <<16,0,2>>,
                       rabbit_framing_amqp_0_9_1,
                       [<<0,2,77,48,0,0,1,142,50,202,34,181,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                          0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

mkuratczyk • Mar 12 '24 14:03

osiris:

 ** When Server state == #{name => <<"__stream-1_1710247245159438000">>,
                           log =>
                               #{file => "00000000000118320636.segment",
                                 mode =>
                                     #{type => acceptor,mode => write,
                                       tail =>
                                           {118420442,
                                            {11,118420257,1710257805308}},
                                       epoch => 11},
                                 directory =>
                                     "/var/lib/rabbitmq/mnesia/[email protected]/stream/__stream-1_1710247245159438000",
                                 max_segment_size_chunks => 256000,
                                 retention =>
                                     [{max_bytes,20000000000},{max_age,600000}],
                                 filter_size => 16,
                                 max_segment_size_bytes => 500000000,
                                 tracking_config => #{}},
                           committed_offset => 122425221,
                           external_reference =>
                               {resource,<<"/">>,queue,<<"stream-1">>},
                           has_parse_state => true,num_offset_listeners => 0}
 ** Reason for termination ==
 ** {{accept_chunk_out_of_order,118520700,118420442},
     [{osiris_log,accept_chunk,2,[{file,"osiris_log.erl"},{line,793}]},
      {osiris_replica,'-handle_incoming_data/3-fun-0-',2,
                      [{file,"osiris_replica.erl"},{line,537}]},
      {lists,foldl,3,[{file,"lists.erl"},{line,1594}]},
      {osiris_replica,handle_incoming_data,3,
                      [{file,"osiris_replica.erl"},{line,536}]},
      {gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},
      {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},
      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]}
 
   crasher:
     initial call: osiris_replica:init/1
     pid: <0.20591.0>
     registered_name: []
     exception exit: {accept_chunk_out_of_order,118520700,118420442}
       in function  osiris_log:accept_chunk/2 (osiris_log.erl, line 793)
       in call from osiris_replica:'-handle_incoming_data/3-fun-0-'/2 (osiris_replica.erl, line 537)
       in call from lists:foldl/3 (lists.erl, line 1594)
       in call from osiris_replica:handle_incoming_data/3 (osiris_replica.erl, line 536)
       in call from gen_server:try_handle_info/3 (gen_server.erl, line 1095)
       in call from gen_server:handle_msg/6 (gen_server.erl, line 1183)
     ancestors: [osiris_server_sup,osiris_sup,<0.239.0>]
     message_queue_len: 3
     messages: [{tcp,#Port<0.2098>,
                       <<0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                         0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
                         0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

mkuratczyk • Mar 13 '24 11:03

This one is somewhat different but still worth fixing: it contains a lot of whitespace for some reason (screenshot to preserve the strange formatting): Screenshot 2024-03-13 at 18 39 29

mkuratczyk • Mar 13 '24 17:03

Hi Michal, there are edge cases with unlimited retries separated by a grace period. While the problem is of course the retry itself, it also produces gigabytes of not-that-useful logs (I have only managed to resolve this by manually recreating the stream, so in the meantime it tends to create gigabytes of logs):

...
2024-08-15 20:20:59.254000+08:00 [warning] <0.3811.0> rabbit_stream_coordinator: failed to get tail of member __static_data_17237229779233216 on rabbit@DESKTOP-R73UFKT in 2 Error: {case_clause,missing_file}
2024-08-15 20:20:59.258000+08:00 [warning] <0.3812.0> rabbit_stream_coordinator: failed to get tail of member __static_data_17237229779233216 on rabbit@DESKTOP-R73UFKT in 2 Error: {case_clause,missing_file}
2024-08-15 20:20:59.262000+08:00 [warning] <0.3813.0> rabbit_stream_coordinator: failed to get tail of member __static_data_17237229779233216 on rabbit@DESKTOP-R73UFKT in 2 Error: {case_clause,missing_file}

Feel free to ignore this post if you feel it does not belong in this issue at all. But if there is an opportunity to avoid repetitive logs like this, it might be valuable to take it on.
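
One way to keep such repeats out of the log, sketched here purely as an illustration (this is not an existing RabbitMQ feature), is a logger primary filter that drops an event if an identical message was logged within a short window:

-module(dedup_log_filter).
-export([install/0, filter/2]).

%% The ETS table is owned by whichever process calls install/0 and is
%% never pruned; this is a sketch, not production code.
install() ->
    ets:new(?MODULE, [named_table, public, set]),
    logger:add_primary_filter(?MODULE, {fun ?MODULE:filter/2, 30000}).

filter(#{msg := Msg} = Event, WindowMs) ->
    Key = erlang:phash2(Msg),
    Now = erlang:monotonic_time(millisecond),
    case ets:lookup(?MODULE, Key) of
        [{Key, Last}] when Now - Last < WindowMs ->
            stop;                              % drop the repeat
        _ ->
            ets:insert(?MODULE, {Key, Now}),
            Event                              % pass it through, remember it
    end.

This only deduplicates events whose message content is identical (which the repeated "failed to get tail" warnings above happen to be); a real fix would more likely live in the stream coordinator's retry logic itself.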

sysupbda • Aug 15 '24 13:08

@sysupbda those messages can be switched to the debug level. However, if they are repeated, this may indicate that a stream replica fails to recover in a certain way or wasn't deleted correctly. Please start a new discussion and attach a few hundred lines of (vetted for sensitive data) logs from before the first message of this kind, on all nodes. Maybe @acogoluegnes would have a guess.

michaelklishin • Aug 15 '24 14:08

Thank you for your very quick response!

I am collecting more data for a proper separate post about the underlying issue, I promise. But I do think we also have a logging issue here.

Warning or error is probably the appropriate level for this issue, since the stream is not available until the problem is resolved; that would warrant a warning, right? But more than one warning for such a repeated error bloats the logs.

This edge case, by bloating the logs, forces logging configurations with sometimes undesirable trade-offs. Sadly, it has already happened to my clients three times in three weeks.

sysupbda • Aug 15 '24 15:08

@michaelklishin , as promised I have reported the entire topic as a separate issue: https://github.com/rabbitmq/rabbitmq-server/issues/12036

sysupbda • Aug 16 '24 16:08

@sysupbda if you run rabbitmq-diagnostics.bat status it will report a "Node data directory". That's what we need.

michaelklishin • Aug 16 '24 17:08

As discussed with @kjnilsson, the retry delay mechanism of the stream coordinator is a bit primitive and could be improved to avoid excessive logging.
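
For illustration, a less primitive retry delay could use capped exponential backoff with jitter, so that a persistently failing member is retried (and logged about) progressively less often. This is only a sketch, not the stream coordinator's actual implementation:

-module(backoff_example).
-export([delay/1]).

-define(BASE_MS, 1000).
-define(MAX_MS, 60000).

%% Attempt is 1, 2, 3, ...; the delay doubles on every attempt up to the
%% cap, plus up to 25% random jitter to avoid synchronised retries.
delay(Attempt) when is_integer(Attempt), Attempt >= 1 ->
    Exp = min(?MAX_MS, ?BASE_MS bsl min(Attempt - 1, 16)),
    Jitter = rand:uniform(max(1, Exp div 4)),
    Exp + Jitter.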

acogoluegnes • Sep 02 '24 15:09