Prevent excessive logging in certain failure scenarios
Describe the bug
This is an umbrella issue for a bunch of small issues.
Some RabbitMQ components can log excessively when they crash or when some other relatively unusual event happens (although such events - a cluster node failure, for example - can be quite common in practice).
The goal is to avoid output like this in the logs:
** Last message in was {'$gen_cast',terminate}
** When Server state == {ch,
{conf,running,rabbit_framing_amqp_0_9_1,1,
<0.7674.0>,<0.7681.0>,<0.7674.0>,
<<"10.52.4.3:36244 -> 10.52.3.58:5672">>,
{user,<<"default_user_KVbiqyQs7TiNc32SbiD">>,
[administrator],
[{rabbit_auth_backend_internal,
#Fun<rabbit_auth_backend_internal.3.111050101>}]},
<<"/">>,<<>>,<0.7675.0>,
[{<<"exchange_exchange_bindings">>,bool,true},
{<<"connection.blocked">>,bool,true},
{<<"authentication_failure_close">>,bool,true},
{<<"basic.nack">>,bool,true},
{<<"publisher_confirms">>,bool,true},
{<<"consumer_cancel_notify">>,bool,true}],
none,0,1800000,#{},1000000000},
{lstate,<0.7682.0>,false},
none,82301,
{263,
[{pending_ack,82300,
<<"amq.ctag-MvTp7g5Pl6VBW1aGl2R1oQ">>,
1710247548856,
{resource,<<"/">>,queue,<<"one-3">>},
82299},
{pending_ack,82299,
<<"amq.ctag-MvTp7g5Pl6VBW1aGl2R1oQ">>,
1710247548852,
{resource,<<"/">>,queue,<<"one-3">>},
82298},
{pending_ack,82298,
<<"amq.ctag-MvTp7g5Pl6VBW1aGl2R1oQ">>,
1710247548852,
{resource,<<"/">>,queue,<<"one-3">>},
82297},
... (many more pending_ack)
Reproduction steps
I'm running chaos tests to see what kinds of situations like this I can trigger. So far there are two:
- osiris_replica can fail and its entire mailbox gets logged. This is potentially a huge piece of binary data coming over a TCP connection. For example, after running out of disk space I get this:
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> reason: {{badmatch,{error,enospc}},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> [{osiris_log,write_chunk,6,[{file,"osiris_log.erl"},{line,2348}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> {osiris_replica,'-handle_incoming_data/3-fun-0-',2,
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> [{file,"osiris_replica.erl"},{line,537}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> {lists,foldl,3,[{file,"lists.erl"},{line,1594}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> {osiris_replica,handle_incoming_data,3,
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> [{file,"osiris_replica.erl"},{line,536}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> {gen_server,try_handle_info,3,
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> [{file,"gen_server.erl"},{line,1095}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> {proc_lib,init_p_do_apply,3,
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> [{file,"proc_lib.erl"},{line,241}]}]}
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> offender: [{pid,<0.15412.0>},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> {id,"__stream-1_1710241152660281401"},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> {mfargs,{osiris_replica,start_link,undefined}},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> {restart_type,temporary},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> {significant,false},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> {shutdown,5000},
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0> {child_type,worker}]
2024-03-12 12:05:19.343283+00:00 [error] <0.245.0>
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0> ** Generic server <0.15439.0> terminating
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0> ** Last message in was {tcp,#Port<0.5937>,
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0> <<0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0> 136,0,0,1,142,50,127,114,211,0,0,0,0,0,0,0,0,0,
2024-03-12 12:05:19.129860+00:00 [error] <0.15439.0> 117,176,0,0,19,136,0,0,1,142,50,127,114,211,0,0,
followed by lots and lots more data. Moreover, this process is continuously restarted, so the same data is printed over and over as the stream leader keeps trying to send it to this replica.
- A channel can have a large state that gets printed in some cases (see the pending_ack entries above).
Expected behavior
General recommendations for logging:
- ~p should not be used for log formatting
- processes should implement format_status/2 (see the sketch below)
- the maximum size of the log should be configured
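As an illustration of the format_status recommendation, here is a minimal sketch, not RabbitMQ's actual channel code: the module name, the #state{} record and its pending_acks field are made up. A gen_server can trim its state in crash reports by implementing the optional format_status/2 callback:

-module(format_status_sketch).
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2, format_status/2]).

-record(state, {conf, pending_acks = []}).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    %% a deliberately large piece of state, standing in for pending acks
    {ok, #state{pending_acks = lists:seq(1, 100000)}}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Called by OTP before the state is printed in crash reports (Opt = terminate)
%% and for sys:get_status/1 output (Opt = normal); the potentially huge list
%% of pending acks is replaced with a short summary.
format_status(terminate, [_PDict, #state{pending_acks = Acks} = State]) ->
    State#state{pending_acks = {pending_ack_count, length(Acks)}};
format_status(_Opt, [_PDict, State]) ->
    [{data, [{"State", State}]}].

For the maximum-size recommendation, the OTP logger formatter already supports max_size and depth options, e.g. logger:update_formatter_config(default, #{max_size => 8192, depth => 20}). This is only an illustration of the underlying OTP primitive, not RabbitMQ's documented configuration interface.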
Additional context
No response
Another one from the channel process - this time it's the mailbox, not the state:
crasher:
initial call: rabbit_channel:init/1
pid: <0.11850.0>
registered_name: []
exception exit: noproc
in function gen_server2:terminate/3 (gen_server2.erl, line 1172)
ancestors: [<0.11847.0>,<0.11845.0>,<0.11839.0>,<0.11838.0>,<0.966.0>,
<0.965.0>,<0.964.0>,<0.962.0>,<0.961.0>,rabbit_sup,
<0.262.0>]
message_queue_len: 151
messages: [{'EXIT',<0.11847.0>,shutdown},
{'$gen_cast',
{queue_event,
{resource,<<"/">>,queue,<<"q-3">>},
{deliver,<<"amq.ctag-hrgXqfnbKzvHGRhnaoLYog">>,true,
[{{resource,<<"/">>,queue,<<"q-3">>},
<15143.922.0>,2461847,false,
{mc,mc_amqpl,
{content,60,none,
<<16,0,2>>,
rabbit_framing_amqp_0_9_1,
[<<0,2,77,48,0,0,1,142,50,202,34,181,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
osiris:
** When Server state == #{name => <<"__stream-1_1710247245159438000">>,
log =>
#{file => "00000000000118320636.segment",
mode =>
#{type => acceptor,mode => write,
tail =>
{118420442,
{11,118420257,1710257805308}},
epoch => 11},
directory =>
"/var/lib/rabbitmq/mnesia/[email protected]/stream/__stream-1_1710247245159438000",
max_segment_size_chunks => 256000,
retention =>
[{max_bytes,20000000000},{max_age,600000}],
filter_size => 16,
max_segment_size_bytes => 500000000,
tracking_config => #{}},
committed_offset => 122425221,
external_reference =>
{resource,<<"/">>,queue,<<"stream-1">>},
has_parse_state => true,num_offset_listeners => 0}
** Reason for termination ==
** {{accept_chunk_out_of_order,118520700,118420442},
[{osiris_log,accept_chunk,2,[{file,"osiris_log.erl"},{line,793}]},
{osiris_replica,'-handle_incoming_data/3-fun-0-',2,
[{file,"osiris_replica.erl"},{line,537}]},
{lists,foldl,3,[{file,"lists.erl"},{line,1594}]},
{osiris_replica,handle_incoming_data,3,
[{file,"osiris_replica.erl"},{line,536}]},
{gen_server,try_handle_info,3,[{file,"gen_server.erl"},{line,1095}]},
{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1183}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]}
crasher:
initial call: osiris_replica:init/1
pid: <0.20591.0>
registered_name: []
exception exit: {accept_chunk_out_of_order,118520700,118420442}
in function osiris_log:accept_chunk/2 (osiris_log.erl, line 793)
in call from osiris_replica:'-handle_incoming_data/3-fun-0-'/2 (osiris_replica.erl, line 537)
in call from lists:foldl/3 (lists.erl, line 1594)
in call from osiris_replica:handle_incoming_data/3 (osiris_replica.erl, line 536)
in call from gen_server:try_handle_info/3 (gen_server.erl, line 1095)
in call from gen_server:handle_msg/6 (gen_server.erl, line 1183)
ancestors: [osiris_server_sup,osiris_sup,<0.239.0>]
message_queue_len: 3
messages: [{tcp,#Port<0.2098>,
<<0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
This one is somewhat different, but still worth fixing - it contains a lot of whitespace for some reason (screenshot to preserve the strange formatting):
Hi Michal, there are edge cases with unlimited retries and a grace period between retries. While the problem is of course the retry itself, it also produces gigabytes of not-very-useful logs (I have only managed to resolve this by manually recreating the stream, so until then it keeps generating GBs of logs).
...
2024-08-15 20:20:59.254000+08:00 [warning] <0.3811.0> rabbit_stream_coordinator: failed to get tail of member __static_data_17237229779233216 on rabbit@DESKTOP-R73UFKT in 2 Error: {case_clause,missing_file}
2024-08-15 20:20:59.258000+08:00 [warning] <0.3812.0> rabbit_stream_coordinator: failed to get tail of member __static_data_17237229779233216 on rabbit@DESKTOP-R73UFKT in 2 Error: {case_clause,missing_file}
2024-08-15 20:20:59.262000+08:00 [warning] <0.3813.0> rabbit_stream_coordinator: failed to get tail of member __static_data_17237229779233216 on rabbit@DESKTOP-R73UFKT in 2 Error: {case_clause,missing_file}
Feel free to ignore this post if you feel it does not belong in this issue at all. But if we have an opportunity to avoid repetitive logs like this, it might be valuable to take it on.
@sysupbda those messages can be switched to use debug. However, if they are repeated, this may indicate that a stream replica fails to recover in a certain way or wasn't deleted correctly. Please start a new discussion and attach a few hundred lines of (vetted for sensitive data) logs before the first message of this kind on all nodes. Maybe @acogoluegnes would have a guess.
Thank you for your very quick response!
I am collecting more data to do an appropriate separate post for the underlying issue. I promise. But I do think we also have a logging issue here.
Warning or error is probably appropriate for this issue as the stream is not available until the problem is resolved. That would warrant a warning, right? But more than one warning for such a repeated error bloats the logs.
This edge case, by bloating the logs, forces logging configurations with sometimes undesirable tradeoffs. Sadly, it has already happened to my clients 3 times in 3 weeks.
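To illustrate the kind of stopgap this currently forces people into, here is a hedged sketch, not an existing RabbitMQ feature: an OTP logger primary filter that lets the first few occurrences of a matching warning through and drops the rest. The match string and the limit of 5 are arbitrary choices for illustration.

-module(repeat_suppression_sketch).
-export([install/0]).

%% Installs a primary logger filter that passes the first five occurrences
%% of a "failed to get tail" warning and silently drops the rest.
install() ->
    Counter = counters:new(1, []),
    Filter =
        fun(#{msg := {Fmt, _Args}} = Event, _Extra) when is_list(Fmt) ->
                case string:find(Fmt, "failed to get tail") of
                    nomatch ->
                        ignore;                     % not ours; defer to other filters
                    _ ->
                        counters:add(Counter, 1, 1),
                        case counters:get(Counter, 1) =< 5 of
                            true  -> Event;         % keep the first few occurrences
                            false -> stop           % drop the rest
                        end
                end;
           (_Event, _Extra) ->
                ignore
        end,
    logger:add_primary_filter(stream_coordinator_repeats, {Filter, undefined}).

A proper fix would of course live in the stream coordinator's retry logic itself rather than in a log filter.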
@michaelklishin , as promised I have reported the entire topic as a separate issue: https://github.com/rabbitmq/rabbitmq-server/issues/12036
@sysupbda if you run rabbitmq-diagnostics.bat status it will report a "Node data directory". That's what we need.
As discussed with @kjnilsson, the retry delay mechanism of the stream coordinator is a bit primitive and could be improved to avoid excessive logging.
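For context on what "could be improved" might look like, here is a hedged sketch, illustrative only and not the stream coordinator's actual code: a retry loop with exponential backoff that logs a warning only every 10th failed attempt.

-module(retry_backoff_sketch).
-export([retry/3]).

%% Retries Fun() with exponentially growing delays (capped at MaxDelayMs)
%% and logs a warning only on every 10th failed attempt.
retry(Fun, Attempt, MaxDelayMs) ->
    case Fun() of
        {ok, _} = Ok ->
            Ok;
        {error, Reason} ->
            Delay = min(MaxDelayMs, (1 bsl min(Attempt, 20)) * 100),
            case Attempt rem 10 of
                0 -> logger:warning("retry ~b failed: ~p, next attempt in ~b ms",
                                    [Attempt, Reason, Delay]);
                _ -> ok
            end,
            timer:sleep(Delay),
            retry(Fun, Attempt + 1, MaxDelayMs)
    end.

It could be invoked as retry_backoff_sketch:retry(fun() -> get_tail(Member) end, 1, 60000), where get_tail/1 is a hypothetical stand-in for the real operation.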