rabbitmq-server icon indicating copy to clipboard operation
rabbitmq-server copied to clipboard

Ensure blank member state is always returned when there are no member states to avoid crashing the gm process

Open Ayanda-D opened this issue 1 year ago • 9 comments

Proposed Changes

Hi Rabbit Team. We are still heavily using CMQs and there are couple crashes we want to iron out while 3.12.x series is still active (QQ adoption slowly underway). This PR is a fix to GM's find_member_or_blank/2 when there isn't existing members state defined (rare case, but often manifests), which causes a ton of crashes illustrated below. We want to guarantee that blank member state is always returned in such cases, to avoid leaving queues in a very bad state.

2024-01-13 17:35:33.614 [error] <0.8801.670> ** Generic server <0.8801.670> terminating
** Last message in was {'$gen_cast',{'$gm',4029,{catchup,{4026,<5817.3973.1732>},[{{4008,<5819.16632.0>},{member,{[],[]},17,17}},{{4025,<5818.5272.6>},{member,{[],[]},0,0}},{{4026,<0.8801.670>},{member,{[],[]},-1,-1}},{{4026,<5817.3973.1732>},{member,{[],[]},-1,-1}}]}}}
** When Server state == {state,{4026,<0.8801.670>},{{4025,<5818.5272.6>},#Ref<0.1395141190.206569492.44498>},{{4025,<5818.5272.6>},#Ref<0.1395141190.206569492.44499>},{resource,<<"xxxxxxx">>,queue,<<"xxxxxxx">>},rabbit_mirror_queue_slave,{4028,#{{4025,<5818.5272.6>} => {view_member,{4025,<5818.5272.6>},[],{4026,<0.8801.670>},{4026,<0.8801.670>}},{4026,<0.8801.670>} => {view_member,{4026,<0.8801.670>},[{4008,<5819.16632.0>}],{4025,<5818.5272.6>},{4025,<5818.5272.6>}}}},-1,undefined,[<0.1904.670>],{[],[]},[],0,undefined,#Ref<0.1395141190.206569492.44500>,fun rabbit_misc:execute_mnesia_transaction/1,false}
** Reason for termination ==
** {{badmap,undefined},[{maps,find,[{4026,<5817.3973.1732>},undefined],[]},{gm,find_member_or_blank,2,[{file,"src/gm.erl"},{line,1379}]},{gm,'-remove_erased_members/2-fun-0-',3,[{file,"src/gm.erl"},{line,1406}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{gm,handle_cast,2,[{file,"src/gm.erl"},{line,639}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1050}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
2024-01-13 17:35:33.614 [error] <0.19147.667> Supervisor {<0.19147.667>,rabbit_amqqueue_sup} had child rabbit_amqqueue started with rabbit_prequeue:start_link({amqqueue,{resource,<<"xxxxx">>,queue,<<"xxxxx">>},true,false,none,...}, slave, <0.19127.667>) at <0.8628.670> exit with reason {{badmap,undefined},[{maps,find,[{26,<5817.9014.1732>},undefined],[]},{gm,find_member_or_blank,2,[{file,"src/gm.erl"},{line,1379}]},{gm,'-remove_erased_members/2-fun-0-',3,[{file,"src/gm.erl"},{line,1406}]},{lists,foldl,3,[{file,"lists.erl"},{line,1263}]},{gm,handle_cast,2,[{file,"src/gm.erl"},{line,639}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1050}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]} in context child_terminated

Types of Changes

What types of changes does your code introduce to this project? Put an x in the boxes that apply

  • [x] Bug fix (non-breaking change which fixes issue #NNNN)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • [ ] Documentation improvements (corrections, new content, etc)
  • [ ] Cosmetic change (whitespace, formatting, etc)
  • [ ] Build system and/or CI

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask on the mailing list. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • [x] I have read the CONTRIBUTING.md document
  • [x] I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] All tests pass locally with my changes
  • [ ] If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • [ ] If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc.

Ayanda-D avatar Jan 17 '24 15:01 Ayanda-D

Have you considered why it enters that function clause? Looking at the gm module, it seems that if members_state is not initialised it should not handle that message, as it does for check_neighbours or add_on_right. This patch fixes the crash in the logs, but I don't think it addresses the root cause.

dcorbacho avatar Jan 17 '24 15:01 dcorbacho

@dcorbacho we don't know the root cause (and we cant reproduce the cause), but some critical queues go down quite badly, hence attempt to keep the queue running and deal with data consistency separately on an active queue. For this case though, it also looks OK to return blank state, seeing its only on attempt to remove_erased_members from the members state.

Ayanda-D avatar Jan 17 '24 16:01 Ayanda-D

@Ayanda-D This patch is initialising members_state through this clause https://github.com/rabbitmq/rabbitmq-server/blob/4513156eaab49950b51e5219a16e4b6dc1185476/deps/rabbit/src/gm.erl#L639 and executing the whole https://github.com/rabbitmq/rabbitmq-server/blob/4513156eaab49950b51e5219a16e4b6dc1185476/deps/rabbit/src/gm.erl#L625.

That means the gm is being initialised through that message, which might not be right. It seems join is the right way to initialise it: https://github.com/rabbitmq/rabbitmq-server/blob/4513156eaab49950b51e5219a16e4b6dc1185476/deps/rabbit/src/gm.erl#L670

What if that message should never be handled? Have you consider that? I don't remember the whole reasoning behind the gm but it is complex, difficult to understand and prone to errors. I would like an explanation of why that whole clause should be executed and the side effects of doing it. It seems that gm has not yet received the join command.

dcorbacho avatar Jan 17 '24 16:01 dcorbacho

The fact that check_neighbours , broadcast and all other GM primitives intentionally handle members_state = undefined and dont even allow processing it further is clearly by design. undefined state is permitted here

Regarding your question:

What if that message should never be handled?

I don't want to change/temper with the GM protocol itself, i.e. altering "what messages should/should not be processed". That could have more severe side effects on queues. We only want to fix the queue terminations which are impacting service quite badly.

Ayanda-D avatar Jan 17 '24 16:01 Ayanda-D

clearly by design, where is that design? The reason we are replacing CMQ is because they're broken by design. This code hasn't been touched in many years and no other users have reported this as far as I am aware. If we allow that gm process to continue, how do we know that is not in a broken state? Are those queues correct and fully functional afterwards? Can you demonstrate it? Can you explain why that message is received in that state? Why should it be processed? If it was expecting an undefined, I think we should have seen this crash before.

As I stated before, the gm is too complex and it has been really problematic in the past. We would appreciate if you could investigate further and explain in which circumstances this message is received and why it should be handled in this way.

dcorbacho avatar Jan 17 '24 17:01 dcorbacho

MemberState being undefined state is just normal behaviour - when the gm process starts up, member_state is first and foremost initialized in undefined state https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/gm.erl#L546

When the async join message issued on init is handled, MemberState can remain in undefined state if alive_view_members(View) =/= [Self] : https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/gm.erl#L678-L682

So a GM process executing with MemberState = undefined is clearly part of the design. This quite obvious.

The purpose of find_member_or_blank/2 is to return a blank member state if there's an error https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/gm.erl#L1379

We are just doing the same here. Whether MembersState is an empty-map or the member is not found in the currant member-state, OR in this case which we're now handling - if the member-state is still in undefined state from GM's initialization phase. Return a blank-state.

Regardless of the error in find_member_or_blank/2 MemberState, if there's an error finding a member, it is initialized with blank-state - and member stored in new blank state https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/gm.erl#L1404-L1405

Part of the purpose of the GM is to keep a membership view which it dynamically updates and corrects based on the state of the cluster. Thats the purpose of the catchup call which rebuilds the member-state based on how the state ring. https://github.com/rabbitmq/rabbitmq-server/blob/main/deps/rabbit/src/gm.erl#L814.

The GM attempts to correct the view of members dynamically - if the cluster/mirroring view is off, and member-state is not consistent, this is normal e.g. in case of partial partitions, attempts to restore/rebuild the consistent state are already taken care of.

So I'm not sure what the exactly the concern is here with adding error handling that prevents/protects ultimate crashes and queues being unavailable. A GM process running with a MemberState undefined is very normal following initialization and on handling join. Just not handled in this find_member_or_blank/2 - we dont always see this, but it does happen time to time and is now causing serious problems on our prod environments.

Ayanda-D avatar Jan 18 '24 11:01 Ayanda-D

@Ayanda-D there are two specific concerns:

  1. gm is possibly the most complex and at the same time, poorly documented/tested module in all of RabbitMQ's history. We are trying to pretend we understand the side effects of this change. Maybe you do but not several senior members of the core team
  2. Classic mirrored queues have been deprecated for several years. gm is about to go up in flames together with CQ mirroring, possibly as early as Feb or Mar, after 3.13 ships. We would very much prefer to not touch it for that basic reasons.

michaelklishin avatar Jan 18 '24 18:01 michaelklishin

This is a code path that is only affected when the edge case resulting in the crash occurs (so very low frequency of occurrence). But when it does occur, the worst case is already happening (lots of queues are crashing), so we don't see any benefit of leaving this part of the gm still susceptible to the same crashes that are causing outages, without any attempt of mediating the problem, on reasons of poor documentation, deprecation, etc.

We've ignored this issue before, but only until last weekend, when our engineers spent hours last Saturday trying to restore things. Hence initiative to at least fix and prevent queues from crashing and causing same headaches. All mirroring and unit_gm_SUITE tests pass. So there's nothing broken/affected by this patch.

We still heavily rely on CMQs and having this upstream means we dont have to manually patch each new release with CMQs which we use. But its your call at EOD. The most the community can do is contribute to the product to the best of our ability. Whichever direction you choose to take with this, we'll adjust/adapt on our end :-D Cheers 👍

Ayanda-D avatar Jan 19 '24 12:01 Ayanda-D

if its of any use here, just fyi, we've added logging on our older forks which we maintain, to expose these gm error state(s), while not crashing queues / relying on queue crash logs.

Ayanda-D avatar Jan 22 '24 12:01 Ayanda-D