rabbitmq-server
`khepri_db`: `function_clause` in `rabbit_federation_exchange_link_sup_sup` on network disconnect
Describe the bug
Disconnecting the network to one node of a 3-node khepri-enabled cluster eventually results in a strange function_clause error:
rmq0-function_clause-stack.txt
The same error also originates from the rabbit_federation_queue_link_sup_sup process. My test project enables the rabbitmq_federation plugin, but does not create any federation links.
Reproduction steps
- Start cluster
    git clone [email protected]:lukebakken/docker-rabbitmq-cluster.git
    cd docker-rabbitmq-cluster
    git checkout khepri
    make DOCKER_FRESH=true clean up
- Disconnect node rmq0
    docker network disconnect rabbitnet docker-rabbitmq-cluster-rmq0-1
- Watch logs until the `function_clause` error happens
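The log-watching step can be scripted rather than done by eye. This is a minimal sketch, assuming GNU or BSD `grep` (for the `-m` option) and that the stack trace shows up in the logs of the `docker-rabbitmq-cluster-rmq0-1` container. The `wait_for_pattern` helper name is hypothetical and not part of the test project.

```shell
# Hypothetical helper: read log lines on stdin, print the first one that
# contains the given fixed string, then stop reading.
wait_for_pattern() {
    grep -F -m 1 -e "$1"
}
```

For example, `docker logs --follow docker-rabbitmq-cluster-rmq0-1 2>&1 | wait_for_pattern function_clause` should print the first matching line as soon as it is logged.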
Expected behavior
No error.
Additional context
This does not appear to affect the normal operation of PerfTest.
In addition, the following log lines appear:
rmq2-1 | 2024-09-11 00:29:10.084227+00:00 [error] <0.181.0>
rmq2-1 | 2024-09-11 00:29:10.084227+00:00 [error] <0.181.0> ** Cannot get connection id for node '[email protected]'
rmq2-1 | 2024-09-11 00:29:10.084227+00:00 [error] <0.181.0>
rmq1-1 | 2024-09-11 00:29:10.096091+00:00 [error] <0.181.0>
rmq1-1 | 2024-09-11 00:29:10.096091+00:00 [error] <0.181.0> ** Cannot get connection id for node '[email protected]'
rmq1-1 | 2024-09-11 00:29:10.096091+00:00 [error] <0.181.0>
These log lines originate in OTP itself:
lbakken@shostakovich ~/development/erlang/otp (master =)
$ git grep -i 'cannot get connection'
lib/kernel/src/net_kernel.erl:1051: error_logger:error_msg("~n** Cannot get connection id for node ~w~n",
lib/kernel/src/net_kernel.erl:1156: error_logger:error_msg("~n** Cannot get connection id for node ~w~n",
lib/kernel/src/net_kernel.erl:1545: error_logger:error_msg("~n** Cannot get connection id for node ~w~n",
What's odd is that the error messages originate from the node to which the error message refers 🤔
The rabbit_db_msup module and its callers will need some updates to handle potential timeouts when interacting with Khepri, as was done in https://github.com/rabbitmq/rabbitmq-server/pull/11785
The changes will probably be trickier for this module since the commands don't come from a user so it's not a simple matter of bubbling up and returning an error.
I just hit that with rabbit_shovel_dyn_worker_sup_sup, which makes sense, since it's also a mirrored supervisor.
@the-mikedavis no, we can still bubble up an error. Shovel will then log it and restart. With Shovels, these "failure loops" are how errors are communicated, since this is a non-interactive client by definition.
I believe this has been fixed by #12853, which handles the possible timeout from Khepri and retries. That PR was worked on in response to a CI failure that looks very close to this issue description.
Was anyone able to reproduce this recently?
Yeah, I think these are the same issue. I will mark it as resolved and we can reopen it if we discover they are different.
Fixed by #12853.