
`khepri_db`: `function_clause` in `rabbit_federation_exchange_link_sup_sup` on network disconnect

Open lukebakken opened this issue 1 year ago • 3 comments

Describe the bug

Disconnecting the network to one node of a 3-node khepri-enabled cluster eventually results in a strange function_clause error:

rmq0-function_clause-stack.txt

The same error also originates from the rabbit_federation_queue_link_sup_sup process. My test project enables the rabbitmq_federation plugin but does not create any federation links.

Reproduction steps

  • Start cluster
    git clone git@github.com:lukebakken/docker-rabbitmq-cluster.git
    cd docker-rabbitmq-cluster
    git checkout khepri
    make DOCKER_FRESH=true clean up
    
  • Disconnect node rmq0
    docker network disconnect rabbitnet docker-rabbitmq-cluster-rmq0-1
    
  • Watch logs until function_clause error happens
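
The last step can be automated rather than watched by eye. This is a minimal sketch, not part of the test project: `wait_for_function_clause` is a hypothetical helper name, and the usage line assumes the cluster runs under Docker Compose v2 and that the command is run from the repository checkout (the `rmq2-1 |` log prefixes suggest Compose, but that is an assumption).

```shell
#!/bin/sh
# Hypothetical helper (not part of the test project): reads log lines on
# stdin, prints the first line containing "function_clause", then exits.
# Exits non-zero if the input ends without a match.
wait_for_function_clause() {
    grep --line-buffered -m 1 'function_clause'
}

# Assumed usage, run from the docker-rabbitmq-cluster checkout:
#   docker compose logs --follow --no-color | wait_for_function_clause
```

Once the error has been captured, `docker network connect rabbitnet docker-rabbitmq-cluster-rmq0-1` should reattach the node to the network used in the disconnect step.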

Expected behavior

No error.

Additional context

This does not appear to affect the normal operation of PerfTest.

In addition, the following log lines appear:

rmq2-1       | 2024-09-11 00:29:10.084227+00:00 [error] <0.181.0>
rmq2-1       | 2024-09-11 00:29:10.084227+00:00 [error] <0.181.0> ** Cannot get connection id for node '[email protected]'
rmq2-1       | 2024-09-11 00:29:10.084227+00:00 [error] <0.181.0>
rmq1-1       | 2024-09-11 00:29:10.096091+00:00 [error] <0.181.0>
rmq1-1       | 2024-09-11 00:29:10.096091+00:00 [error] <0.181.0> ** Cannot get connection id for node '[email protected]'
rmq1-1       | 2024-09-11 00:29:10.096091+00:00 [error] <0.181.0>

These log lines originate in OTP itself:

lbakken@shostakovich ~/development/erlang/otp (master =)
$ git grep -i 'cannot get connection'
lib/kernel/src/net_kernel.erl:1051:            error_logger:error_msg("~n** Cannot get connection id for node ~w~n",
lib/kernel/src/net_kernel.erl:1156:                error_logger:error_msg("~n** Cannot get connection id for node ~w~n",
lib/kernel/src/net_kernel.erl:1545:                    error_logger:error_msg("~n** Cannot get connection id for node ~w~n",

What's odd is that the error messages originate from the node to which the error message refers 🤔

lukebakken • Sep 11 '24 00:09

The rabbit_db_msup module and its callers will need some updates to handle potential timeouts when interacting with Khepri, as in https://github.com/rabbitmq/rabbitmq-server/pull/11785

The changes will probably be trickier for this module: the commands don't come from a user, so it's not a simple matter of bubbling up and returning an error.

the-mikedavis • Sep 11 '24 15:09

I just hit that with rabbit_shovel_dyn_worker_sup_sup, which makes sense, since it's also a mirrored supervisor.

mkuratczyk • Sep 11 '24 20:09

@the-mikedavis no, we can still bubble up an error. Shovel will then log it and restart. With Shovels, these "failure loops" are how errors are communicated, since this is a non-interactive client by definition.

michaelklishin • Sep 30 '24 15:09

I believe it has been fixed by #12853, which handles the possible timeout from Khepri and retries. That change was made in response to a CI failure that looks very close to this issue's description.

Has anyone been able to reproduce this recently?

dumbbell • Apr 23 '25 11:04

Yeah, I think these are the same issue. I will mark it as resolved and we can reopen it if we discover they are different.

Fixed by #12853.

dumbbell • Apr 23 '25 13:04