rabbitmq-dotnet-client
Exception during recovery causes recovery failure
I've run into an issue where an exception occurs after recovery, which causes the connection to be closed with no further recovery attempt.
As far as I can tell from the logs, this is what happens:
- `RecoverySucceeded` event raised
- `ModelShutdown` event raised with close reason `530 NOT_ALLOWED`
- Connection is closed, but the `ConnectionShutdown` event is not raised
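To observe this sequence, the events can be wired up like so (a minimal sketch; `IAutorecoveringConnection` is the 6.x interface, and on 5.x you would cast to the concrete `AutorecoveringConnection` class instead, so the exact cast is an assumption about the client version in use):

```csharp
using System;
using RabbitMQ.Client;

var factory = new ConnectionFactory { AutomaticRecoveryEnabled = true };

// Cast target is an assumption: IAutorecoveringConnection in 6.x,
// the concrete AutorecoveringConnection class in 5.x.
var connection = (IAutorecoveringConnection)factory.CreateConnection();
IModel channel = connection.CreateModel();

connection.RecoverySucceeded += (s, e) =>
    Console.WriteLine("RecoverySucceeded");
channel.ModelShutdown += (s, e) =>
    Console.WriteLine($"ModelShutdown: {e.ReplyCode} {e.ReplyText}");      // 530 NOT_ALLOWED here
connection.ConnectionShutdown += (s, e) =>
    Console.WriteLine($"ConnectionShutdown: {e.ReplyCode} {e.ReplyText}"); // never observed
```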
From a quick look over the code it seems like the problem is here:
https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/015c51761deabfebe03e3bff2c4eb4e2ab53a072/projects/client/RabbitMQ.Client/src/client/impl/Connection.cs#L575-L586
A close reason is already set, but another exception occurs.
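In simplified form, the logic at those lines behaves like this (an illustrative paraphrase from reading the file, not the verbatim source): when a close reason is already set, `SetCloseReason` returns false, the new exception is only appended to the shutdown report, and `OnShutdown()`, which raises `ConnectionShutdown`, is never called.

```csharp
// Simplified paraphrase of Connection.HandleMainLoopException
// (illustrative, not verbatim from the linked lines).
public void HandleMainLoopException(ShutdownEventArgs reason)
{
    if (!SetCloseReason(reason))
    {
        // A close reason was already set: the exception is only recorded
        // in m_shutdownReport via LogCloseError; OnShutdown() is never
        // called, so no ConnectionShutdown event reaches the application.
        LogCloseError("Unexpected Main Loop Exception while closing: " + reason,
            new Exception(reason.ToString()));
        return;
    }

    OnShutdown();
    LogCloseError("Unexpected connection closure: " + reason,
        new Exception(reason.ToString()));
}
```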
Here is the full exception, taken from `m_shutdownReport`:
```
Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
   at RabbitMQ.Client.Impl.InboundFrame.ReadFrom(NetworkBinaryReader reader)
   at RabbitMQ.Client.Framing.Impl.Connection.MainLoopIteration()
   at RabbitMQ.Client.Framing.Impl.Connection.ClosingLoop()
```
We cannot draw any conclusions from a single stack trace. Please collect and share server logs and a traffic capture. If you believe you have a decent understanding of the problem and can reproduce it, consider submitting a pull request.
Note that connection recovery was simplified just a few days ago in https://github.com/rabbitmq/rabbitmq-dotnet-client/pull/656, too.
It seems that neither the old nor the new version of connection recovery actually solves the problem when an exception occurs while trying to recover the topology. The exception is only logged; nothing is done to recover the consumer. But if the client reconnects once again, recovery will be triggered once again.
Probably the simplest way to reproduce this is a cluster with two durable, non-HA queues, each living on a different node. If we stop one of the nodes, the client reconnects to the other node but cannot start consuming from the queue that currently has no master node, and throws an exception. The expected behavior would be for consumer recovery to be retried indefinitely (as connection recovery is), in this case until the node that owns the queue comes back online.
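Here is a minimal client-side sketch of that scenario (the host and queue names are illustrative assumptions; the cluster setup itself is done on the server side):

```csharp
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

// Repro sketch: assume "node1" hosts queue-a and "node2" hosts queue-b,
// both queues durable and non-mirrored.
var factory = new ConnectionFactory
{
    AutomaticRecoveryEnabled = true,
    TopologyRecoveryEnabled = true
};

IConnection connection = factory.CreateConnection(new[] { "node1", "node2" });
IModel channel = connection.CreateModel();

// Durable, non-HA queue living on the node we will stop.
channel.QueueDeclare("queue-a", durable: true, exclusive: false,
                     autoDelete: false, arguments: null);

var consumer = new EventingBasicConsumer(channel);
consumer.Received += (s, e) => channel.BasicAck(e.DeliveryTag, multiple: false);
channel.BasicConsume("queue-a", autoAck: false, consumer);

// Now stop the node hosting queue-a: the client reconnects to the other
// node, topology recovery throws while re-declaring/re-consuming queue-a,
// the exception is only logged, and the consumer is never recovered.
```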
What this means is that we cannot trust topology recovery; instead, we need to implement our own wrapper around it to do the recovery ourselves (see the sketch after the code references below).
You can actually check the code: https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/v5.1.1/projects/client/RabbitMQ.Client/src/client/impl/AutorecoveringConnection.cs#L892 and https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/master/projects/client/RabbitMQ.Client/src/client/impl/AutorecoveringConnection.cs#L800. When `RecordedConsumer.Recover` throws an exception, that consumer is never retried: the exception is only logged, and nothing else is done with it. And there is actually no way to intercept that exception and act on it.
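The kind of wrapper we ended up needing looks roughly like this (a sketch of user-side code, not library API; `RecoverConsumerForever` and the retry policy are our own inventions for illustration):

```csharp
using System;
using System.Threading.Tasks;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;
using RabbitMQ.Client.Exceptions;

static class ConsumerRecovery
{
    // Workaround sketch: after every successful connection recovery, retry
    // consumer setup until it succeeds instead of trusting topology recovery.
    public static void RecoverConsumerForever(IAutorecoveringConnection connection, string queue)
    {
        connection.RecoverySucceeded += async (s, e) =>
        {
            while (true)
            {
                try
                {
                    // A failed passive declare closes the channel, so a
                    // fresh one is needed on every attempt.
                    IModel channel = connection.CreateModel();
                    channel.QueueDeclarePassive(queue); // throws while the queue has no live master
                    var consumer = new EventingBasicConsumer(channel);
                    consumer.Received += (cs, ce) => channel.BasicAck(ce.DeliveryTag, multiple: false);
                    channel.BasicConsume(queue, autoAck: false, consumer);
                    return;
                }
                catch (OperationInterruptedException)
                {
                    // Queue (or its master node) still unavailable; wait and retry.
                    await Task.Delay(TimeSpan.FromSeconds(5));
                }
            }
        };
    }
}
```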
This library cannot know how it should recover from topology recovery failures. It works very well for a pretty significant number of users. The docs do not promise that it will cover every case.
A contribution that makes it possible to react to topology exceptions would be considered.
Retry logic and filtering for recovery have been added in the Java client. Even though this is not a trivial task, a PR based on the Java implementation would be welcome.
This will be addressed by #1312
@rosca-sabina @mikenorgate
- https://github.com/rabbitmq/rabbitmq-dotnet-client/releases/tag/v6.5.0
- https://www.nuget.org/packages/RabbitMQ.Client/6.5.0
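For completeness, a minimal sketch of how the topology recovery hooks shipped in 6.5.0 can be used. The property and delegate shapes below are assumptions based on the release notes; check the 6.5.0 API for the exact signatures:

```csharp
using System;
using RabbitMQ.Client;

// Sketch only: names and signatures are assumptions, not verified
// against the shipped 6.5.0 API.
var factory = new ConnectionFactory
{
    AutomaticRecoveryEnabled = true,
    TopologyRecoveryEnabled = true,
    // Skip automatic recovery for entities the application manages itself.
    TopologyRecoveryFilter = new TopologyRecoveryFilter
    {
        QueueFilter = q => !q.Name.StartsWith("manual-"),
        ConsumerFilter = c => true
    },
    // React to topology recovery failures instead of having them only logged.
    TopologyRecoveryExceptionHandler = new TopologyRecoveryExceptionHandler
    {
        QueueRecoveryExceptionCondition = (q, ex) => true,
        QueueRecoveryExceptionHandler = (q, ex, conn) =>
            Console.WriteLine($"Queue {q.Name} failed to recover: {ex.Message}")
    }
};
```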