rabbitmq-dotnet-client
Exception during recovery causes recovery failure
I've run into an issue where an exception occurs after recovery, which causes the connection to be closed with no further recovery attempt.
As far as I can tell from the logs, this is what happens:
- `RecoverySucceeded` event raised
- `ModelShutdown` event raised with close reason `530 NOT_ALLOWED`
- Connection is closed, but the `ConnectionShutdown` event is not raised
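To observe this sequence, the events can be wired up like so (a minimal sketch; `IAutorecoveringConnection` is the 6.x interface, and on 5.x you would cast to the concrete `AutorecoveringConnection` class instead, so the exact cast is an assumption about the client version in use):

```csharp
using System;
using RabbitMQ.Client;

var factory = new ConnectionFactory { AutomaticRecoveryEnabled = true };

// Cast target is an assumption: IAutorecoveringConnection in 6.x,
// the concrete AutorecoveringConnection class in 5.x.
var connection = (IAutorecoveringConnection)factory.CreateConnection();
IModel channel = connection.CreateModel();

connection.RecoverySucceeded += (s, e) =>
    Console.WriteLine("RecoverySucceeded");
channel.ModelShutdown += (s, e) =>
    Console.WriteLine($"ModelShutdown: {e.ReplyCode} {e.ReplyText}");      // 530 NOT_ALLOWED here
connection.ConnectionShutdown += (s, e) =>
    Console.WriteLine($"ConnectionShutdown: {e.ReplyCode} {e.ReplyText}"); // never observed
```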
From a quick look over the code it seems like the problem is here:
https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/015c51761deabfebe03e3bff2c4eb4e2ab53a072/projects/client/RabbitMQ.Client/src/client/impl/Connection.cs#L575-L586
A close reason is already set, but another exception occurs.
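In simplified form, the logic at those lines behaves like this (an illustrative paraphrase from reading the file, not the verbatim source): when a close reason is already set, `SetCloseReason` returns false, the new exception is only appended to the shutdown report, and `OnShutdown()`, which raises `ConnectionShutdown`, is never called.

```csharp
// Simplified paraphrase of Connection.HandleMainLoopException
// (illustrative, not verbatim from the linked lines).
public void HandleMainLoopException(ShutdownEventArgs reason)
{
    if (!SetCloseReason(reason))
    {
        // A close reason was already set: the exception is only recorded
        // in m_shutdownReport via LogCloseError; OnShutdown() is never
        // called, so no ConnectionShutdown event reaches the application.
        LogCloseError("Unexpected Main Loop Exception while closing: " + reason,
            new Exception(reason.ToString()));
        return;
    }

    OnShutdown();
    LogCloseError("Unexpected connection closure: " + reason,
        new Exception(reason.ToString()));
}
```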
Here is the full exception, taken from `m_shutdownReport`:
```
Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.
   at RabbitMQ.Client.Impl.InboundFrame.ReadFrom(NetworkBinaryReader reader)
   at RabbitMQ.Client.Framing.Impl.Connection.MainLoopIteration()
   at RabbitMQ.Client.Framing.Impl.Connection.ClosingLoop()
```
We cannot draw any conclusions from a single stack trace. Please collect and share server logs and a traffic capture. If you believe you have a decent understanding of the problem and can reproduce it, consider submitting a pull request.
Note that connection recovery was simplified just a few days ago in https://github.com/rabbitmq/rabbitmq-dotnet-client/pull/656, too.
It seems that neither the old nor the new version of connection recovery actually solves the problem when an exception occurs while trying to recover the topology. The exception is only logged; nothing is done to recover the consumer. But if the client reconnects once again, recovery will be triggered once again.
Probably the simplest way to reproduce this is a cluster with two durable, non-HA queues, each living on a different node. If we stop one of the nodes, the client reconnects to the other node but cannot start consuming from the queue that currently has no master node, and throws an exception. The expected behavior would be for consumer recovery to be retried indefinitely (as connection recovery is), in this case until the node that owns the queue comes back online.
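Here is a minimal client-side sketch of that scenario (the host and queue names are illustrative assumptions; the cluster setup itself is done on the server side):

```csharp
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

// Repro sketch: assume "node1" hosts queue-a and "node2" hosts queue-b,
// both queues durable and non-mirrored.
var factory = new ConnectionFactory
{
    AutomaticRecoveryEnabled = true,
    TopologyRecoveryEnabled = true
};

IConnection connection = factory.CreateConnection(new[] { "node1", "node2" });
IModel channel = connection.CreateModel();

// Durable, non-HA queue living on the node we will stop.
channel.QueueDeclare("queue-a", durable: true, exclusive: false,
                     autoDelete: false, arguments: null);

var consumer = new EventingBasicConsumer(channel);
consumer.Received += (s, e) => channel.BasicAck(e.DeliveryTag, multiple: false);
channel.BasicConsume("queue-a", autoAck: false, consumer);

// Now stop the node hosting queue-a: the client reconnects to the other
// node, topology recovery throws while re-declaring/re-consuming queue-a,
// the exception is only logged, and the consumer is never recovered.
```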
What this means is that we cannot trust topology recovery; instead, we need to implement our own wrapper around it to do the recovery ourselves (see the sketch after the code references below).
You can actually check the code: https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/v5.1.1/projects/client/RabbitMQ.Client/src/client/impl/AutorecoveringConnection.cs#L892 and https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/master/projects/client/RabbitMQ.Client/src/client/impl/AutorecoveringConnection.cs#L800. When `RecordedConsumer.Recover` throws an exception, that consumer is never retried: the exception is only logged, and nothing else is done with it. And there is actually no way to intercept that exception and act on it.
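The kind of wrapper we ended up needing looks roughly like this (a sketch of user-side code, not library API; `RecoverConsumerForever` and the retry policy are our own inventions for illustration):

```csharp
using System;
using System.Threading.Tasks;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;
using RabbitMQ.Client.Exceptions;

static class ConsumerRecovery
{
    // Workaround sketch: after every successful connection recovery, retry
    // consumer setup until it succeeds instead of trusting topology recovery.
    public static void RecoverConsumerForever(IAutorecoveringConnection connection, string queue)
    {
        connection.RecoverySucceeded += async (s, e) =>
        {
            while (true)
            {
                try
                {
                    // A failed passive declare closes the channel, so a
                    // fresh one is needed on every attempt.
                    IModel channel = connection.CreateModel();
                    channel.QueueDeclarePassive(queue); // throws while the queue has no live master
                    var consumer = new EventingBasicConsumer(channel);
                    consumer.Received += (cs, ce) => channel.BasicAck(ce.DeliveryTag, multiple: false);
                    channel.BasicConsume(queue, autoAck: false, consumer);
                    return;
                }
                catch (OperationInterruptedException)
                {
                    // Queue (or its master node) still unavailable; wait and retry.
                    await Task.Delay(TimeSpan.FromSeconds(5));
                }
            }
        };
    }
}
```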
This library cannot know how it should recover from topology recovery failures. It works very well for a pretty significant number of users. The docs do not promise that it will cover every case.
A contribution that makes it possible to react to topology exceptions would be considered.
Retry logic and filtering for recovery have been added in the Java client. Even though this is not a trivial task, a PR based on the Java implementation would be welcome.
This will be addressed by #1312
@rosca-sabina @mikenorgate
- https://github.com/rabbitmq/rabbitmq-dotnet-client/releases/tag/v6.5.0
- https://www.nuget.org/packages/RabbitMQ.Client/6.5.0
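For completeness, a minimal sketch of how the topology recovery hooks shipped in 6.5.0 can be used. The property and delegate shapes below are assumptions based on the release notes; check the 6.5.0 API for the exact signatures:

```csharp
using System;
using RabbitMQ.Client;

// Sketch only: names and signatures are assumptions, not verified
// against the shipped 6.5.0 API.
var factory = new ConnectionFactory
{
    AutomaticRecoveryEnabled = true,
    TopologyRecoveryEnabled = true,
    // Skip automatic recovery for entities the application manages itself.
    TopologyRecoveryFilter = new TopologyRecoveryFilter
    {
        QueueFilter = q => !q.Name.StartsWith("manual-"),
        ConsumerFilter = c => true
    },
    // React to topology recovery failures instead of having them only logged.
    TopologyRecoveryExceptionHandler = new TopologyRecoveryExceptionHandler
    {
        QueueRecoveryExceptionCondition = (q, ex) => true,
        QueueRecoveryExceptionHandler = (q, ex, conn) =>
            Console.WriteLine($"Queue {q.Name} failed to recover: {ex.Message}")
    }
};
```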