fix #254 Channel.Publish blocks indefinitely
I encountered a bug similar to this comment: https://github.com/streadway/amqp/issues/254#issuecomment-316401642
However, in a weak network environment, I get a deadlock after N+1 messages have been sent where N is the size of the buffered channel.
How to Reproduce:
- weak network enviroment (I was using
Network Link ConditionerwithVery Bad Networkconfig) - An external public network rabbitmq
- 10 goroutines sending messages to rabbitmq (I was using this library https://github.com/shanbay/gobay/blob/v0.15.0/extensions/busext/amqp.go#L127)
Then it will deadlock.
The sequence of events is:
- Receive an old message's ack ->
One()->resequence()->confirm()send ack to channel - Receive another old message's ack ->
One() get lock->resequence()->confirm()send ack to channel -> block (channel buffer == 1) (and lockconfirms.mis not released) - I am sending a message using
Publish-> Failed to get lock
This time it will keep blocking; simply because the buffer is 1 but two acks are received at the same time (although I sent messages serially, we can indeed reproduce this extreme case).
I don't think Publish needs to use the same lock as the other methods, it's just a self-incrementing. Even if the program is in resequence(), there may not be a problem with Publish() running at this point.
I have had a very similar issue lately. There isn't anything special about the reproducing, there would just be a random deadlock which would prevent all the other Publish to this channel. It would lock up for at least an hour (didn't check longer time periods, but I am pretty sure that the deadlock would still be there).
However, there was still one weird problem. At about the time when the deadlock appeared, I could see the following log in RabbitMQ Container: [error] <...> closing AMQP connection <...>, so I am assuming that some connection problems could have caused this mixup in acks and lead to deadlock.
I completely agree with this pull request as I think that semantically Publish has to have another lock since it is just an increment which can be done more efficiently as well as more clearly with atomic operation.
@michaelklishin @streadway
same question here. @michaelklishin @streadway