MQ disconnection to be handled
When the MQ server is restarted (and it may happen), the client is blocked. It should reconnect. @wkrzemien can you please look at this urgently? Thanks!
Ok, I am going to look at it
@chaen @fstagni To which branch should I open the PR with the corrections? The solution I propose is the following: in case of a disconnected state (e.g. after an MQ server restart), the client tries to reconnect in a loop with some time delay. I would also add a counter so that after the n-th attempt it stops. Any suggestions for n and the time delay values?
v6r21. Although it will only go in one of the first patch releases.
n should be huge, infinite, or configurable :-)
For delays, just take the log scale of your counter :)
I would say, for example: start the delay at 2 seconds and increase it by 4 seconds on each loop.
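For illustration, a minimal sketch of the retry loop being discussed, with the linearly growing delay suggested above. `connect_to_broker` and the default limits are placeholders, not DIRAC code:

    import time

    def reconnect_with_backoff(connect_to_broker, max_attempts=10000,
                               initial_delay=2.0, delay_increase=4.0):
        """Retry connect_to_broker() with a linearly growing delay between attempts.

        connect_to_broker is a placeholder callable returning True on success.
        """
        delay = initial_delay
        for _attempt in range(max_attempts):
            if connect_to_broker():
                return True
            time.sleep(delay)
            delay += delay_increase
        return False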
Thank you. I implemented something similar to the solution proposed here: http://jasonrbriggs.github.io/stomp.py/api.html#disconnect but unfortunately it does not cover all the cases, so I must dig more to understand what is going on.
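For reference, the pattern from the linked page boils down to reconnecting from a listener's on_disconnected() callback. A minimal sketch of that idea, assuming a recent stomp.py (host, port and credentials are placeholders):

    import time
    import stomp
    from stomp.exception import ConnectFailedException

    class ReconnectListener(stomp.ConnectionListener):
        """Tries to re-establish the connection when the broker drops it."""

        def __init__(self, connection):
            self.connection = connection

        def on_disconnected(self):
            # Called by stomp.py whenever the connection is lost.
            while not self.connection.is_connected():
                try:
                    self.connection.connect('user', 'password', wait=True)
                except ConnectFailedException:
                    time.sleep(2)

    conn = stomp.Connection([('mq.example.org', 61613)])
    conn.set_listener('reconnect', ReconnectListener(conn))
    # Note: older stomp.py releases require conn.start() before connect().
    conn.connect('user', 'password', wait=True)

As noted below, this example turns out to be misleading because disconnect()/close() also triggers on_disconnected().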
By any chance, did you do any hotfix on lbvobox102? There is a strange behavior. The exception pasted below resulted in many, many connections being opened on the ActiveMQ side (they reported it).
2019-03-13 12:06:01 UTC Framework/SystemAdministrator ERROR: Exception while executing scheduled task
Traceback (most recent call last):
  File "/opt/dirac/pro/DIRAC/Core/Utilities/ThreadScheduler.py", line 186, in __executeTask
    task[ 'func' ]( *task[ 'args' ] )
  File "/opt/dirac/pro/DIRAC/FrameworkSystem/Service/SystemAdministratorHandler.py", line 776, in __storeProfiling
    gMonitoringReporter.commit()
  File "/opt/dirac/pro/DIRAC/MonitoringSystem/Client/MonitoringReporter.py", line 123, in commit
    result = self.processRecords()
  File "/opt/dirac/pro/DIRAC/MonitoringSystem/Client/MonitoringReporter.py", line 61, in processRecords
    result = createConsumer("Monitoring::Queue::%s" % self.__failoverQueueName)
  File "/opt/dirac/pro/DIRAC/Resources/MessageQueue/MQCommunication.py", line 37, in createConsumer
    callback = callback ) )
  File "/opt/dirac/pro/DIRAC/Resources/MessageQueue/MQConsumer.py", line 22, in __init__
    result = connector.subscribe( parameters = {'messengerId':self._id, 'callback':callback, 'destination':self._destination} )
  File "/opt/dirac/pro/DIRAC/Resources/MessageQueue/StompMQConnector.py", line 190, in subscribe
    headers=headers)
  File "/opt/dirac/pro/Linux_x86_64_glibc-2.12/lib/python2.7/site-packages/stomp/protocol.py", line 417, in subscribe
    self.send_frame(CMD_SUBSCRIBE, headers)
  File "/opt/dirac/pro/Linux_x86_64_glibc-2.12/lib/python2.7/site-packages/stomp/protocol.py", line 250, in send_frame
    self.transport.transmit(frame)
  File "/opt/dirac/pro/Linux_x86_64_glibc-2.12/lib/python2.7/site-packages/stomp/transport.py", line 269, in transmit
    self.send(encode(packed_frame))
  File "/opt/dirac/pro/Linux_x86_64_glibc-2.12/lib/python2.7/site-packages/stomp/transport.py", line 627, in send
    raise exception.NotConnectedException()
NotConnectedException
If you did not do a hotfix on lbvobox102, maybe a separate issue is in order.
No, I haven't touched it.
It turns out that the stomp API has some parameters corresponding to the reconnection delay, the number of attempts, etc. I added them, and so far they pass the tests with the scripts I was running against my local RabbitMQ server. If someone could have a look and review it, that would be great. It is not very long. Then I would add the docs. Meanwhile I am going to test it more.
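For context, a sketch of how such reconnection knobs can be passed to a stomp.py connection. The keyword names below exist at least in the 4.x series but may differ between versions, and the host, port and values are placeholders, not the ones used in the PR:

    import stomp

    conn = stomp.Connection(
        host_and_ports=[('mq.example.org', 61613)],
        reconnect_sleep_initial=2.0,    # first delay before retrying, in seconds
        reconnect_sleep_increase=0.5,   # grow the delay by 50% on each attempt
        reconnect_sleep_max=120.0,      # cap on the delay between attempts
        reconnect_attempts_max=10000,   # effectively "huge", as suggested above
    )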
Hopefully solved for the producer case in #4006. Also, the exception mentioned in the previous post is caught now.
@chaen
I am able to reproduce the bug. It is indeed the race condition that appears when more than one broker is running and the close() method is invoked. I was unable to spot it in my tests, since they all use a single broker. Now I am going to cover this case too.
For the record:
- The original solution proposed here http://jasonrbriggs.github.io/stomp.py/api.html#disconnect and discussed in the forum, which I used as an example, is wrong, or at least misleading. At least in the current version of stomp.py, the disconnect() method will provoke an on_disconnected() call in the listener.
- IMHO the current infernal code with multiple brokers should be refactored. I have already partly done it. I understand that we cannot apply it to production now, but I would vote for the change after setting up the battery of tests in Jenkins. This is a piece of code we already know is buggy (e.g. the consumer case with subscribe) and really hard to maintain, as the current situation shows.
Just for the record: the bug has no direct connection with the multiple-broker case. It just happened that my integration tests were structured in such a way that it was not possible to catch the error with a single broker, while with multiple brokers it showed up immediately. So, in the end, it is an infinite loop: when the producer is closed by calling close(), the listener tries to reconnect() immediately, as explained above. That does not change my opinion about the current multiple-broker code, which we should refactor, together with fixing the consumer case.
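To illustrate the fix for that loop, one way to break it is a guard flag that distinguishes an intentional close() from an unexpected drop. This is only a sketch under that assumption; the class, the flag name and the credentials handling are hypothetical, not the actual connector code:

    import stomp

    class ReconnectListener(stomp.ConnectionListener):
        """Reconnects on unexpected drops, but stays quiet after an intentional close()."""

        def __init__(self, connection):
            self.connection = connection
            self.closed_on_purpose = False   # hypothetical guard flag

        def on_disconnected(self):
            # disconnect()/close() also end up here, so check the guard first.
            if self.closed_on_purpose:
                return
            self.connection.connect('user', 'password', wait=True)

    def close(connection, listener):
        # Set the flag *before* disconnecting so the listener does not fight back.
        listener.closed_on_purpose = True
        connection.disconnect()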
Is there still something to be done for this one?
Up! Question above: is there still something to be done for this one? Or can it be closed?
So: 1. The MQ disconnection issue is solved; the fix has been merged and it seems to work fine. 2. There are still some fixes that should be added to the consumer part. I already have them partly done. I will prepare a separate PR for them for integration.
Keeping this task open for later consumer fixes, but it's not urgent anymore.
Does this task still need to be kept open now? Is #4466 fixing it fully?
I think it can be closed.
I do not know if it solves it completely, since we can't reproduce it clearly
Fair point.