DIRAC icon indicating copy to clipboard operation
DIRAC copied to clipboard

MQ disconnection to be handled

Open chaen opened this issue 6 years ago • 19 comments

When the MQ server are restarted (and it may happen), the client then is blocked. It should reconnect. @wkrzemien can you please look at this urgently ? Thanks !

chaen avatar Mar 11 '19 15:03 chaen

Ok, I am going to look at it

wkrzemien avatar Mar 12 '19 07:03 wkrzemien

@chaen @fstagni To which branch should I open the PR with the corrections? The solution I propose is the following: In case of disconnect state (e.g. MQ server restart) the client is going to try to reconnect in a loop with some time delay. I would also add a counter that after n-th tries it should stop it. Any suggestions for n , and time delay values?

wkrzemien avatar Mar 12 '19 09:03 wkrzemien

v6r21. Although it will only go in one of the first patch release. n should be huge, infinite, or configurable :-) For delays, just take the log scale of your counter :)

chaen avatar Mar 12 '19 09:03 chaen

I would say for example: start the delay from 2 second and increase it in each loop by 4 second

zmathe avatar Mar 12 '19 09:03 zmathe

Thank you. I implemented something similar to the solution proposed here: http://jasonrbriggs.github.io/stomp.py/api.html#disconnect but unfortunately it does not cover all the cases, so I must dig more to understand what is going on.

wkrzemien avatar Mar 12 '19 17:03 wkrzemien

Per chance, did you do any hotfix on lbvobox102 ? There is a strange behavior. The exception pasted bellow resulted in many many connections being opened on the ActiveMQ (they reported it)

2019-03-13 12:06:01 UTC Framework/SystemAdministrator ERROR: Exception while executing scheduled task
Traceback (most recent call last):
  File "/opt/dirac/pro/DIRAC/Core/Utilities/ThreadScheduler.py", line 186, in __executeTask
    task[ 'func' ]( *task[ 'args' ] )
  File "/opt/dirac/pro/DIRAC/FrameworkSystem/Service/SystemAdministratorHandler.py", line 776, in __storeProfiling
    gMonitoringReporter.commit()
  File "/opt/dirac/pro/DIRAC/MonitoringSystem/Client/MonitoringReporter.py", line 123, in commit
    result = self.processRecords()
  File "/opt/dirac/pro/DIRAC/MonitoringSystem/Client/MonitoringReporter.py", line 61, in processRecords
    result = createConsumer("Monitoring::Queue::%s" % self.__failoverQueueName)
  File "/opt/dirac/pro/DIRAC/Resources/MessageQueue/MQCommunication.py", line 37, in createConsumer
    callback = callback ) )
  File "/opt/dirac/pro/DIRAC/Resources/MessageQueue/MQConsumer.py", line 22, in __init__
    result = connector.subscribe( parameters = {'messengerId':self._id, 'callback':callback, 'destination':self._destination} )
  File "/opt/dirac/pro/DIRAC/Resources/MessageQueue/StompMQConnector.py", line 190, in subscribe
    headers=headers)
  File "/opt/dirac/pro/Linux_x86_64_glibc-2.12/lib/python2.7/site-packages/stomp/protocol.py", line 417, in subscribe
    self.send_frame(CMD_SUBSCRIBE, headers)
  File "/opt/dirac/pro/Linux_x86_64_glibc-2.12/lib/python2.7/site-packages/stomp/protocol.py", line 250, in send_frame
    self.transport.transmit(frame)
  File "/opt/dirac/pro/Linux_x86_64_glibc-2.12/lib/python2.7/site-packages/stomp/transport.py", line 269, in transmit
    self.send(encode(packed_frame))
  File "/opt/dirac/pro/Linux_x86_64_glibc-2.12/lib/python2.7/site-packages/stomp/transport.py", line 627, in send
    raise exception.NotConnectedException()
NotConnectedException

If you did not do hotfix on lbvobox102, maybe a separate issue is in order

chaen avatar Mar 13 '19 12:03 chaen

No, I haven't touch it

wkrzemien avatar Mar 13 '19 12:03 wkrzemien

It occured that in the API of stomp there are some parameters corresponding to reconnection delay, number of attempts etc. I added it and so far it passed the tests with scripts I was performing with my local RabbitMQ server. If one can have a look and review it, it would be great. It is not very long. Then I would add the docs. Meanwhile I am going to try to test it more.

wkrzemien avatar Mar 13 '19 14:03 wkrzemien

Hopefully solved for producer case in #4006 . Also, the exception mentioned in previous post is catched now.

wkrzemien avatar Mar 30 '19 14:03 wkrzemien

@chaen
I am able to reproduce the bug - It is indeed the race condition case which appears when more than one broker is running and the close() method is invoked. I was unable to spot it in my tests, since all are using single broker. Now I am going to cover this case too. For the record:

  1. The original solution proposed here http://jasonrbriggs.github.io/stomp.py/api.html#disconnect and discussed in the forum - that I used as an example - is wrong, or at least misleading. At least in the current version of stomp.py the disconnect() method will provoque on_disconnected() call in the listener.
  2. IMHO the current infernal code with multiple brokers should be refactored. I have it already partly done. I understand that we cannot apply it to production now, but I would vote for the change after setting up the battery of tests in Jenkins. This is a piece of code we already now that is buggy (e.g. consumer case with subscribe), and really hard to maintain, as the current situations shows.

wkrzemien avatar Apr 16 '19 19:04 wkrzemien

Just for the record. The bug has not direct connection with multiple broker case. Just accidentally my integration tests were structured in such a form that it was not possible to catch the error for single broker, while for multiple brokers it showed up immediately. So, at the end it is infinite loop when producer is closed by called close() the listener tries to reconnect() immediately as explained above. That does not change my opinion about current multiple broker code, that we should refactored - together with fixing consumer case.

wkrzemien avatar Apr 27 '19 16:04 wkrzemien

Is there still something to be done for this one?

fstagni avatar May 29 '19 15:05 fstagni

Up! Question above: is there still something to be done for this one? Or it can be closed?

fstagni avatar Jul 11 '19 15:07 fstagni

So: 1.The MQ disconnection issue is solved and it has been merged and it seems to work fine 2.There are still some fixes that should be added to the consumer part. I have it already partly done it. I will prepare the separate PR for it for the intergration.

wkrzemien avatar Jul 12 '19 07:07 wkrzemien

Keeping this task open for later consumer fixes, but it's not urgent anymore.

fstagni avatar Jul 12 '19 08:07 fstagni

Does this task still need to be kept open now? is #4466 fixing it fully?

fstagni avatar Mar 03 '20 16:03 fstagni

I think it can be closed.

wkrzemien avatar Mar 04 '20 10:03 wkrzemien

I do not know if it solves it completely, since we can't reproduce it clearly

chaen avatar Mar 04 '20 12:03 chaen

Fair point.

wkrzemien avatar Mar 04 '20 12:03 wkrzemien