django-q2

qcluster stuck forever when initial connection failed in ORM broker

Open ntap-fge opened this issue 7 months ago • 0 comments

When starting qcluster the following happens:

  • Cluster.start() starts a new process, calls Sentinel() in it and waits until Sentinel emits start_event
  • Sentinel() instantiates the broker through get_broker()
  • The broker __init__() calls get_connection()

If the connection attempt fails in the ORM broker, get_connection() raises an exception. As a result, the Sentinel process dies and the main process waits forever for start_event. There is no indication from the outside (besides the log entries) that the qcluster is permanently non-functional.
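To illustrate the kind of guard that would avoid the indefinite wait, here is a minimal sketch. It is not django-q2 code: wait_for_start and SentinelDied are hypothetical names, and the only assumption is that the parent has an event object with wait(timeout) and some way to ask whether the child is still alive.

```python
import threading
import time


class SentinelDied(RuntimeError):
    """Hypothetical error: the child exited before signalling startup."""


def wait_for_start(start_event, is_alive, poll_interval=0.1, timeout=None):
    """Wait for start_event, but stop waiting if the child is gone.

    start_event: object with wait(timeout) -> bool, such as a
                 threading.Event or multiprocessing.Event.
    is_alive:    zero-argument callable, for example process.is_alive.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    # Poll in short intervals instead of blocking on the event forever.
    while not start_event.wait(poll_interval):
        if not is_alive():
            raise SentinelDied("child exited before setting start_event")
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("child did not signal start in time")
    return True
```

With a multiprocessing.Process named p, the call would look like wait_for_start(start_event, p.is_alive). The caller can then log the failure and exit with a non-zero status instead of hanging.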

The root cause seems to be that ORM.__init__(), through get_connection(), actually tries to establish a connection. Redis, on the other hand, seems to only set up the client without opening any network connection. In the best case the eager connection is unnecessary overhead: self.connection is never used, and the constructor is called from the Sentinel process, so the pusher process needs to establish a new connection anyway. In the worst case the hang described above happens.

I think ideally Broker.connection should be initialized lazily. That would generally reduce the amount of code that is run in the Sentinel process.
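A rough sketch of what lazy initialization could look like, using functools.cached_property from the standard library. LazyBroker and its connect argument are hypothetical stand-ins, not the actual django-q2 Broker class or its get_connection() signature.

```python
from functools import cached_property


class LazyBroker:
    """Hypothetical broker that defers connecting until first use."""

    def __init__(self, connect):
        # connect is an assumed zero-argument factory that opens the
        # connection; nothing is opened in the constructor itself.
        self._connect = connect

    @cached_property
    def connection(self):
        # Runs on first access only; the result is cached on the instance.
        # If the factory raises, nothing is cached and the next access
        # tries again.
        return self._connect()
```

With this shape, constructing the broker in the Sentinel process cannot fail on a connection error. The error would surface in whichever process first touches broker.connection.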

ntap-fge • Jul 12 '24 08:07