Task.delay() hangs forever when RabbitMQ is down

Open thanhpd-teko opened this issue 3 years ago • 10 comments

Checklist

  • [x] I have verified that the issue exists against the master branch of Celery.
  • [x] This has already been asked to the discussion group first.
  • [x] I have read the relevant section in the contribution guide on reporting bugs.
  • [x] I have checked the issues list for similar or identical bug reports.
  • [x] I have checked the pull requests list for existing proposed fixes.
  • [x] I have checked the commit log to find out if the bug was already fixed in the master branch.
  • [x] I have included all related issues and possible duplicate issues in this issue (If there are none, check this box anyway).

Mandatory Debugging Information

  • [x] I have included the output of celery -A proj report in the issue. (if you are not able to do this, then at least specify the Celery version affected).
  • [x] I have verified that the issue exists against the master branch of Celery.
  • [x] I have included the contents of pip freeze in the issue.
  • [x] I have included all the versions of all the external dependencies required to reproduce this bug.

Optional Debugging Information

  • [x] I have tried reproducing the issue on more than one Python version and/or implementation.
  • [x] I have tried reproducing the issue on more than one message broker and/or result backend.
  • [x] I have tried reproducing the issue on more than one version of the message broker and/or result backend.
  • [x] I have tried reproducing the issue on more than one operating system.
  • [x] I have tried reproducing the issue on more than one workers pool.
  • [x] I have tried reproducing the issue with autoscaling, retries, ETA/Countdown & rate limits disabled.
  • [x] I have tried reproducing the issue after downgrading and/or upgrading Celery and its dependencies.

Related Issues and Possible Duplicates

Related Issues

  • None

Possible Duplicates

  • None

Environment & Settings

Celery version:

celery report Output:


Steps to Reproduce

Required Dependencies

  • Minimal Python Version: 5.1.0
  • Minimal Celery Version: 5.1.2
  • Minimal Kombu Version: N/A or Unknown
  • Minimal Broker Version: N/A or Unknown
  • Minimal Result Backend Version: N/A or Unknown
  • Minimal OS and/or Kernel Version: N/A or Unknown
  • Minimal Broker Client Version: N/A or Unknown
  • Minimal Result Backend Client Version: N/A or Unknown

Python Packages

pip freeze Output:

celery==5.1.2
django-celery-beat==2.2.1
kombu==5.1.0

Other Dependencies

N/A

Minimally Reproducible Test Case

Start RabbitMQ, then call task.delay() => OK.
Stop RabbitMQ, then call task.delay() again => it hangs for about 5 minutes, then raises:
kombu.exceptions.OperationalError: failed to resolve broker hostname
I tried this config, but it has no effect: CELERY_BROKER_TRANSPORT_OPTIONS = {"max_retries": 3, "interval_start": 0, "interval_step": 0.2, "interval_max": 0.5}

Expected Behavior

An exception is raised within a few seconds.

Actual Behavior

The call hangs for several minutes.

thanhpd-teko avatar Aug 13 '21 09:08 thanhpd-teko

Hey @thanhpd-teko :wave:, Thank you for opening an issue. We will get back to you as soon as we can. Also, check out our Open Collective and consider backing us - every little helps!

We also offer priority support for our sponsors. If you require immediate assistance please consider sponsoring us.

By default, we retry connecting to the broker 100 times before giving up. See broker_connection_max_retries in the documentation. I'm aware this is unusually high for the producer side and this is definitely a design flaw. However, on the consumer side, we don't want to quit until we're certain the broker is down and not going to recover.

I can look into introducing a new configuration setting, but in the meantime you should set broker_connection_max_retries to a lower value on the producer side.

thedrow avatar Aug 17 '21 09:08 thedrow

Hi @thedrow, thanks for your response. I tried reducing the max retries, but it still hangs for a really long time. It's not working. :(

app.conf.broker_connection_timeout = 1
app.conf.broker_connection_max_retries = 1

thanhpd-teko avatar Aug 19 '21 10:08 thanhpd-teko

@thedrow, one more piece of debugging information. If I change the broker to Redis with these options:

app.conf.broker_transport_options = {
    'max_retries': 1,
    'interval_start': 0,
    'interval_step': 0.2,
    'interval_max': 0.2,
}

app.conf.broker_url = 'redis://broker:6379'
app.conf.broker_connection_timeout = 1
app.conf.broker_connection_max_retries = 1

It gives up very quickly (1-2 seconds).

But if the broker is RabbitMQ with the same configuration, it hangs for a very long time. Is there any difference between the two kinds of broker?

thanhpd-teko avatar Aug 20 '21 02:08 thanhpd-teko

This is a bug with our implementation. I'd need to try this myself to reproduce the bug.

thedrow avatar Aug 22 '21 14:08 thedrow

This bug is more about kombu than Celery. Kombu has two semantics when establishing a connection:

  1. directly failing:
>>> import kombu
>>> con = kombu.Connection('amqp://')
>>> con.connect()     # This call raises an exception immediately
Traceback (most recent call last):
  File "/home/matus/dev/kombu39/lib/python3.9/site-packages/amqp/transport.py", line 172, in _connect
    entries = socket.getaddrinfo(
  File "/usr/lib/python3.9/socket.py", line 953, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -9] Address family for hostname not supported

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/matus/dev/kombu/kombu/connection.py", line 275, in connect
    return self._ensure_connection(
  ...
  File "/home/matus/dev/kombu39/lib/python3.9/site-packages/amqp/transport.py", line 197, in _connect
    self.sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
  2. blocking until broker is back again:
>>> import kombu
>>> con = kombu.Connection('amqp://')
>>> con.default_channel      # Blocks until the broker is available again

To control this behaviour, use the transport_options parameter of the Connection constructor. The example you provided with transport_options should work; at least it works for me:

>>> import kombu
>>> con = kombu.Connection('amqp://', transport_options={'max_retries': 1, 'interval_start': 0, 'interval_step': 0.2, 'interval_max': 0.2})
>>> con.default_channel     # Raises immediately
Traceback (most recent call last):
  File "/home/matus/dev/kombu39/lib/python3.9/site-packages/amqp/transport.py", line 172, in _connect
    entries = socket.getaddrinfo(
  File "/usr/lib/python3.9/socket.py", line 953, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -9] Address family for hostname not supported

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/matus/dev/kombu/kombu/connection.py", line 447, in _reraise_as_library_errors
    yield
...
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
...
kombu.exceptions.OperationalError: [Errno 111] Connection refused
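
As a rough check on how little waiting the retry options can account for on their own, here is a minimal sketch of a linear backoff schedule with a cap. It is a simplified model and not kombu's actual code; `retry_delays` is a hypothetical helper, and the library's exact schedule may differ slightly. The option values are the ones from the original report.

```python
def retry_delays(max_retries, interval_start, interval_step, interval_max):
    """Return the sleep (in seconds) before each retry in a simplified model.

    Assumption: the delay grows linearly from interval_start by interval_step
    and is capped at interval_max. This is an illustration only.
    """
    return [
        min(interval_start + attempt * interval_step, interval_max)
        for attempt in range(max_retries)
    ]


# Option values taken from the original report.
options = {
    "max_retries": 3,
    "interval_start": 0,
    "interval_step": 0.2,
    "interval_max": 0.5,
}
delays = retry_delays(**options)
print("sleep before each retry:", delays)
print("total time spent sleeping:", sum(delays))
```

Under this model the three retries sleep for well under a second in total, so a hang of several minutes has to come from the individual connection attempts rather than from the retry intervals.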

A possible reason it takes so long for you is some kind of network error that makes the client take longer to detect that the connection is broken. The transport options only control how kombu retries after a network error; the network itself can still delay the error (for example, a timeout in a network protocol), and that cannot be fixed by the kombu library.
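
One way to see where the time goes is to time a single connection attempt outside Celery and kombu. The sketch below uses only the standard library; `time_connect_attempt` is a hypothetical helper and not part of either library. Note that the `timeout` passed to `socket.create_connection` bounds the TCP connect, but hostname resolution happens before that and is not covered by it.

```python
import socket
import time


def time_connect_attempt(host, port, timeout=1.0):
    """Try one TCP connection and return (ok, seconds_elapsed, error).

    The timeout applies to the TCP connect only; a slow DNS lookup inside
    getaddrinfo can still make the call take longer than `timeout`.
    """
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - started, None
    except OSError as exc:
        # OSError covers refused connections, timeouts and DNS failures
        # (socket.gaierror is a subclass of OSError).
        return False, time.monotonic() - started, exc
```

Calling `time_connect_attempt("broker", 5672)` while RabbitMQ is stopped (the hostname `broker` is taken from the Redis example earlier in the thread, and 5672 is assumed to be the AMQP port in use) shows whether a single attempt fails quickly with a refused connection or stalls, for instance in hostname resolution.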

matusvalo avatar Sep 09 '21 20:09 matusvalo

Hi @matusvalo, should I start the Celery task within a thread so that it won't block the main thread? Is there any issue with that?

from threading import Thread

thread = Thread(target=sum.delay)
thread.start()

thanhpd-teko avatar Sep 21 '21 01:09 thanhpd-teko

Technically you can offload this to a different thread, but you also need to understand everything that entails, e.g.:

  1. you need to handle the blocked thread; there is no easy way to kill a blocked thread.
  2. you need to make it a daemon thread so that it does not block the main process during termination, etc.

As mentioned before, if you don't like the blocking behaviour, you can set retries and timeouts accordingly.
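
The two caveats above can be sketched with the standard library alone. `call_with_deadline` below is a hypothetical helper, not part of Celery or kombu: it runs a blocking callable in a daemon thread and stops waiting after a deadline. The blocked thread is not killed; it is only abandoned, which is the limitation described in point 1.

```python
import threading


def call_with_deadline(func, timeout, *args, **kwargs):
    """Run func(*args, **kwargs) in a daemon thread, waiting at most `timeout` seconds.

    Returns the result, re-raises the callable's exception, or raises
    TimeoutError if the call has not finished in time. On timeout the worker
    thread keeps running in the background; Python cannot kill it.
    """
    outcome = {}

    def runner():
        try:
            outcome["result"] = func(*args, **kwargs)
        except Exception as exc:
            outcome["error"] = exc

    # daemon=True so an abandoned thread does not keep the process alive at exit.
    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout)

    if worker.is_alive():
        raise TimeoutError(f"call did not finish within {timeout} seconds")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["result"]
```

With a task named `add` (an assumption for illustration), `call_with_deadline(add.delay, 2.0, 1, 2)` would stop waiting after two seconds. The publish attempt itself continues in the background, so the message may still be sent later if the broker comes back.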

matusvalo avatar Sep 23 '21 10:09 matusvalo

A kombu exception is raised when using Django signals with pytest. I need to know a possible fix for these errors so I can stop running into them.

iamunadike avatar Jul 15 '22 19:07 iamunadike