AiiDA will no longer work with rabbitmq>3.7 by default

Open chrisjsewell opened this issue 4 years ago • 17 comments

In https://github.com/rabbitmq/rabbitmq-server/pull/2990 a consumer_timeout has been introduced and set to 15 minutes, meaning that any process task that takes longer than 15 minutes will be cancelled 😬 (there are people in that PR none too happy that this was introduced in a minor version)

The quick fix for this for users is either (a) use rabbitmq 3.7 or lower, or (b) configure consumer_timeout to false. (see also https://www.rabbitmq.com/consumers.html#acknowledgement-timeout)

As is literally the last comment in that PR, at the time of writing, it is unclear to me off-hand if this can be done using the API (i.e. something aiida-core can handle automatically)?

chrisjsewell avatar Aug 30 '21 13:08 chrisjsewell

I feel maybe we can put this in the broker_parameters: https://github.com/aiidateam/aiida-core/blob/4174e5de3adbeec785290a02a0fc78d4597e42e0/aiida/manage/configuration/schema/config-v5.schema.json#L322

Two questions:

  1. Will rabbitmq<3.8 complain if passed a parameter that it does not know?
  2. Can we actually set the default to false? The documentation implies it has to be an integer, but in the PR they specifically mention false https://github.com/rabbitmq/rabbitmq-server/pull/2990#issuecomment-846033907

thoughts @sphuber?

chrisjsewell avatar Aug 30 '21 13:08 chrisjsewell

trying it out in #5106

chrisjsewell avatar Aug 30 '21 13:08 chrisjsewell

I remember looking into the default timeouts a long time ago and I think it is not a value that can be configured from the client. This has to be configured on the server itself. There was even a hardcoded maximum that could not be surpassed, so even if you put a value above it in the config, it would be capped at that maximum. This may have been for older versions of RabbitMQ (around 3.5) and I am not sure if that is still there. Their reasoning is that the main use case for RabbitMQ is "quick" jobs on the order of seconds.

sphuber avatar Aug 30 '21 13:08 sphuber

yeh cheers, #5106 does not appear to make rabbitmq fail, but obviously no idea yet if it is actually having any effect

chrisjsewell avatar Aug 30 '21 13:08 chrisjsewell

Hmm, yeh no joy yet; trying to set consumer_timeout to 1 in #5106, but that doesn't seem to fail anything

chrisjsewell avatar Aug 30 '21 14:08 chrisjsewell

Yeh no I guess it is not part of https://www.rabbitmq.com/uri-query-parameters.html#tls 😒

I asked about adding it: https://github.com/rabbitmq/rabbitmq-server/pull/2990#issuecomment-908405800, or maybe I should open an actual issue if they don't respond

chrisjsewell avatar Aug 30 '21 14:08 chrisjsewell

Ok opened: https://github.com/rabbitmq/rabbitmq-server/issues/3344 🤞

chrisjsewell avatar Aug 30 '21 15:08 chrisjsewell

> Ok opened: rabbitmq/rabbitmq-server#3344 🤞

Well that was a dead end (we kinda use rabbitmq in a way it is not designed for)

So why don't we just remove it entirely 😉 https://github.com/chrisjsewell/aiida-process-coordinator/discussions/4

chrisjsewell avatar Sep 06 '21 12:09 chrisjsewell

I just had the same issue: a Channel closed error for something running > 30 minutes. I checked and indeed I have rabbitmq 3.8.16. We'll probably need to focus on replacing rmq as soon as 2.0 is out... However, I'm sure many people will hit this error in 2.0, since recent versions are now >3.7.

Can we make this requirement more obvious? E.g. check in verdi status and print an error that the version of RMQ is not supported and one has to downgrade, at least for the time being?
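A minimal sketch of what such a check could look like, assuming the server version is already available as a plain string (for example from the broker's connection properties). The function names and the message are hypothetical and are not part of aiida-core or verdi; the 3.8 threshold is taken from the issue title and is an assumption.

```python
# Hypothetical helper, not part of aiida-core: flag RabbitMQ versions that
# ship with a default consumer_timeout.

# Assumption: threshold taken from the issue title ("rabbitmq>3.7").
FIRST_AFFECTED = (3, 8)


def parse_version(version: str) -> tuple:
    """Turn a version string such as '3.8.16' into a tuple of ints.

    Only plain dotted numeric versions are handled; suffixes such as
    '-rc.1' would raise a ValueError.
    """
    return tuple(int(part) for part in version.split(".")[:3])


def has_default_consumer_timeout(version: str) -> bool:
    """Return True if this version is assumed to enforce a consumer timeout."""
    return parse_version(version) >= FIRST_AFFECTED


def status_message(version: str) -> str:
    """Build the line a status command could print for the broker."""
    if has_default_consumer_timeout(version):
        return (
            f"RabbitMQ {version}: long-running tasks may be cancelled unless "
            "consumer_timeout is raised or disabled on the server, or the "
            "server is downgraded to 3.7."
        )
    return f"RabbitMQ {version}: no default consumer_timeout expected."


if __name__ == "__main__":
    for version in ("3.7.28", "3.8.16", "3.9.13"):
        print(status_message(version))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as "3.10" sorting before "3.9".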

giovannipizzi avatar Dec 16 '21 17:12 giovannipizzi

Adding link to another project encountering the same issue: https://github.com/celery/celery/issues/6760

chrisjsewell avatar Dec 28 '21 13:12 chrisjsewell

After accidentally getting my rabbitmq updated to 3.9.x I also faced this same issue. I would like to point out that the simplest way to downgrade rabbitmq is to use conda instead of the Debian package. Otherwise one needs to manually downgrade all dependencies like erlang, which has its own dependencies, and it creates a big mess.

So for anyone stumbling here, running the following is all that's required.

conda install -c conda-forge rabbitmq-server=3.7.28

Maybe @giovannipizzi @chrisjsewell we can add this in the wiki where you discuss this issue?

tsthakur avatar Aug 19 '22 16:08 tsthakur

yeh, as we have just been discussing, I think it is a nicer solution, in terms of dependency management (as opposed to apt or homebrew), but the downside is no automated setup of a background service, using e.g. launchctl (osx), systemd (linux)

Out of interest, I have just posted here, to ask about such a feature https://groups.google.com/a/anaconda.com/g/anaconda/c/z36jZTlJG8g

chrisjsewell avatar Aug 19 '22 17:08 chrisjsewell

I've just had the issue with the channel closed error, while running RabbitMQ v3.9.13. I have increased the consumer_timeout as per the documentation, but the jobs crashed after about 5 hours. I have some even older jobs running now, so I'm not sure if this is related to the timeout.

Going through the RabbitMQ documentation, I have noticed a possible mistake in the AiiDA documentation. It suggests:

# 100 hours in milliseconds (increase if you expect your workflows to run longer)
consumer_timeout = 3600000

however this appears to actually correspond to 1 hour, which is also what the RabbitMQ documentation says.

Zeleznyj avatar Nov 01 '22 14:11 Zeleznyj

Thanks for the report @Zeleznyj. Indeed, our wiki is incorrect and that is one hour, which would explain the error. Could you try to up it to, let's say, 3600000000 (1000 hours, just to be on the safe side) and restart the RabbitMQ service? Make sure to stop the daemon first and restart it when RabbitMQ is back up and running.

I will update the wiki now.
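For reference, the unit conversion behind these numbers can be checked with a short sketch. The helper names are hypothetical; the only fact assumed is that consumer_timeout is given in milliseconds, as stated above.

```python
# consumer_timeout is expressed in milliseconds.
MS_PER_HOUR = 60 * 60 * 1000  # 3_600_000


def hours_to_ms(hours: int) -> int:
    """Convert a duration in hours to milliseconds."""
    return hours * MS_PER_HOUR


def ms_to_hours(milliseconds: int) -> float:
    """Convert a duration in milliseconds to hours."""
    return milliseconds / MS_PER_HOUR


if __name__ == "__main__":
    print(ms_to_hours(3_600_000))  # the value from the wiki: 1.0 hour
    print(hours_to_ms(100))        # 100 hours: 360000000
    print(hours_to_ms(1000))       # 1000 hours: 3600000000
```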

sphuber avatar Nov 01 '22 15:11 sphuber

I have tried increasing it, let's see if that helps, but the error is clearly somewhat random.

I have encountered the error before and thought it was related to this since I'm running AiiDA on a laptop, but this time the computer was on the whole time the jobs were running.

Zeleznyj avatar Nov 01 '22 15:11 Zeleznyj

Has anyone ever tried using the advanced.config to disable the timeout completely? The documentation (https://www.rabbitmq.com/consumers.html#acknowledgement-timeout) specifies that this should be possible by adding the following to a file named advanced.config:

%% advanced.config
[
  {rabbit, [
    {consumer_timeout, undefined}
  ]}
].

ahkole avatar Jan 27 '23 14:01 ahkole

@ahkole I tried RabbitMQ 3.11.4 with the advanced config:

cat > ~/rabbitmq.notimeout.advanced.config <<EOF 
%% advanced.config
[
  {rabbit, [
    {consumer_timeout, undefined}
  ]}
].
EOF
export RABBITMQ_ADVANCED_CONFIG_FILE=~/rabbitmq.notimeout.advanced.config
rabbitmq-server

and everything worked as expected

rikigigi avatar Mar 22 '23 13:03 rikigigi