girder_worker icon indicating copy to clipboard operation
girder_worker copied to clipboard

Check if Celery Pinning is still necessary

Open kotfic opened this issue 5 years ago • 23 comments

@manthey I seem to recall we pinned celery to celery>=4,<4.2 because some of the things you were working on were having issues with connection pool dropping? Do you think you could see if this is still an issue with 4.2.1?

kotfic avatar Feb 18 '19 14:02 kotfic

It looks like celery master has this fixed, but it is listed as targeted for v4.3 (https://github.com/celery/celery/issues/4867).

manthey avatar Feb 18 '19 15:02 manthey

I wouldn't expect a v4.3 release anytime soon, see https://github.com/celery/celery/issues/4957. In other projects I've just had to pick a sha from master and pin to it.

danlamanna avatar Feb 18 '19 17:02 danlamanna

Well that aged well: https://github.com/celery/celery/releases/tag/v4.3.0rc1.

danlamanna avatar Feb 26 '19 19:02 danlamanna

I've run into this problem again with celery 4.3. It does seem resolved in 4.4.0rc3. I think we should re-pin to celery <4.2 and >= 4.4.

manthey avatar Nov 04 '19 13:11 manthey

@manthey are there any open issues to indicate other people are seeing the issue with 4.3? I'd like to avoid pinning unless it's clear that 4.4 is going to fix a systemic problem with 4.3 related to connection pool dropping.

kotfic avatar Nov 07 '19 14:11 kotfic

The celery issue https://github.com/celery/celery/issues/4867 has several reports of this still being a problem in 4.3.

In working with slicer_cli_web and girder_worker, on celery 4.3.0 I get the problem "often" -- seemingly more than half the time I initiate a second job. On celery 4.4.0rc3, I could not produce the problem.

We shouldn't have unpinned this without verifying that it was fixed -- I only didn't notice because the project I am using girder_worker on was using an older version where the pinned celery version was in place.

manthey avatar Nov 07 '19 14:11 manthey

Is this an issue that can be resolved downstream at build/deploy time? I would be hesitant to forcibly revert celery versions across all downstream projects.

kotfic avatar Nov 07 '19 15:11 kotfic

celery 4.2 and 4.3 are broken using the rabbitmq broker. We recommend rabbitmq in the readthedocs. This would suggest either pinning celery or having a caveat about using rabbitmq in the docs. I don't know what projects are using this successfully. Are they using a different broker?

manthey avatar Nov 07 '19 16:11 manthey

To be clear, Celery 4.2 and 4.3 - in conjunction with an unknown set of deployment configurations - are showing intermittent errors for some users. Until it is resolved in Celery, it should be mitigated in downstream projects that are experiencing the issue.

kotfic avatar Nov 07 '19 17:11 kotfic

Right. Let's pin celery to mitigate the problem.

manthey avatar Nov 07 '19 18:11 manthey

Projects that somehow work despite the broken version of celery could override this -- our default installation should work rather than be broken.

manthey avatar Nov 07 '19 18:11 manthey

I don't know that I have followed this completely... if I understand correctly, the problem is that if you pin celery here, you can't override it downstream. When you try to import, python will throw a VersionConflict exception.

jbeezley avatar Nov 07 '19 18:11 jbeezley

I was reading the celery issue thread but not in super detail... is the issue in celery or kombu?

zachmullen avatar Nov 07 '19 18:11 zachmullen

Its not clear what the issue is, there is no consistently reproducible minimum working example of the error.

kotfic avatar Nov 07 '19 18:11 kotfic

https://python-rq.org/

🤷‍♂

zachmullen avatar Nov 07 '19 18:11 zachmullen

@manthey Is there a practical reason this can't be mitigated downstream in your project code or is your concern more ideological?

kotfic avatar Nov 07 '19 19:11 kotfic

Of course it can be mitigated downstream.

I wasted quite a bit of time trying to figure out what was wrong with code when the problem was that girder_worker recommends a broken set of packages -- rabbitmq with current celery. Unless we prevent it (by pinning celery), or advice against using rabbitmq (and have a working example of a different broker), anyone trying to use this project is doomed to sorrow and wasted time. In trying to use the latest release of girder_worker, it works the first time and then fails "randomly".

manthey avatar Nov 07 '19 19:11 manthey

I'm glad to hear that in the short term this can be mitigated downstream. Unless everyone else is secretly having this problem with Girder Worker and just not speaking up that's the fix until 4.4.0 is released.

Can I get some feedback from @girder/developers as to whether this widely observed behavior?

kotfic avatar Nov 08 '19 13:11 kotfic

I haven't seen it, though to be fair I'm not currently administering lots of G_W jobs in the wild.

I'm on board with not trying to fix this here given that we don't actually even know where the problem is.

zachmullen avatar Nov 08 '19 14:11 zachmullen

girder_worker recommends a broken set of packages

Earlier in the thread it was mentioned that this couldn't be minimally reproduced, is that not the case? If such a repro exists I could be persuaded that it's appropriate to try to fix this in g_worker. Barring that, my suggestion would be that we should add a warning to the documentation along the lines of "what to do if you see this error", and recommend downstream pinning.

zachmullen avatar Nov 08 '19 14:11 zachmullen

I'm not using girder-worker, but anecdotally I've been running Celery 4.3.0 with RabbitMQ on multiple projects for several months without running into this issue.

For what it's worth, overly specific pinning has caused my downstream projects a lot of pain in the past since it's more likely to lead to an unresolvable set of packages.

danlamanna avatar Nov 08 '19 15:11 danlamanna

It is useful to know that Celery 4.3 (particularly via girder_worker) is being used successfully with RabbitMQ. I'll make a PR for adding to the documentation to the effect of "if you see this problem, try this".

manthey avatar Nov 08 '19 15:11 manthey

Great, thanks @manthey

zachmullen avatar Nov 08 '19 15:11 zachmullen