
Race condition causing celerybeat and celeryworker to fail

Open · malefice opened this issue 3 years ago · 7 comments

What happened?

There is a race condition where the celerybeat and/or celeryworker services will fail to start if the django-celery-beat migrations have not run. Bringing the stack up again once those migrations have run will resume normally as expected.

What should've happened instead?

While this can only occur under specific conditions (usually on the first run), it is best to guarantee that all services come up 100% out-of-the-box. The celerybeat and/or celeryworker services should have a health check in their entrypoint scripts prior to running their command scripts. I personally have some code for this if you are interested, but I want to hear your thoughts first.

Steps to reproduce

Tested on Ubuntu 18.04.4 LTS, Docker Engine version 19.03.12, docker-compose version 1.17.1

  1. Create a new project from this template as you normally would, but with Docker and Celery support.
  2. Build and bring the whole stack up. The race condition usually shows up for me on the first run.
  3. If it did not trigger the first time, drop the database volume and bring the stack up again. You may need to repeat this step multiple times (see the command sketch below).
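
A minimal sketch of the commands for step 3, assuming the template's default local.yml; the exact Postgres volume name is an assumption based on the volumes declared there:

docker-compose -f local.yml down
docker volume rm <project_slug>_local_postgres_data   # volume name assumed from the template's local.yml
docker-compose -f local.yml up --build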

Some screenshots

[screenshot]

[screenshot]

malefice · Aug 06 '20 08:08

Thanks for the great explanations in this bug report. It would be nice to fix it, yes. You're saying you have some code to fix it; I'd be interested in knowing more about your solution.

browniebroke · Aug 06 '20 09:08

I am currently using manage.py showmigrations app_name to do this. It lists all the migrations for the app, and a migration that has not yet been applied is marked with [ ], like so:

[screenshot: showmigrations output]

To ensure that all the migrations for django_celery_beat have run properly, we can grep the output of the command. Here is how I did it:

django_celery_migrations_complete() {
  # An unapplied migration is listed with "[ ]", so count those lines;
  # zero matches means everything has been applied.
  count=$(python manage.py showmigrations django_celery_beat | grep '\[ \]' | wc -l)
  if [[ $count -eq 0 ]]; then
    return 0
  fi
  return 1
}

until django_celery_migrations_complete; do
  >&2 echo 'Waiting for Django Celery migrations to complete...'
  sleep 5
done
>&2 echo 'Django Celery migrations have been applied'

The only thing I cannot avoid is a little code duplication. Since this requires database access, the excerpt above must be preceded by the same postgres health check present in the django service's entrypoint script. I am not sure how to keep things DRY, so I implemented it such that I have a separate entrypoint script for the django service, and another entrypoint script for the celery services like so:

# Inside docker-compose file:

  celerybeat:
    <<: *django
    image: celerybeat
    depends_on:
      - redis
      - postgres
    ports: []
    entrypoint: /celery-entrypoint
    command: /start-celerybeat

  celeryworker:
    <<: *django
    image: celeryworker
    depends_on:
      - redis
      - postgres
    ports: []
    entrypoint: /celery-entrypoint
    command: /start-celeryworker
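
For illustration, a minimal sketch of what such a /celery-entrypoint could look like, combining a postgres wait (modeled on the template's existing /entrypoint, so the details of that check are assumptions) with the migration check from above:

#!/bin/bash

set -o errexit
set -o nounset

# Wait for postgres first, since showmigrations needs database access.
# (Assumption: same psycopg2-based loop as the template's /entrypoint.)
postgres_ready() {
python << END
import sys

import psycopg2

try:
    psycopg2.connect(
        dbname="${POSTGRES_DB}",
        user="${POSTGRES_USER}",
        password="${POSTGRES_PASSWORD}",
        host="${POSTGRES_HOST}",
        port="${POSTGRES_PORT}",
    )
except psycopg2.OperationalError:
    sys.exit(-1)
sys.exit(0)
END
}

until postgres_ready; do
  >&2 echo 'Waiting for PostgreSQL to become available...'
  sleep 1
done
>&2 echo 'PostgreSQL is available'

# Then wait for the django-celery-beat migrations.
django_celery_migrations_complete() {
  count=$(python manage.py showmigrations django_celery_beat | grep '\[ \]' | wc -l)
  if [[ $count -eq 0 ]]; then
    return 0
  fi
  return 1
}

until django_celery_migrations_complete; do
  >&2 echo 'Waiting for Django Celery migrations to complete...'
  sleep 5
done
>&2 echo 'Django Celery migrations have been applied'

exec "$@"

The postgres check could be factored out into a shared file sourced by both entrypoints, which would address the duplication concern.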

malefice · Aug 06 '20 10:08

Just to clarify one small thing, are you having this issue locally or in production?

As a reminder, some notes on how things work locally:

  • We automatically run the migrations from the django container:

    https://github.com/pydanny/cookiecutter-django/blob/4978aa79135374c36ab1b9bcc07c6e4a3364eae8/%7B%7Bcookiecutter.project_slug%7D%7D/compose/local/django/start#L8

  • We don't have any dependencies between the django and celery services in the local.yml file:

    https://github.com/pydanny/cookiecutter-django/blob/1b5997466c93339d411d4a11c7d733d4d2250fbd/%7B%7Bcookiecutter.project_slug%7D%7D/local.yml#L14-L18

    https://github.com/pydanny/cookiecutter-django/blob/1b5997466c93339d411d4a11c7d733d4d2250fbd/%7B%7Bcookiecutter.project_slug%7D%7D/local.yml#L74-L79

I was initially thinking of making the celery services depend on django in the docker-compose file. Not sure of the drawbacks of such a solution though.

Another idea I've read about was to declare another service just for migrations and make all others depend on it.

The depends_on from docker-compose might not be enough, though. It might still start the celery containers too soon, in which case we'll need to run something inside the Celery start scripts to wait for migrations, a bit like you do.

One other drawback I foresee with the solution is that it depends on the output of the presentational showmigrations command, which isn't ideal. If the Django project decides to reformat the output, this will break.

I've had some issues in the past when trying to parse that output; it was fine for my toy project, but it's not something I'm too keen to add to a popular open-source project.
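
As an aside, if parsing is the main concern: Django 3.1 added a --check flag to migrate, which makes the command exit with a non-zero status when unapplied migrations are detected, so a wait loop could avoid parsing showmigrations output entirely. A sketch (note it checks all apps rather than just django_celery_beat, and it is not what the template ships):

until python manage.py migrate --check > /dev/null 2>&1; do
  >&2 echo 'Waiting for migrations to be applied...'
  sleep 5
done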

These are just my thoughts from a quick look, let me know if I've missed anything. Other ideas are welcome of course.

browniebroke · Aug 07 '20 10:08

Just to clarify one small thing, are you having this issue locally or in production?

I am running this locally, and I am aware that migrations will be run automatically on the django container for local development.

Another idea I've read about was to declare another service just for migrations and make all others depend on it.

The depends_on from docker-compose might not be enough, though. It might still start the celery containers too soon, in which case we'll need to run something inside the Celery start scripts to wait for migrations, a bit like you do.

I have tried using depends_on, and it does not work because, according to the Docker docs, depends_on only controls the order in which docker-compose starts the services; it does not wait for them to finish or check their results. On that note, delegating migrations to a separate container that the other django-based containers depend on will not work either. I tried that approach anyway just to gather these screenshots:

[screenshots]

In the screenshots, django-bootstrap is the container responsible for running migrations, and the other django-based containers "depend" on it in the compose file. You can clearly see that the start scripts do not run in the order one would expect; only the initial starting of the containers follows the dependency order.

One other drawback I foresee with the solution is that it depends on the output of the presentational showmigrations command, which isn't ideal. If the Django project decides to reformat the output, this will break.

The project already pins a Django version, so in my opinion that becomes less of an issue. An alternative is to issue raw SQL queries to verify that all the migrations for django-celery-beat have run, but that is much more complicated than fixing a simple parsing issue that can only occur when the pinned Django version is changed, and based on this, that does not happen often, if at all, at least for showmigrations.

Another alternative is to not fix this race condition at all. Just document it, and tell users to first build, then run docker-compose -f local.yml run --rm django python manage.py migrate, then bring the services up as part of first-time setup.
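
If the documentation route were taken, the first-time setup sequence would look something like this (a sketch of the steps just described):

docker-compose -f local.yml build
docker-compose -f local.yml run --rm django python manage.py migrate
docker-compose -f local.yml up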

malefice · Aug 07 '20 12:08

@malefice That's pretty much the reason for the Travis implementation for Docker. This initial migration issue that I wanted to fix (and obviously failed to) could have been avoided if celerybeat had its own entrypoint file that we could specify in the compose file.

Andrew-Chen-Wang · Aug 19 '20 23:08

If using docker-compose 1.29 or later, will the following condition work?

  depends_on:
    <service>:
      condition: service_completed_successfully

Also, we can always run migrations before starting the server.
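
For reference, that condition belongs to the long form of depends_on from the Compose specification. A sketch with a hypothetical migrations service (the service name and wiring are assumptions, not part of the template):

  migrations:
    <<: *django
    image: migrations
    depends_on:
      - postgres
    command: python manage.py migrate

  celeryworker:
    <<: *django
    image: celeryworker
    command: /start-celeryworker
    depends_on:
      migrations:
        condition: service_completed_successfully

With service_completed_successfully, the dependent service is only started once the dependency has run to completion and exited with status 0, so the worker would not start until migrate finishes.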

xjlin0 · Jan 16 '22 15:01

Someone tried to fix that issue in #4459, and after taking a further look at this, I'm not convinced it's a bug worth fixing. The error will happen once, the first time you start your server, and will pretty much go away forever. The error being returned is quite clear, and fixing it properly would involve hiding the problem.

I'm marking it as won't fix, unless someone comes up with a simple and foolproof solution...

browniebroke · Jan 20 '24 14:01