
WorkerLostError: Worker exited prematurely: exitcode 0 when shutting down worker

Open tabascoterrier opened this issue 8 years ago • 60 comments

I'm new to Celery, so do correct me if I'm missing the obvious!

When pressing Ctrl+C (or sending SIGTERM) to the Celery process, I'm getting WorkerLostErrors most of (but not all of) the time:

[2016-11-29 16:18:07,837: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: exitcode 0.',)
Traceback (most recent call last):
  File "/home/ubuntu/venv/lib/python3.5/site-packages/celery/worker/worker.py", line 203, in start
    self.blueprint.start(self)
  File "/home/ubuntu/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home/ubuntu/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 370, in start
    return self.obj.start()
  File "/home/ubuntu/venv/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/home/ubuntu/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home/ubuntu/venv/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 593, in start
    c.loop(*c.loop_args())
  File "/home/ubuntu/venv/lib/python3.5/site-packages/celery/worker/loops.py", line 88, in asynloop
    next(loop)
  File "/home/ubuntu/venv/lib/python3.5/site-packages/kombu/async/hub.py", line 289, in create_loop
    events = poll(poll_timeout)
  File "/home/ubuntu/venv/lib/python3.5/site-packages/kombu/utils/eventio.py", line 84, in poll
    return self._epoll.poll(timeout if timeout is not None else -1)
  File "/home/ubuntu/venv/lib/python3.5/site-packages/celery/apps/worker.py", line 282, in _handle_request
    raise exc(exitcode)
celery.exceptions.WorkerShutdown: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/venv/lib/python3.5/site-packages/billiard/pool.py", line 1224, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 0.

I'm seeing this on both macOS 10.12.1 and Ubuntu 16.04

To reproduce:

Load the queue: python task_test.py

Run the worker: celery worker -A task_test --loglevel=INFO --concurrency=10

Give it a few seconds for some tasks to run and complete, and Ctrl+C the worker.

Expected behaviour: Tasks exit cleanly

Observed behaviour: WorkerLostErrors

Minimal test case task_test.py:

import time
import random
from celery import Celery

app = Celery('tasks', broker='redis://127.0.0.1')

@app.task(name='test_task')
def test_task(wait_for):
    time.sleep(wait_for)

if __name__ == '__main__':
    for x in range(1000):
        test_task.delay(random.randrange(1, 10))

celery -A task_test report

software -> celery:4.0.0 (latentcall) kombu:4.0.0 py:3.5.2
            billiard:3.5.0.2 redis:2.10.5
platform -> system:Linux arch:64bit, ELF imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:disabled

broker_url: 'redis://127.0.0.1:6379//'


tabascoterrier avatar Nov 29 '16 16:11 tabascoterrier

I've not been able to reproduce this here on macOS 10.12.2beta, also using Python 3.5.2.

Is there any particular error message on macOS? The epoll one is definitely Linux, and that error shouldn't happen AFAIK.

ask avatar Nov 30 '16 00:11 ask

Thanks for looking at this @ask - very much the same error for me on macOS, I've attached a run on macOS using --loglevel=DEBUG if that's of any use.

Interesting that you can't reproduce it - I believe my environments to be fairly vanilla (certainly the Ubuntu one was a fresh venv, with just celery installed), but I'll poke around some more as time allows and see if I can't glean any more insight into what's going on.

debug.txt

tabascoterrier avatar Nov 30 '16 14:11 tabascoterrier

I can chime in and say that I'm having the same issue (macOS 10.12.1 and Python 3.5.2).

It seems to mostly come up when sending SIGINT during longer-running tasks; if I bail during a series of small tasks, Celery wraps those up and shuts down cleanly.

I have a series of longer running tasks that are run in a chord that produce this error 100% of the time.
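
For anyone trying to reproduce the chord variant, a minimal sketch along the lines of the OP's task_test.py could look like this (chord_test.py, long_task and summarize are made-up names, and a result backend is enabled because chords need one):

import time
from celery import Celery, chord

# Chords require a result backend, so one is configured alongside the broker.
app = Celery('chord_test',
             broker='redis://127.0.0.1',
             backend='redis://127.0.0.1')

@app.task(name='long_task')
def long_task(wait_for):
    time.sleep(wait_for)
    return wait_for

@app.task(name='summarize')
def summarize(results):
    return sum(results)

if __name__ == '__main__':
    # Queue a chord of long-running tasks, then Ctrl+C the worker
    # (celery worker -A chord_test) while they are still executing.
    chord(long_task.s(30) for _ in range(10))(summarize.s())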

sterutkb avatar Dec 13 '16 14:12 sterutkb

FWIW, this also happens on

  • Ubuntu 16.04 (xenial) with python 2.7.12 inside of a virtualenv (my local test)
software -> celery:4.0.2 (latentcall) kombu:4.0.2 py:2.7.12
            billiard:3.5.0.2 redis:2.10.5
platform -> system:Linux arch:64bit, ELF imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:disabled

broker_url: u'redis://127.0.0.1:6379//'

  • Debian 8.5 (jessie) with python 2.7.9 outside of a virtualenv
software -> celery:4.0.1 (latentcall) kombu:4.0.1 py:2.7.9
            billiard:3.5.0.2 redis:2.10.5
platform -> system:Linux arch:64bit, ELF imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:redis://:**@video-redis.video.eqiad.wmflabs:6379/1

CELERY_TASK_RESULT_EXPIRES: 2592000
BROKER_URL: u'redis://:********@video-redis.video.eqiad.wmflabs:6379/2'
CELERY_ACCEPT_CONTENT: ['json', 'pickle']
CELERY_RESULT_BACKEND: u'redis://:********@video-redis.video.eqiad.wmflabs:6379/1'

zhuyifei1999 avatar Dec 22 '16 15:12 zhuyifei1999

Ubuntu 16.10

software -> celery:4.0.2 (latentcall) kombu:4.0.2 py:3.5.2+
            billiard:3.5.0.2 py-amqp:2.1.4
platform -> system:Linux arch:64bit imp:CPython
loader   -> celery.loaders.default.Loader
settings -> transport:amqp results:db+sqlite:///results.sqlite

I'm also getting this, likewise with tasks that are run in chords (https://github.com/mathiasose/CA-NEAT/blob/242abe4d6969d91fdc732b2d4e86092f2842febc/ca_neat/run_experiment.py#L58).

I also had this issue on Celery 3 before upgrading.

mathiasose avatar Feb 14 '17 15:02 mathiasose

Has this issue been solved? I'm hitting this problem too.

Wesley-Li avatar Jul 29 '17 07:07 Wesley-Li

I hit this problem when exiting my program from within the program itself.

I call

exit(0)

and then Celery raises this exception.
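
The pattern is roughly this (exit_test.py and exits_early are illustrative names, not my real code):

import sys
from celery import Celery

app = Celery('exit_test', broker='redis://127.0.0.1')

@app.task(name='exits_early')
def exits_early():
    # exit()/sys.exit() raise SystemExit, a BaseException subclass, so the
    # pool child process terminates with exitcode 0 instead of reporting a
    # task failure; the parent then logs the WorkerLostError shown above.
    sys.exit(0)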

kingname avatar Nov 09 '17 05:11 kingname

I am seeing this error as well, running Celery 4.0.2 with supervisord (I'm not on 4.1.0 due to this PR: https://github.com/celery/celery/pull/4173 not being released yet).

I'm also a bit confused about whether we should or should NOT be using stopasgroup in supervisor configs. The sample config in master has it set to true, but this thread from last year mentions turning off stopasgroup.

Currently the process is stuck in a STOPPING state in supervisord.

error:

[2017-12-06 09:54:19,361: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
Traceback (most recent call last):
  File "/home/me/app/env/lib/python3.5/site-packages/celery/worker/worker.py", line 203, in start
    self.blueprint.start(self)
  File "/home/me/app/env/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home/me/app/env/lib/python3.5/site-packages/celery/bootsteps.py", line 370, in start
    return self.obj.start()
  File "/home/me/app/env/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/home/me/app/env/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/home/me/app/env/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 594, in start
    c.loop(*c.loop_args())
  File "/home/me/app/env/lib/python3.5/site-packages/celery/worker/loops.py", line 77, in asynloop
    raise WorkerShutdown(should_stop)
celery.exceptions.WorkerShutdown: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/me/app/env/lib/python3.5/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).

celery report:

software -> celery:4.0.2 (latentcall) kombu:4.1.0 py:3.5.2
            billiard:3.5.0.3 py-amqp:2.2.2
platform -> system:Linux arch:64bit, ELF imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:amqp results:disabled

supervisor conf:

[program:worker_myproject]
command=/home/me/app/env/bin/newrelic-admin run-program /home/me/app/env/bin/celery -A ovmyproject worker --autoscale=7,1 -n worker_myproject.%%h -Q myproject --loglevel=INFO
process_name=%(program_name)s
numprocs=1
numprocs_start=0
priority=998
autostart=true
autorestart=true
startsecs=10
startretries=3
exitcodes=0,2
stopsignal=TERM
stopwaitsecs=1200
stopasgroup=true
killasgroup=true
user=me
redirect_stderr=false
stdout_logfile=/var/log/celery/worker_myproject.log
stdout_logfile_maxbytes=50MB
stdout_logfile_backups=10
stdout_capture_maxbytes=0
stdout_events_enabled=false
stderr_logfile=/var/log/celery/worker_myproject.log
stderr_logfile_maxbytes=50MB
stderr_logfile_backups=10
stderr_capture_maxbytes=0
stderr_events_enabled=false
environment=NEW_RELIC_CONFIG_FILE="/home/me/app/newrelic.ini",NEW_RELIC_ENVIRONMENT="production"
directory=/home/me/app/src
serverurl=AUTO

udbhav avatar Dec 06 '17 15:12 udbhav

I tried removing stopasgroup and killasgroup; now the error is billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 0. Would that mean I need to raise the amount of time that supervisor waits before killing the process?

I tried asking my question about stopasgroup on IRC, but didn't get much of a response. I'd like to reiterate that it's quite confusing to have two opposite answers to the question within official Celery channels, but no clear and definitive section in the docs to reference.

udbhav avatar Dec 18 '17 19:12 udbhav

could you also try on master?

auvipy avatar Dec 19 '17 15:12 auvipy

@auvipy on master celery, kombu, and billiard? Is there any reason to believe a recent commit fixed the issue?

charettes avatar Dec 19 '17 15:12 charettes

@charettes I've seen many cases of people using older versions and complaining about issues, some of which have since been fixed, but we don't keep track of them well. The last reporter seems to be using an older release, and since we can't verify all the issues ourselves, we ask reporters to verify against the latest master branch.

auvipy avatar Dec 20 '17 10:12 auvipy

I'm getting this:

[2018-01-02 13:21:46,478: INFO/ForkPoolWorker-4] worker_mongo.dirty_deeds_mongo[e555476c-d108-4721-b902-10c70bb67ff4]: Working on <CENSORED>
^C
worker: Hitting Ctrl+C again will terminate all running tasks!

worker: Warm shutdown (MainProcess)
[2018-01-02 13:23:24,704: INFO/ForkPoolWorker-7] worker_mongo.dirty_deeds_mongo[d75bc84b-6265-45c7-b70a-156172fe5b14]: Working on <CENSORED>
[2018-01-02 13:45:08,060: INFO/ForkPoolWorker-2] worker_mongo.dirty_deeds_mongo[f87ac4f7-364c-40de-ab4f-3a09aad24fe5]: Inserted 1408237 documents from file <CENSORED> for org <CENSORED>
[2018-01-02 13:45:10,510: INFO/ForkPoolWorker-2] Task worker_mongo.dirty_deeds_mongo[f87ac4f7-364c-40de-ab4f-3a09aad24fe5] succeeded in 1404.13060093s: True
[2018-01-02 13:45:21,279: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: exitcode 0.',)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
    human_status(exitcode)),
WorkerLostError: Worker exited prematurely: exitcode 0.

With celery==4.1.0 and billiard==3.5.0.3 on FreeBSD 11.

Later, I eventually run into:

[2018-01-02 13:51:21,076: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV).',)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
    human_status(exitcode)),
WorkerLostError: Worker exited prematurely: signal 11 (SIGSEGV).

AlmightyOatmeal avatar Jan 02 '18 19:01 AlmightyOatmeal

What's the command you're using to start the worker(s) @AlmightyOatmeal ?

DomHudson avatar Jan 02 '18 21:01 DomHudson

@DomHudson:

$ celery worker -A worker_mongo -l info --concurrency=4

AlmightyOatmeal avatar Jan 02 '18 22:01 AlmightyOatmeal

Perhaps there is a memory leak - what do you get if you run this command?

grep oom /var/log/messages

Is the WorkerLostError raised randomly during normal operation or only during shutdown like for the OP?

DomHudson avatar Jan 02 '18 23:01 DomHudson

@DomHudson,

As I mentioned earlier, this isn't Linux so your grep command will not work. There is no OOM condition. These errors only occur during shutdown. The SIGSEGV may be chalked up to a swapping issue, but the first message is completely unrelated and is still an issue whenever I try to shut down.

AlmightyOatmeal avatar Jan 02 '18 23:01 AlmightyOatmeal

@AlmightyOatmeal Sorry, my mistake.

DomHudson avatar Jan 03 '18 00:01 DomHudson

Hello,

I have the same issue, which happens randomly. Any update on this?

This is my config:

software -> celery:4.1.0 (latentcall) kombu:4.1.0 py:3.6.5
            billiard:3.5.0.3 py-amqp:2.3.2
platform -> system:Linux arch:64bit, ELF imp:CPython
loader   -> celery.loaders.default.Loader
settings -> transport:amqps results:disabled

WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',) 
Traceback (most recent call last):
  File "/home/vcap/deps/0/python/lib/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).

rafikmak avatar Aug 30 '18 14:08 rafikmak

I also have the same issue, which happens randomly. Any update on this?

MMohan1 avatar Sep 03 '18 04:09 MMohan1

@rafikmak could you try celery 4.2.1?

auvipy avatar Sep 08 '18 11:09 auvipy

I get the same thing in 4.2.1

LyleScott avatar Sep 26 '18 18:09 LyleScott

+1, I also have the same issue on 4.2.1.

Dog-Boss avatar Nov 20 '18 02:11 Dog-Boss

anyone got a solution for this issue?

auvipy avatar Nov 22 '18 08:11 auvipy

I have the same problem, and it is becoming critical... (Redis with prefork concurrency, Celery 4.2.1, kombu 4.2.1, billiard 3.5.0.4)

Here is the traceback:

  File "/opt/celery/venv/lib64/python3.6/site-packages/celery/worker/loops.py", line 91, in asynloop
    next(loop)
  File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/asynchronous/hub.py", line 354, in create_loop
    cb(*cbargs)
  File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/transport/redis.py", line 1040, in on_readable
    self.cycle.on_readable(fileno)
  File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/transport/redis.py", line 337, in on_readable
    chan.handlers[type]()
  File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/transport/redis.py", line 667, in _receive
    ret.append(self._receive_one(c))
  File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/transport/redis.py", line 697, in _receive_one
    message, self._fanout_to_queue[exchange])
  File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/transport/virtual/base.py", line 983, in _deliver
    callback(message)
  File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/transport/virtual/base.py", line 633, in _callback
    return callback(message)
  File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/messaging.py", line 624, in _receive_callback
    return on_m(message) if on_m else self.receive(decoded, message)
  File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/messaging.py", line 590, in receive
    [callback(body, message) for callback in callbacks]
  File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/messaging.py", line 590, in <listcomp>
    [callback(body, message) for callback in callbacks]
  File "/opt/celery/venv/lib64/python3.6/site-packages/celery/worker/pidbox.py", line 46, in on_message
    self.node.handle_message(body, message)
  File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/pidbox.py", line 129, in handle_message
    return self.dispatch(**body)
  File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/pidbox.py", line 101, in dispatch
    reply = handle(method, arguments)
  File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/pidbox.py", line 122, in handle_cast
    return self.handle(method, arguments)
  File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/pidbox.py", line 116, in handle
    return self.handlers[method](self.state, **arguments)
  File "/opt/celery/venv/lib64/python3.6/site-packages/celery/worker/control.py", line 520, in shutdown
    raise WorkerShutdown(msg)
celery.exceptions.WorkerShutdown: Got shutdown from remote

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/celery/venv/lib64/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 0.

dejlek avatar Nov 27 '18 11:11 dejlek

Same issue here; it reproduces with concurrency=10. As a workaround, CELERY_TASK_ACKS_LATE = True "fixes" the exception, but it causes another minor issue: tasks that were executing when termination occurred had already finished before the shutdown, AND they are invoked again after the worker restarts.
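
In config terms the workaround is roughly this (assuming an app object like the OP's; task_acks_late is presumably what CELERY_TASK_ACKS_LATE maps to under the Celery 4.x lowercase settings):

# With late acknowledgement a message is only acked after the task finishes,
# so a task that was running when the worker was terminated is redelivered
# and executed again after the worker restarts.
app.conf.task_acks_late = True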

zhiltsov-sigma avatar Nov 30 '18 15:11 zhiltsov-sigma

Did you find any solution?

auvipy avatar Dec 04 '18 08:12 auvipy

In my case, my workers were choking on objects that were children of BaseException. Changing them to Exception seems to have resolved my issue.

jjorissen52 avatar Jan 10 '19 21:01 jjorissen52

@jjorissen52 "choking on" as in "trying to pass those as parameters to the .delay() function"?

Edit: Probably more like "unhandled Exception inside the worker @celery.task function". So it is an unhandled exception, and a try block could solve it? In that case, is the setting of ignore_result as in @celery.task(ignore_result=True) intended to change any of that behavior?
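
To make the question concrete, a minimal sketch of the try-block idea (guard_test, guarded_task and might_raise_base_exception are made-up names):

from celery import Celery

app = Celery('guard_test', broker='redis://127.0.0.1')

def might_raise_base_exception():
    # Stand-in for whatever code raises a BaseException subclass.
    raise SystemExit(0)

@app.task(name='guarded_task')
def guarded_task():
    try:
        might_raise_base_exception()
    except BaseException as exc:
        if isinstance(exc, Exception):
            raise  # ordinary exceptions already surface as task failures
        # Anything else (SystemExit, KeyboardInterrupt, custom BaseException
        # subclasses) would terminate the pool child process; re-raise it as
        # a plain Exception so it is recorded as a task error instead.
        raise RuntimeError('task raised {!r}'.format(exc))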

luckydonald avatar Jan 22 '19 08:01 luckydonald

@luckydonald

Probably more like "unhandled Exception inside the worker @celery.task function".

That's the one!

from .celery import app


class ChildOfBaseException(BaseException):
    pass


class ChildOfException(Exception):
    pass


@app.task(bind=True)
def killed_by_base_exception(self):
    # Raising a BaseException subclass kills the pool child process,
    # producing the WorkerLostError shown below.
    raise ChildOfBaseException(':(')


@app.task(bind=True)
def lives_through_exception(self):
    # Raising an ordinary Exception subclass is reported as a normal
    # task failure and the worker keeps running.
    raise ChildOfException(':)')

Yields

[2019-01-22 14:14:21,562: ERROR/ForkPoolWorker-8] Task celery_crash.tasks.lives_through_exception[b6552400-8b3c-4381-a978-4b766b9d19f7] raised unexpected: ChildOfException(':)',)
Traceback (most recent call last):
  File "/home/me/venvs/celery_crash/lib/python3.6/site-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/me/venvs/celery_crash/lib/python3.6/site-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/home/me/Project/celery_crash/celery_crash/tasks.py", line 19, in lives_through_exception
    raise ChildOfException(':)')
celery_crash.tasks.ChildOfException: :)
[2019-01-22 14:14:27,637: INFO/MainProcess] Received task: celery_crash.tasks.killed_by_base_exception[d576126e-b353-4654-b0bb-12f327dad38c]  
[2019-01-22 14:14:27,654: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: exitcode 0.',)
Traceback (most recent call last):
  File "/home/me/venvs/celery_crash/lib/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 0.

Edit: I'm noticing now that this thread was started by workers getting lost specifically when they are shut down, so maybe my contribution is off topic.

jjorissen52 avatar Jan 22 '19 14:01 jjorissen52