billiard
WorkerLostError: Worker exited prematurely: exitcode 0 when shutting down worker
I'm new to Celery, so do correct me if I'm missing the obvious!
When Ctrl+C-ing (or sending SIGTERM to) the Celery process, I'm getting WorkerLostErrors a lot of (but not all of) the time:
[2016-11-29 16:18:07,837: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: exitcode 0.',)
Traceback (most recent call last):
File "/home/ubuntu/venv/lib/python3.5/site-packages/celery/worker/worker.py", line 203, in start
self.blueprint.start(self)
File "/home/ubuntu/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/home/ubuntu/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 370, in start
return self.obj.start()
File "/home/ubuntu/venv/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/home/ubuntu/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/home/ubuntu/venv/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 593, in start
c.loop(*c.loop_args())
File "/home/ubuntu/venv/lib/python3.5/site-packages/celery/worker/loops.py", line 88, in asynloop
next(loop)
File "/home/ubuntu/venv/lib/python3.5/site-packages/kombu/async/hub.py", line 289, in create_loop
events = poll(poll_timeout)
File "/home/ubuntu/venv/lib/python3.5/site-packages/kombu/utils/eventio.py", line 84, in poll
return self._epoll.poll(timeout if timeout is not None else -1)
File "/home/ubuntu/venv/lib/python3.5/site-packages/celery/apps/worker.py", line 282, in _handle_request
raise exc(exitcode)
celery.exceptions.WorkerShutdown: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/venv/lib/python3.5/site-packages/billiard/pool.py", line 1224, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 0.
I'm seeing this on both macOS 10.12.1 and Ubuntu 16.04
To reproduce: Load the queue: python task_test.py
Run the worker: celery worker -A task_test --loglevel=INFO --concurrency=10
Give it a few seconds for some tasks to run and complete, and Ctrl+C the worker.
Expected behaviour: Tasks exit cleanly
Observed behaviour: WorkerLostErrors
Minimal test case task_test.py:
```python
import time
import random

from celery import Celery

app = Celery('tasks', broker='redis://127.0.0.1')

@app.task(name='test_task')
def test_task(wait_for):
    time.sleep(wait_for)

if __name__ == '__main__':
    for x in range(1000):
        test_task.delay(random.randrange(1, 10))
```
celery -A task_test report
software -> celery:4.0.0 (latentcall) kombu:4.0.0 py:3.5.2
billiard:3.5.0.2 redis:2.10.5
platform -> system:Linux arch:64bit, ELF imp:CPython
loader -> celery.loaders.app.AppLoader
settings -> transport:redis results:disabled
broker_url: 'redis://127.0.0.1:6379//'
> I'm seeing this on both macOS 10.12.1 and Ubuntu 16.04
I've not been able to reproduce this here on macOS 10.12.2beta, also using Python 3.5.2.
Is there any particular error message on macOS? The epoll one is definitely Linux, and that error shouldn't happen AFAIK.
Thanks for looking at this @ask - it's very much the same error for me on macOS. I've attached a run on macOS using --loglevel=DEBUG if that's of any use.
Interesting that you can't reproduce it - I believe my environments to be fairly vanilla (certainly the Ubuntu one was a fresh venv, with just celery installed), but I'll poke around some more as time allows and see if I can't glean any more insight into what's going on.
I can chime in and say that I'm having the same issue (macOS 10.12.1 and Python 3.5.2).
It seems to mostly come up when sending SIGINT during longer-running tasks; if I bail during a series of small tasks, Celery wraps those up and shuts down cleanly.
I have a series of longer-running tasks, run in a chord, that produce this error 100% of the time.
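For context, here is a rough sketch of that kind of workload (task names, durations, and broker/backend URLs are made up, and chords need a result backend configured); it illustrates the shape of the workload, not a guaranteed reproducer:

```python
# Hypothetical sketch of a chord of longer-running tasks, roughly the shape
# of the workload described above. Not taken from the reporter's code.
import time

from celery import Celery, chord

app = Celery('tasks', broker='redis://127.0.0.1', backend='redis://127.0.0.1')

@app.task
def slow_task(seconds):
    time.sleep(seconds)
    return seconds

@app.task
def summarize(results):
    # Chord callback: receives the list of results from the header tasks.
    return sum(results)

if __name__ == '__main__':
    # Ten longer-running tasks followed by a callback; sending SIGINT while
    # these are in flight is the scenario described above.
    chord(slow_task.s(60) for _ in range(10))(summarize.s())
```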
FWIW, this also happens on
- Ubuntu 16.04 (xenial) with Python 2.7.12 inside a virtualenv (my local test)
software -> celery:4.0.2 (latentcall) kombu:4.0.2 py:2.7.12
billiard:3.5.0.2 redis:2.10.5
platform -> system:Linux arch:64bit, ELF imp:CPython
loader -> celery.loaders.app.AppLoader
settings -> transport:redis results:disabled
broker_url: u'redis://127.0.0.1:6379//'
- Debian 8.5 (jessie) with Python 2.7.9 outside of a virtualenv
software -> celery:4.0.1 (latentcall) kombu:4.0.1 py:2.7.9
billiard:3.5.0.2 redis:2.10.5
platform -> system:Linux arch:64bit, ELF imp:CPython
loader -> celery.loaders.app.AppLoader
settings -> transport:redis results:redis://:**@video-redis.video.eqiad.wmflabs:6379/1
CELERY_TASK_RESULT_EXPIRES: 2592000
BROKER_URL: u'redis://:********@video-redis.video.eqiad.wmflabs:6379/2'
CELERY_ACCEPT_CONTENT: ['json', 'pickle']
CELERY_RESULT_BACKEND: u'redis://:********@video-redis.video.eqiad.wmflabs:6379/1'
Ubuntu 16.10
software -> celery:4.0.2 (latentcall) kombu:4.0.2 py:3.5.2+
billiard:3.5.0.2 py-amqp:2.1.4
platform -> system:Linux arch:64bit imp:CPython
loader -> celery.loaders.default.Loader
settings -> transport:amqp results:db+sqlite:///results.sqlite
I'm also getting this, likewise with tasks that are run in chords (https://github.com/mathiasose/CA-NEAT/blob/242abe4d6969d91fdc732b2d4e86092f2842febc/ca_neat/run_experiment.py#L58).
I also had this issue on Celery 3 before upgrading.
Has this issue been solved? I'm hitting this problem too.
I hit this problem when exiting my program from within the program itself: I call exit(0), and then Celery raises this exception.
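For reference, a minimal sketch of that pattern (the task name is hypothetical): exit(0) raises SystemExit, which is a BaseException subclass, so it escapes the task and takes the pool process down with it, and the main process then reports a WorkerLostError.

```python
# Hypothetical sketch of calling exit() from inside a task.
import sys

from celery import Celery

app = Celery('tasks', broker='redis://127.0.0.1')

@app.task
def stop_everything():
    # sys.exit() raises SystemExit (a BaseException subclass); it is not
    # treated as a normal task failure, so the ForkPoolWorker process exits
    # and the main process reports "Worker exited prematurely: exitcode 0".
    sys.exit(0)
```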
I am seeing this error as well, running Celery 4.0.2 with supervisord (I'm not on 4.1.0 due to this PR: https://github.com/celery/celery/pull/4173 not being released yet).
I'm also a bit confused about whether we should or should NOT be using stopasgroup in supervisor configs. The sample config in master has it set to true, but this thread from last year mentions turning off stopasgroup.
Currently the process is stuck in a STOPPING state in supervisord.
error:
[2017-12-06 09:54:19,361: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).',)
Traceback (most recent call last):
File "/home/me/app/env/lib/python3.5/site-packages/celery/worker/worker.py", line 203, in start
self.blueprint.start(self)
File "/home/me/app/env/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/home/me/app/env/lib/python3.5/site-packages/celery/bootsteps.py", line 370, in start
return self.obj.start()
File "/home/me/app/env/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/home/me/app/env/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/home/me/app/env/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 594, in start
c.loop(*c.loop_args())
File "/home/me/app/env/lib/python3.5/site-packages/celery/worker/loops.py", line 77, in asynloop
raise WorkerShutdown(should_stop)
celery.exceptions.WorkerShutdown: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/me/app/env/lib/python3.5/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
celery report:
software -> celery:4.0.2 (latentcall) kombu:4.1.0 py:3.5.2
billiard:3.5.0.3 py-amqp:2.2.2
platform -> system:Linux arch:64bit, ELF imp:CPython
loader -> celery.loaders.app.AppLoader
settings -> transport:amqp results:disabled
supervisor conf:
[program:worker_myproject]
command=/home/me/app/env/bin/newrelic-admin run-program /home/me/app/env/bin/celery -A ovmyproject worker --autoscale=7,1 -n worker_myproject.%%h -Q myproject --loglevel=INFO
process_name=%(program_name)s
numprocs=1
numprocs_start=0
priority=998
autostart=true
autorestart=true
startsecs=10
startretries=3
exitcodes=0,2
stopsignal=TERM
stopwaitsecs=1200
stopasgroup=true
killasgroup=true
user=me
redirect_stderr=false
stdout_logfile=/var/log/celery/worker_myproject.log
stdout_logfile_maxbytes=50MB
stdout_logfile_backups=10
stdout_capture_maxbytes=0
stdout_events_enabled=false
stderr_logfile=/var/log/celery/worker_myproject.log
stderr_logfile_maxbytes=50MB
stderr_logfile_backups=10
stderr_capture_maxbytes=0
stderr_events_enabled=false
environment=NEW_RELIC_CONFIG_FILE="/home/me/app/newrelic.ini",NEW_RELIC_ENVIRONMENT="production"
directory=/home/me/app/src
serverurl=AUTO
I tried removing stopasgroup and killasgroup; now the error is billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 0.
Would that mean I need to raise the amount of time that supervisor will wait before killing a process?
I tried asking my question about stopasgroup on IRC, but didn't get much response. I'd like to reiterate that it's quite confusing to have two opposite answers to this question within official Celery channels, but no clear and definitive section in the docs to reference.
Could you also try on master?
@auvipy on master celery, kombu, and billiard? Is there any reason to believe a recent commit fixed the issue?
@charettes Because I have often seen people using older versions and reporting issues, some of which have since been fixed but which we don't keep good track of. The last reporter seems to be on an older release, and since we can't verify every issue, we ask reporters to verify against the latest master branch.
I'm getting this:
[2018-01-02 13:21:46,478: INFO/ForkPoolWorker-4] worker_mongo.dirty_deeds_mongo[e555476c-d108-4721-b902-10c70bb67ff4]: Working on <CENSORED>
^C
worker: Hitting Ctrl+C again will terminate all running tasks!
worker: Warm shutdown (MainProcess)
[2018-01-02 13:23:24,704: INFO/ForkPoolWorker-7] worker_mongo.dirty_deeds_mongo[d75bc84b-6265-45c7-b70a-156172fe5b14]: Working on <CENSORED>
[2018-01-02 13:45:08,060: INFO/ForkPoolWorker-2] worker_mongo.dirty_deeds_mongo[f87ac4f7-364c-40de-ab4f-3a09aad24fe5]: Inserted 1408237 documents from file <CENSORED> for org <CENSORED>
[2018-01-02 13:45:10,510: INFO/ForkPoolWorker-2] Task worker_mongo.dirty_deeds_mongo[f87ac4f7-364c-40de-ab4f-3a09aad24fe5] succeeded in 1404.13060093s: True
[2018-01-02 13:45:21,279: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: exitcode 0.',)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
human_status(exitcode)),
WorkerLostError: Worker exited prematurely: exitcode 0.
With celery==4.1.0 and billiard==3.5.0.3 on FreeBSD 11.
Later, I eventually run into:
[2018-01-02 13:51:21,076: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 11 (SIGSEGV).',)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
human_status(exitcode)),
WorkerLostError: Worker exited prematurely: signal 11 (SIGSEGV).
What's the command you're using to start the worker(s), @AlmightyOatmeal?
@DomHudson:
$ celery worker -A worker_mongo -l info --concurrency=4
Perhaps there is a memory leak - what do you get if you run this command?
grep oom /var/log/messages
Is the WorkerLostError raised randomly during normal operation, or only during shutdown, like for the OP?
@DomHudson,
As I mentioned earlier, this isn't Linux, so your grep command will not work. There is no OOM condition. These errors only occur during shutdown. The SIGSEGV may be chalked up to a swapping issue, but the first message is completely unrelated and is still an issue whenever I try to shut down.
@AlmightyOatmeal Sorry, my mistake.
Hello,
I have the same issue, which happens randomly. Any update on this?
This is my config:
software -> celery:4.1.0 (latentcall) kombu:4.1.0 py:3.6.5
billiard:3.5.0.3 py-amqp:2.3.2
platform -> system:Linux arch:64bit, ELF imp:CPython
loader -> celery.loaders.default.Loader
settings -> transport:amqps results:disabled
```
WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
File "/home/vcap/deps/0/python/lib/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
```
I also have the same issue, which happens randomly. Any update on this?
@rafikmak could you try celery 4.2.1?
I get the same thing in 4.2.1
+1, I also have the same issue on 4.2.1.
Anyone got a solution for this issue?
I have the same problem, and it is becoming critical... (Redis with prefork concurrency, Celery 4.2.1, kombu 4.2.1, billiard 3.5.0.4)
Here is the traceback:
File "/opt/celery/venv/lib64/python3.6/site-packages/celery/worker/loops.py", line 91, in asynloop
next(loop)
File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/asynchronous/hub.py", line 354, in create_loop
cb(*cbargs)
File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/transport/redis.py", line 1040, in on_readable
self.cycle.on_readable(fileno)
File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/transport/redis.py", line 337, in on_readable
chan.handlers[type]()
File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/transport/redis.py", line 667, in _receive
ret.append(self._receive_one(c))
File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/transport/redis.py", line 697, in _receive_one
message, self._fanout_to_queue[exchange])
File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/transport/virtual/base.py", line 983, in _deliver
callback(message)
File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/transport/virtual/base.py", line 633, in _callback
return callback(message)
File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/messaging.py", line 624, in _receive_callback
return on_m(message) if on_m else self.receive(decoded, message)
File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/messaging.py", line 590, in receive
[callback(body, message) for callback in callbacks]
File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/messaging.py", line 590, in <listcomp>
[callback(body, message) for callback in callbacks]
File "/opt/celery/venv/lib64/python3.6/site-packages/celery/worker/pidbox.py", line 46, in on_message
self.node.handle_message(body, message)
File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/pidbox.py", line 129, in handle_message
return self.dispatch(**body)
File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/pidbox.py", line 101, in dispatch
reply = handle(method, arguments)
File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/pidbox.py", line 122, in handle_cast
return self.handle(method, arguments)
File "/opt/celery/venv/lib64/python3.6/site-packages/kombu/pidbox.py", line 116, in handle
return self.handlers[method](self.state, **arguments)
File "/opt/celery/venv/lib64/python3.6/site-packages/celery/worker/control.py", line 520, in shutdown
raise WorkerShutdown(msg)
celery.exceptions.WorkerShutdown: Got shutdown from remote
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/celery/venv/lib64/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 0.
Same issue here; it reproduces with concurrency=10.
As a workaround, CELERY_TASK_ACKS_LATE = True "fixes" the exception, but it causes another minor issue: tasks that were executing when termination occurred had already finished before the shutdown, AND they will be invoked again after the worker restarts.
Did you find any solution?
In my case, my workers were choking on objects that were children of BaseException. Changing them to Exception seems to have resolved my issue.
@jjorissen52 "choking on" as in "trying to have those as parameters to the .delay-ed function"?
Edit: Probably more like "unhandled Exception inside the worker @celery.task function".
So it is an unhandled exception, and a try block could solve it?
In that case, is the ignore_result setting, as in @celery.task(ignore_result=True), intended to change any of that behavior?
@luckydonald
> Probably more like "unhandled Exception inside the worker @celery.task function".
That's the one!
```python
from .celery import app

class ChildOfBaseException(BaseException):
    pass

class ChildOfException(Exception):
    pass

@app.task(bind=True)
def killed_by_base_exception(self):
    raise ChildOfBaseException(':(')

@app.task(bind=True)
def lives_through_exception(self):
    raise ChildOfException(':)')
```
Yields
[2019-01-22 14:14:21,562: ERROR/ForkPoolWorker-8] Task celery_crash.tasks.lives_through_exception[b6552400-8b3c-4381-a978-4b766b9d19f7] raised unexpected: ChildOfException(':)',)
Traceback (most recent call last):
File "/home/me/venvs/celery_crash/lib/python3.6/site-packages/celery/app/trace.py", line 382, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/me/venvs/celery_crash/lib/python3.6/site-packages/celery/app/trace.py", line 641, in __protected_call__
return self.run(*args, **kwargs)
File "/home/me/Project/celery_crash/celery_crash/tasks.py", line 19, in lives_through_exception
raise ChildOfException(':)')
celery_crash.tasks.ChildOfException: :)
[2019-01-22 14:14:27,637: INFO/MainProcess] Received task: celery_crash.tasks.killed_by_base_exception[d576126e-b353-4654-b0bb-12f327dad38c]
[2019-01-22 14:14:27,654: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: exitcode 0.',)
Traceback (most recent call last):
File "/home/me/venvs/celery_crash/lib/python3.6/site-packages/billiard/pool.py", line 1223, in mark_as_worker_lost
human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: exitcode 0.
Edit: I'm noticing now that this thread was started by workers getting lost specifically when they are shut down, so maybe my contribution is off topic.
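For completeness, a rough sketch of the try-based containment asked about above (task and exception names are hypothetical). Catching the BaseException subclass inside the task and re-raising it as an ordinary Exception should keep the pool process alive; it only addresses this particular cause, not the shutdown-time errors earlier in the thread.

```python
# Hypothetical sketch: contain BaseException subclasses inside the task so
# they are recorded as normal task failures instead of killing the pool
# process.
from celery import Celery

app = Celery('tasks', broker='redis://127.0.0.1')

class ChildOfBaseException(BaseException):
    pass

@app.task(bind=True)
def survives_base_exception(self):
    try:
        raise ChildOfBaseException(':(')
    except BaseException as exc:
        # Re-raise as an ordinary Exception so Celery handles it as a task
        # failure. Note that a blanket `except BaseException` also catches
        # SystemExit and KeyboardInterrupt, so keep the guarded region small.
        raise RuntimeError('task raised {!r}'.format(exc)) from exc
```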