richie icon indicating copy to clipboard operation
richie copied to clipboard

Richie Gunicorn worker freezes and no stack trace on log

Open igobranco opened this issue 4 years ago • 5 comments

During my work with putting NAU richie site installation on premises, instead of using the AWS like FUN, I found a problem that made the Gunicorn worker to timeout. The problem was something strange, that probably won't happen on other installations. The virtual machine that has the Richie didn't had connectivity to the on-premises S3 Bucket provider (Ceph).

My problem was not having at least a single line of stack trace on richie app logs. I don't know if anyone has some related strange problems, but after trying different approaches I found that this works and I can see the errors !!

So for us the solution was to add this block to the end of 'app.py' file on nau-richie-site-factory.

Note: I think the problem is not related to Richie core, but to our installation.

Code block

# Handler that forces Gunicorn to print the current stack trace, when its worker timeout’s.
# This can happen when a request hangs waiting for an external service dependency, like when a
# database is too slow or can connect to the S3 Bucket provider. The worker receives the SIGABRT
# signal from the Gunicorn parent process and this handler runs before the worker it stopped.
# Reference:
# https://stackoverflow.com/questions/15442361/dump-django-stack-trace-on-gunicorn-timeout
def worker_abort(worker):
    import traceback, io
    debug_info = io.StringIO()
    debug_info.write("Traceback at time of timeout:\n")
    traceback.print_stack(file=debug_info)
    worker.log.critical(debug_info.getvalue())

Log lines that I could view on richie_app docker container after adding the code block on 'app.py' file.

[2021-09-02 16:20:13 +0100] [13] [INFO] Worker exiting (pid: 13)
[2021-09-02 15:20:13 +0000] [17] [INFO] Booting worker with pid: 17
[2021-09-02 15:21:02 +0000] [10] [CRITICAL] WORKER TIMEOUT (pid:14)
[2021-09-02 16:21:02 +0100] [14] [CRITICAL] Traceback at time of timeout:
  File "/usr/local/bin/gunicorn", line 8, in <module>
    sys.exit(run())
  File "/usr/local/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 58, in run
    WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/app/base.py", line 228, in run
    super().run()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/app/base.py", line 72, in run
    Arbiter(self).run()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/arbiter.py", line 202, in run
    self.manage_workers()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/arbiter.py", line 545, in manage_workers
    self.spawn_workers()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/arbiter.py", line 616, in spawn_workers
    self.spawn_worker()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/arbiter.py", line 583, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/base.py", line 140, in init_process
    self.run()
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 123, in run
    self.run_for_one(timeout)
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 67, in run_for_one
    self.accept(listener)
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 29, in accept
    self.handle(listener, client, addr)
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 134, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 175, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/wsgi.py", line 133, in __call__
    response = self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/base.py", line 130, in get_response
    response = self._middleware_chain(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/utils/deprecation.py", line 114, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/utils/deprecation.py", line 114, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/utils/deprecation.py", line 114, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/utils/deprecation.py", line 114, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/utils/deprecation.py", line 114, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/utils/deprecation.py", line 114, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/utils/deprecation.py", line 114, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/utils/deprecation.py", line 114, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/utils/deprecation.py", line 114, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/utils/deprecation.py", line 114, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/utils/deprecation.py", line 114, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/utils/deprecation.py", line 114, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/utils/deprecation.py", line 114, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/utils/deprecation.py", line 114, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/utils/deprecation.py", line 114, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/utils/deprecation.py", line 114, in __call__
    response = response or self.get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.7/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/filer/admin/clipboardadmin.py", line 115, in ajax_upload
    file_obj.save()
  File "/usr/local/lib/python3.7/site-packages/filer/models/imagemodels.py", line 30, in save
    exif_date = self.exif.get('DateTimeOriginal', None)
  File "/usr/local/lib/python3.7/site-packages/filer/models/abstract.py", line 88, in _get_exif
    self._exif_cache = get_exif_for_file(self.file)
  File "/usr/local/lib/python3.7/site-packages/filer/utils/pil_exif.py", line 19, in get_exif_for_file
    im = PILImage.open(storage.open(file_obj.name), 'r')
  File "/usr/local/lib/python3.7/site-packages/django/core/files/storage.py", line 38, in open
    return self._open(name, mode)
  File "/usr/local/lib/python3.7/site-packages/storages/backends/s3boto3.py", line 427, in _open
    f = S3Boto3StorageFile(name, mode, self)
  File "/usr/local/lib/python3.7/site-packages/storages/backends/s3boto3.py", line 107, in __init__
    self.obj.load()
  File "/usr/local/lib/python3.7/site-packages/boto3/resources/factory.py", line 505, in do_action
    response = action(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/boto3/resources/action.py", line 83, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 663, in _make_api_call
    operation_model, request_dict, request_context)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 682, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 135, in _send_request
    request, operation_model, context)
  File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 167, in _get_response
    request, operation_model)
  File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 200, in _do_get_response
    http_response = self._send(request)
  File "/usr/local/lib/python3.7/site-packages/botocore/endpoint.py", line 269, in _send
    return self.http_session.send(request)
  File "/usr/local/lib/python3.7/site-packages/botocore/httpsession.py", line 317, in send
    chunked=self._chunked(request.headers),
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/local/lib/python3.7/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.7/site-packages/botocore/awsrequest.py", line 92, in _send_request
    method, url, body, headers, *args, **kwargs)
  File "/usr/local/lib/python3.7/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/site-packages/botocore/awsrequest.py", line 119, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.7/site-packages/botocore/awsrequest.py", line 203, in send
    return super(AWSConnection, self).send(str)
  File "/usr/local/lib/python3.7/http/client.py", line 976, in send
    self.connect()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 200, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/base.py", line 200, in handle_abort
    self.cfg.worker_abort(self)
  File "/usr/local/etc/gunicorn/app.py", line 27, in worker_abort
    traceback.print_stack(file=debug_info)

[2021-09-02 16:21:02 +0100] [14] [INFO] Worker exiting (pid: 14)
[2021-09-02 15:21:02 +0000] [18] [INFO] Booting worker with pid: 18

igobranco avatar Sep 03 '21 10:09 igobranco

Hi @igobranco :wave:

Thanks for the report! Can you paste your gunicorn configuration? Is it similar to the default one we use [1]?

[1] https://github.com/openfun/richie/blob/master/docker/files/usr/local/etc/gunicorn/richie.py

jmaupetit avatar Sep 03 '21 15:09 jmaupetit

Hi @igobranco

Thanks for the report! Can you paste your gunicorn configuration? Is it similar to the default one we use [1]?

[1] https://github.com/openfun/richie/blob/master/docker/files/usr/local/etc/gunicorn/richie.py

Hi @jmaupetit Yes, it's similar.

igobranco avatar Sep 03 '21 15:09 igobranco

Particularly the logging configuration, right?

jmaupetit avatar Sep 06 '21 09:09 jmaupetit

Yes, the file is equal with the base richie.py, event the logging configuration.

NAU version after adding that additional code block: https://github.com/fccn/nau-richie-site-factory/blob/master/docker/files/usr/local/etc/gunicorn/app.py

igobranco avatar Sep 06 '21 11:09 igobranco

Sorry for the noise: I didn't notice that you already posted your configuration in the issue description... :sweat_smile:

jmaupetit avatar Sep 06 '21 12:09 jmaupetit