
Django + Gunicorn integration: finish the current transaction on worker timeouts.

Open · m-standfuss opened this issue 3 years ago · 10 comments

Is your feature request related to a problem? Please describe. We are in the process of replacing our Sentry integration and noticed that when the occasional endpoint hits the 30-second gunicorn timeout, the child process gets killed with a SystemExit(1) (gunicorn/workers/base.py, ~line 203):

    def handle_abort(self, sig, frame):
        self.alive = False
        self.cfg.worker_abort(self)
        sys.exit(1)

The error/event shows up in Sentry but not in APM. I believe this is because when the parent gunicorn process does a hard kill on the child process (kill -9, SIGKILL), the current APM transaction isn't ended/finished.

Describe the solution you'd like For APM to automagically finish the transaction and record the error as an uncaught exception.

Describe alternatives you've considered Looking at gunicorn's hook for worker termination and doing something like this to handle it manually:

https://docs.gunicorn.org/en/stable/settings.html#worker-abort gunicorn.conf.py

from elasticapm import traces


def worker_abort(worker):
    # Note: execution_context is internal to the agent, not part of its
    # documented public API.
    if span := traces.execution_context.get_span():
        span.end()

    if transaction := traces.execution_context.get_transaction():
        transaction.end()

If this is a bad idea, let me know; it's just something I thought of off the cuff.

m-standfuss avatar Mar 08 '22 21:03 m-standfuss

Was having some weirdness trying to close the transaction at that moment. Ended up just logging a message to it so at least we can see an error in APM; the transaction is still showing as a 200 success though.

import elasticapm


def worker_abort(worker):
    if (client := elasticapm.get_client()) and elasticapm.get_transaction_id():
        # If APM is enabled and there is a current transaction, log a message to it that the worker aborted.
        client.capture_message("Gunicorn aborted worker")
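
A possible tweak to the above (just a sketch, untested in this abort path): mark the in-flight transaction as failed before capturing the message, using the agent's public set_transaction_result/set_transaction_outcome helpers, so it at least doesn't read as a success.

import elasticapm


def worker_abort(worker):
    if (client := elasticapm.get_client()) and elasticapm.get_transaction_id():
        # Mark the in-flight transaction as failed so it doesn't report as a
        # success, then attach a message explaining why.
        elasticapm.set_transaction_result("HTTP 5xx", override=True)
        elasticapm.set_transaction_outcome("failure", override=True)
        client.capture_message("Gunicorn aborted worker")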

m-standfuss avatar Mar 09 '22 16:03 m-standfuss

Yeah we'll have to do some testing/research to figure out how to handle this properly. But thanks for the report, definitely a case we want to handle gracefully.

basepi avatar Mar 16 '22 15:03 basepi

Are there any updates on this?

leikoilja avatar Oct 06 '23 11:10 leikoilja

No updates yet, sorry.

basepi avatar Oct 09 '23 18:10 basepi

Thanks for the update @basepi. Has it been scoped/prioritised or planned on your roadmap? :)

leikoilja avatar Oct 11 '23 09:10 leikoilja

This issue has been pretty quiet, so it hasn't been on our radar. But thanks for bumping it to the top.

Part of the problem is I'm not exactly sure how to fix this. I wonder how Sentry deals with this issue, because I'm not sure I can actually flush the agent transport on a SIGKILL.
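
One untested idea (purely a sketch; the SIGKILL may still win the race): gunicorn's worker_abort hook fires on the abort that precedes the hard kill, so ending the transaction there and then calling the client's close(), which shuts down the transport and flushes anything queued, might get the data out in time.

import elasticapm


def worker_abort(worker):
    if client := elasticapm.get_client():
        # End the in-flight transaction so it gets queued for sending.
        client.end_transaction(result="HTTP 5xx")
        # close() shuts down the transport and flushes queued events; whether
        # this completes before the parent's SIGKILL is the open question.
        client.close()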

basepi avatar Oct 11 '23 18:10 basepi

@basepi, is there any way to configure the agent to "send everything it has" to Elastic after X seconds? Then I could configure the agent to stop collecting after 29 seconds, for example, since at 30 seconds gunicorn will kill the threads. However, the problem is that I can't quite use the APM max timeout setting, since I would still want it to record all the transactions Celery tasks may have (which are definitely longer than 30 seconds).

leikoilja avatar Oct 12 '23 13:10 leikoilja

I assume you mean the api_request_time config? The problem is that it will not send a transaction which has not ended, so it likely would miss those long-running Celery transactions.

basepi avatar Oct 12 '23 21:10 basepi

Oh dang it, I must have misunderstood many of these configuration options. So since api_request_time defaults to 10s, it will not send transactions from my Celery tasks that take ~40 seconds, for example? That setting will simply "quit" APM recording after 10s?

leikoilja avatar Oct 16 '23 06:10 leikoilja

That's not how it works; apologies if I wasn't clear.

If it has been greater than api_request_time since our last send to the APM Server, we will send any data that has been queued. Every time we send data, this timer is reset.

However, we can't send data that isn't complete, so we won't send a transaction document until that transaction has ended.

Celery jobs create separate transactions, which are then tied together via distributed tracing. So it doesn't matter if the original thread that created the job is killed after its transaction has ended -- the celery transaction is separate. I think that's what you were worried about, but please correct me if I'm wrong.
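
For reference, a minimal sketch of where that setting lives in a Django project, assuming the usual ELASTIC_APM settings dict (the service name and server URL are placeholders):

# settings.py -- illustrative values only
ELASTIC_APM = {
    "SERVICE_NAME": "my-django-app",
    "SERVER_URL": "http://localhost:8200",
    # Queued (finished) events are sent to the APM Server at least this often.
    # Transactions that have not ended yet -- e.g. a long-running Celery
    # task -- are not sent until they finish, regardless of this setting.
    "API_REQUEST_TIME": "10s",
}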

basepi avatar Oct 17 '23 18:10 basepi