apm-agent-python
Django + Gunicorn integration: finish the current transaction on worker timeouts
Is your feature request related to a problem? Please describe.
We are in the process of replacing our Sentry integration and noticed that the occasional endpoint hits the 30-second gunicorn timeout, which causes the child worker process to get killed with SystemExit(1):
gunicorn/workers/base.py ~ln 203
def handle_abort(self, sig, frame):
    self.alive = False
    self.cfg.worker_abort(self)
    sys.exit(1)
The error/event shows up in Sentry but not in APM. I believe this is because the parent gunicorn process hard-kills the child process (kill -9, SIGKILL), so the current APM transaction is never ended/finished.
Describe the solution you'd like
For APM to finish the transaction, along with the error as an uncaught exception, automagically.
Describe alternatives you've considered
Looking at gunicorn's hook for worker termination and doing something like this to handle it manually:
https://docs.gunicorn.org/en/stable/settings.html#worker-abort
gunicorn.conf.py
from elasticapm import traces

def worker_abort(worker):
    if span := traces.execution_context.get_span():
        span.end()
    if transaction := traces.execution_context.get_transaction():
        transaction.end()
If this is a bad idea, let me know; it's just something I thought up off the cuff.
I was having some weirdness trying to close the transaction at that moment, so I ended up just logging a message to it; that way we at least see an error in APM, though the transaction still shows as a 200 success.
import elasticapm

def worker_abort(worker):
    if (client := elasticapm.get_client()) and elasticapm.get_transaction_id():
        # If APM is enabled and there is a current transaction, log a message
        # to it that the worker aborted.
        client.capture_message("Gunicorn aborted worker")
Yeah we'll have to do some testing/research to figure out how to handle this properly. But thanks for the report, definitely a case we want to handle gracefully.
Are there any updates on that?
No updates yet, sorry.
Thanks for the update @basepi. Is it scoped/prioritised or planned on your roadmap? :)
This issue has been pretty quiet, so it hasn't been on our radar. But thanks for bumping it to the top.
Part of the problem is I'm not exactly sure how to fix this. I wonder how Sentry deals with this issue? Because I'm not sure I can actually flush the agent transport on a SIGKILL.
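For what it's worth, gunicorn first sends SIGABRT on a worker timeout (which is what invokes handle_abort and the worker_abort hook) and only escalates to SIGKILL if the worker still hasn't exited, so there is a brief window to end and flush from inside the hook. A rough, untested sketch along those lines, assuming Client.close() flushes queued events on shutdown:

gunicorn.conf.py

import elasticapm

def worker_abort(worker):
    client = elasticapm.get_client()
    if client and elasticapm.get_transaction_id():
        # Record the timeout as an error attached to the current transaction.
        client.capture_message("Gunicorn aborted worker (timeout)")
        # End the transaction so it becomes a complete, sendable document.
        client.end_transaction(result="aborted")
        # Closing the client shuts down the transport, which should flush
        # queued events before handle_abort calls sys.exit(1).
        client.close()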
@basepi, is there any way to configure the agent to "send everything it has" to Elastic after X amount of seconds? Then I could configure the agent to stop collecting after 29 seconds, for example, since gunicorn kills the threads at 30 seconds. The problem, however, is that I can't quite use the APM max timeout setting, since I would still want it to record all the transactions my Celery tasks may have (which are definitely longer than 30 seconds).
I assume you mean the api_request_time config? The problem is that it will not send a transaction which has not ended, so it likely would miss those long-running Celery transactions.
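For reference, api_request_time is set like any other agent option; a minimal sketch for a Django app, with a made-up service name:

settings.py

ELASTIC_APM = {
    "SERVICE_NAME": "my-service",  # hypothetical
    # Maximum time between sends of queued data to the APM Server.
    "API_REQUEST_TIME": "10s",
}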
Oh dang it, I must have misunderstood many of these configuration options. So since api_request_time defaults to 10s, it will not send transactions from my Celery tasks that take ~40s, for example? That setting will simply "quit" APM recording after 10s?
That's not how it works; apologies if I wasn't clear.
If it has been greater than api_request_time since our last send to the APM Server, we will send any data that has been queued. Every time we send data, this timer is reset.
However, we can't send data that isn't complete, so we won't send a transaction document until that transaction has ended.
Celery jobs create separate transactions, which are then tied together via distributed tracing. So it doesn't matter if the original thread that created the job is killed after its transaction has ended -- the celery transaction is separate. I think that's what you were worried about, but please correct me if I'm wrong.
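To make the separation concrete, here is a rough sketch using the agent's manual tracing API; the service name, route, and task name are made up, and the Celery integration wires this up for you automatically:

import elasticapm

client = elasticapm.Client(service_name="demo")

# Web process: the request transaction ends when the response is sent,
# regardless of what happens to the worker process afterwards.
client.begin_transaction("request")
traceparent = elasticapm.get_trace_parent_header()  # handed off to the task
client.end_transaction("GET /enqueue", "HTTP 2xx")

# Celery worker: a separate transaction, tied to the first only by the
# trace id carried in the traceparent header.
client.begin_transaction("celery", trace_parent=elasticapm.trace_parent_from_string(traceparent))
# ... ~40 seconds of work; nothing is queued for this transaction yet ...
client.end_transaction("my_task", "success")
# Only now is the transaction document queued, eligible to go out on the
# next api_request_time send.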