
Memory corruption and SIGSEGV/SIGABRT with patching of psycopg

Open kravnaut opened this issue 2 years ago • 26 comments

Summary of problem

A few weeks ago we started seeing sporadic SIGSEGV/SIGABRT crashes in our Celery deployment, as well as in a gunicorn app that uses SQLAlchemy over psycopg to talk to Postgres. We kept trying to narrow down the related change in our infrastructure/code. We added faulthandler and noticed that most of the crashes occurred within the tracing of the psycopg call. That in itself is not very surprising, since we use Postgres heavily through SQLAlchemy+psycopg. However, after we disabled tracing specifically for psycopg with DD_PATCH_MODULES="psycopg:false", the errors vanished in both the Celery and gunicorn deployments.
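
For readers following along, here is a minimal sketch of the two debugging steps described above: enabling faulthandler to capture crash tracebacks, and excluding psycopg from instrumentation. The patch_all(psycopg=False) override is an alternative to the DD_PATCH_MODULES environment variable and assumes you call patch_all() yourself rather than launching with ddtrace-run.

import faulthandler

# Dump Python tracebacks to stderr on SIGSEGV/SIGABRT/SIGFPE/SIGBUS,
# equivalent to running with PYTHONFAULTHANDLER=1.
faulthandler.enable(all_threads=True)

# Rough equivalent of DD_PATCH_MODULES="psycopg:false": enable
# auto-instrumentation but leave the psycopg integration unpatched.
from ddtrace import patch_all

patch_all(psycopg=False)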

Which version of dd-trace-py are you using?

We were on ddtrace==1.4.4 when we first saw the issue. We've since upgraded to ddtrace==1.14.0 and it made no difference: same errors.

Which version of pip are you using?

23.1.2

Which libraries and their versions are you using?

Quite a few; we can share the full list through the customer support channel if needed.

How can we reproduce your problem?

Probably not easy to reproduce: the issue occurs with some frequency under consistent load. It only happens in our production environment, where there is enough volume and variety of requests.

What is the result that you get?

Ultimately we get SIGSEGV/SIGABRT with various tracebacks that typically end somewhere in ddtrace. Some are generic errors implying malloc failures:

File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2773 in all
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2916 in _iter
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1717 in execute
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1710 in _execute_20
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/sql/elements.py", line 334 in _execute_on_connection
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1577 in _execute_clauseelement
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1905 in _execute_context
File "/usr/local/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736 in do_execute
File "/usr/local/lib/python3.10/site-packages/ddtrace/contrib/dbapi/__init__.py", line 155 in execute
File "/usr/local/lib/python3.10/site-packages/ddtrace/contrib/psycopg/cursor.py", line 14 in _trace_method
File "/usr/local/lib/python3.10/site-packages/ddtrace/contrib/dbapi/__init__.py", line 91 in _trace_method
File "/usr/local/lib/python3.10/site-packages/ddtrace/tracer.py", line 842 in trace
File "/usr/local/lib/python3.10/site-packages/ddtrace/tracer.py", line 724 in _start_span
File "/usr/local/lib/python3.10/site-packages/ddtrace/provider.py", line 129 in activate

Some errors clearly implying memory corruption in random places:

...
  File "/usr/local/lib/python3.10/uuid.py", line 268, in __hash__
    return hash(self.int)
AttributeError: 'UUID' object has no attribute 'int'

There were also a few instances of double free or corruption (fasttop) and malloc(): unsorted double linked list corrupted, and similar garden-variety heap corruption messages.

What is the result that you expected?

Not this 😆

kravnaut avatar Jun 08 '23 00:06 kravnaut

Thank you for the detailed writeup, @kravnaut.

My first instinct is that this could be related to https://github.com/DataDog/dd-trace-py/issues/5993.

emmettbutler avatar Jun 12 '23 19:06 emmettbutler

Hello 👋 We are experiencing similar issues after upgrading to Python 3.11. On Python 3.10 everything works without issues, but once we upgrade to Python 3.11 we see regular SIGSEGVs on websocket servers based on the daphne library. I tried to debug it further with faulthandler, by inspecting a core dump, and by toggling ddtrace's debug mode, but unfortunately didn't find anything useful there.

We believe the issue is caused by ddtrace because:

  1. Turning off ddtrace resolves the problem. Turning it on again results in regular segfaults (roughly at least 1 segfault per hour in our case)
  2. We never encountered the issue, and didn't manage to successfully reproduce it, in local/development environments where ddtrace is disabled.

I'm happy to share more context about our specific setup, so please let me know if there's any information you'd find useful. Below is some generic information about versions, etc.

OS: Debian GNU/Linux 11 (bullseye)
Kernel: 5.4.249-163.359.amzn2.aarch64
Python version: 3.11.4
ddtrace version: 1.17.3

Also, let me know if I should turn it into a separate issue as this one is referencing psycopg specifically in the title.

pgrzesik avatar Aug 09 '23 15:08 pgrzesik

I can confirm we're experiencing the same issue since migrating our Lambda functions to Python 3.11.

What we also noticed is that the segfault only happens if there was an unhandled exception during function execution and the Lambda handler function has a decorator. It can be any decorator, even a simple no-op one. Our debugging efforts only led us to find that the issue occurs somewhere deep within ddtrace, and setting DD_TRACE_ENABLED to false is the only "solution" we've found so far.

mLupine avatar Aug 30 '23 11:08 mLupine
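
For anyone trying to reproduce the scenario described above, the shape of the handler is roughly the following (a sketch with hypothetical names: a decorator that does nothing and a handler that raises an unhandled exception):

import functools

def noop(func):
    """A decorator that does nothing beyond forwarding the call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@noop
def handler(event, context):
    # Per the report above, an unhandled exception here is what precedes
    # the segfault when the function is instrumented with ddtrace on Python 3.11.
    raise RuntimeError("boom")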

We're seeing the same thing in multiple services across multiple versions of python 3.11 and dd-trace. We don't use psycopg in any of those services.

WarpRat avatar Sep 21 '23 21:09 WarpRat

@kravnaut @mLupine @pgrzesik @WarpRat May I check with you whether the profiler is also enabled? We are aware of similar issues that might be caused by the profiler on CPython 3.11. So if you have the profiler enabled, would disabling it solve your problem? We expect this to be the case, but would love your confirmation 🙏 .

P403n1x87 avatar Sep 22 '23 10:09 P403n1x87
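
For anyone checking their own setup: the profiler is usually controlled by the DD_PROFILING_ENABLED environment variable when launching with ddtrace-run, or started explicitly in code. A rough sketch of what an explicit setup looks like, so you can search your codebase for it (based on the documented Profiler API):

# Profiling is on if the service is launched with
#   DD_PROFILING_ENABLED=true ddtrace-run python app.py
# or if the code starts the profiler itself:

from ddtrace.profiling import Profiler

profiler = Profiler()
profiler.start()

# The import-time variant achieves the same thing:
# import ddtrace.profiling.auto  # noqa: F401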

Hey @P403n1x87 - yes, we came to the same conclusion when talking with support directly. Unfortunately, turning off a useful feature is not the way to go in the long run. Recently we checked whether the issue had been solved in Python 3.11.5, but that didn't help. Our current hypothesis is that the bug might actually be in the CPython implementation, but reproducing it is not easy.

pgrzesik avatar Sep 22 '23 10:09 pgrzesik

Unfortunately, turning off a useful feature is not the way to go in the long run.

Absolutely. We're actively working on our side to get this resolved ASAP.

Our current hypothesis is that the bug might actually be in the CPython implementation, but reproducing it is not easy.

Indeed this is also what our investigation is leaning towards, but we are committed to getting to the bottom of this so that we can provide a product that works. Please bear with us 🙏 .

P403n1x87 avatar Sep 22 '23 10:09 P403n1x87

@P403n1x87 We've been seeing segfaults with Python 3.11 and ddtrace frequently in our Django application. It took about a week of configuration testing to determine it was the profiler. Since disabling the profiler we've had no issues.

Please disclose these things in the known issues in the future so we can minimize the time we need to troubleshoot configuration issues.

nat45928 avatar Sep 22 '23 12:09 nat45928

@nat45928 I'm working with @P403n1x87 on this issue--I acknowledge that diagnosing the original cause for a segmentation fault can be a time-intensive experience. I think it's doubly frustrating because an observability tool managed by an organization like Datadog isn't (yet!) everyone's first suspicion for a segfault. Sorry for that. This is a new and fast-moving issue; in our other investigation, the understanding was that it was highly circumstantial to the customer's setup. I think this thread is demonstrating this is not the case. I'll raise the idea of marking this as a known issue--appreciate the thought!

sanchda avatar Sep 22 '23 15:09 sanchda

:wave: Another dev from Datadog's Profiling team here.

Several people in this thread have identified some extremely useful circumstantial evidence for the defect, but it would help to get a little more context on the source of the crash. If anyone has the time/flexibility to do so, performing a quick diagnostic operation would help us remediate this issue more quickly.

Instructions are here: https://github.com/DataDog/dd-trace-py/issues/6701#issuecomment-1717659333. Posting similar information as the responder in that other thread would be glorious. Thank you!

sanchda avatar Sep 22 '23 16:09 sanchda

@sanchda Before discovering this thread we were working towards being able to capture core dumps for further troubleshooting. Would privately sharing one of those be sufficient, or is there additional information you'd get from adding that binary to our containers?

WarpRat avatar Sep 22 '23 19:09 WarpRat
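
As a general aside for anyone else trying to capture core dumps from inside a container, the core size limit can be raised from within the process using the standard library; this is a sketch and does not cover kernel.core_pattern or container security settings, which still decide whether and where a core file actually lands:

import resource

# Allow this process to write core files of unlimited size.
# Raising the soft limit above the existing hard limit requires privileges,
# and the kernel's core_pattern controls where the core file is written.
resource.setrlimit(resource.RLIMIT_CORE,
                   (resource.RLIM_INFINITY, resource.RLIM_INFINITY))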

@WarpRat, a corefile is better than my suggestion. That'd be wonderful.

Typically, submitting corefiles etc is something we do through our Support organization. You can submit a ticket at https://help.datadoghq.com/hc/en-us/requests/new. This is slightly annoying--I apologize for the extra step, but it would really, really help. If you mention Profiling (and possibly link this conversation), it should ping my team. This will give you a confidential channel to discuss the particular defect at your organization, plus it'll allow me to look at your organization's Datadog installation directly, which can have a lot of useful circumstantial information.

In order for me to make full use of a corefile, it would help to have some additional information:

  • container base image (or precise distro version)
  • machine architecture (x86/aarch64)
  • output of pip freeze or similar

Many, many thanks! If this is just too much extra work for you to do, I understand--let me know and we'll try to figure out the next-best-way of iterating.

sanchda avatar Sep 22 '23 19:09 sanchda
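
For anyone putting together the details requested above, the three items can be collected with a short script like this (a sketch using only the standard library; adjust paths for your image):

import platform
import subprocess
import sys

# Machine architecture (x86_64 / aarch64) and the exact Python build.
print("arch:", platform.machine())
print("python:", sys.version)

# Distro / base image details, if /etc/os-release exists in the container.
try:
    with open("/etc/os-release") as f:
        print(f.read())
except FileNotFoundError:
    print("no /etc/os-release found")

# Installed packages, equivalent to running `pip freeze`.
print(subprocess.run([sys.executable, "-m", "pip", "freeze"],
                     capture_output=True, text=True).stdout)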

Updating the issue to assign it to myself and to mark the suspected faulty component as Profiling. We're still not 100% sure that Profiling is the fundamental cause, but we're the side that will own coordination on this end.

Thank you for your patience, everyone!

sanchda avatar Sep 22 '23 19:09 sanchda

Hi everyone,

Still working on this one. We're investigating a few angles related to wrapping. Is anyone able to test setting the WRAPT_DISABLE_EXTENSIONS=true environment variable? I'm maybe ~35% confident this will help, but it should be a quick and easy test.

Thanks

sanchda avatar Sep 26 '23 13:09 sanchda
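
If you want to confirm the flag actually took effect before running a longer test, one way is to check which module wrapt's ObjectProxy comes from. This relies on how wrapt selects its implementation at import time, so treat it as a heuristic based on wrapt internals rather than a supported API:

import os
import wrapt

# With WRAPT_DISABLE_EXTENSIONS=true the pure-Python implementation
# ("wrapt.wrappers") should be in use instead of the C extension
# ("wrapt._wrappers").
print("WRAPT_DISABLE_EXTENSIONS =", os.environ.get("WRAPT_DISABLE_EXTENSIONS"))
print("ObjectProxy implementation:", wrapt.ObjectProxy.__module__)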

I just tried the WRAPT_DISABLE_EXTENSIONS setting in my environment and still saw segmentation faults (~30min intervals).

I'm on K8s (Microk8s, Ubuntu 22.04.1, Linux 5.15.0-58-generic) with a Python Django workload running via Daphne.

nat45928 avatar Sep 26 '23 14:09 nat45928

@nat45928 many thanks for trying that out! That eliminates one potential angle.

sanchda avatar Sep 26 '23 15:09 sanchda

Hi again, everyone.

I have an x86_64+glibc release of ddtrace available, if anyone is interested in trying it out here. I had to host it out of a personal repo (sorry!), since we don't have a great workflow for sharing out-of-band custom builds.

Here's the URL. You'll have to consult the docs of your particular package management system to see how to consume this. https://github.com/sanchda/test_gcr/releases/download/v1.16.1_innerapi/ddtrace-2.1.0.dev34+g9bd1fb9bb-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Another thing that would be useful is setting the PYTHONFAULTHANDLER environment variable to true. This will surface the Python context at time-of-crash, which will give us some hints to go by.

I'm not very optimistic that this will fix the problem for anyone, but it might yield some useful diagnostic information. Thank you for your patience.

sanchda avatar Sep 26 '23 15:09 sanchda

@kravnaut @mLupine @pgrzesik @WarpRat May I check with you whether the profiler is also enabled? We are aware of similar issues that might be caused by the profiler on CPython 3.11. So if you have the profiler enabled, would disabling it solve your problem? We expect this to be the case, but would love your confirmation 🙏 .

We are not using the profiler. And as an update: after the Python 3.11 upgrade we also started getting sporadic segmentation faults...

kravnaut avatar Sep 29 '23 17:09 kravnaut

Thanks for confirming, @kravnaut. Your original issue is likely related to tracing. Anybody else using OpenAI and tracing is affected by an issue that should be resolved by #7044. There is still a chance that the same PR fixes the original issue, since the fix is to a core component that other tracing integrations also use, so it would be worth trying the upcoming release that includes the fix. If that doesn't fix it, the tracing team will keep investigating.

P403n1x87 avatar Sep 29 '23 18:09 P403n1x87

We are using ddtrace 2.0.2 with profiling enabled, and see segfaults. With profiling disabled they go away.

We're not using OpenAI; we are using psycopg2. We'll try to get PYTHONFAULTHANDLER set to see if we can get y'all some logs.

shughes-uk avatar Oct 12 '23 17:10 shughes-uk

Hi - chiming in here to note that we're also seeing segfaults across both our instances running celery and those serving web requests via uvicorn (ddtrace 2.0.2). Turning off profiling appears to resolve the issue.

jlucas91 avatar Oct 16 '23 21:10 jlucas91

Still occurring on 2.1.1. We're having trouble getting PYTHONFAULTHANDLER output; the way it's logged doesn't seem to play well with how dd-agent collects logs from Docker containers.

shughes-uk avatar Oct 16 '23 22:10 shughes-uk
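
If the fault handler output is getting mangled by container log collection, one workaround (a sketch; the path is illustrative, and this is not dd-agent-specific advice) is to point faulthandler at a dedicated file instead of stderr:

import faulthandler

# Keep a reference to the file object; faulthandler writes to its file
# descriptor at crash time, so it must not be garbage-collected or closed.
crash_log = open("/tmp/python-crash.log", "w")  # illustrative path
faulthandler.enable(file=crash_log, all_threads=True)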

@jlucas91 @shughes-uk Thanks for the extra reports. Would you be able to test whether turning the tracer off (or just the psycopg integration) while keeping the profiler on also resolves the issue?

P403n1x87 avatar Oct 17 '23 14:10 P403n1x87
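
For anyone willing to run that experiment, a minimal sketch of the suggested configuration (environment-variable form in the comments, with a rough programmatic equivalent below; the exact setup will depend on how your service is launched):

# With ddtrace-run, one way to express the experiment is:
#   DD_PROFILING_ENABLED=true DD_TRACE_ENABLED=false ddtrace-run python app.py
# or, keeping tracing on and only leaving psycopg unpatched:
#   DD_PROFILING_ENABLED=true DD_PATCH_MODULES="psycopg:false" ddtrace-run python app.py

# Rough programmatic equivalent when not using ddtrace-run:
from ddtrace import tracer
from ddtrace.profiling import Profiler

tracer.enabled = False  # turn off tracing while leaving the profiler running
Profiler().start()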

I can try that, though our production environment runs with profiling off and doesn't encounter these segfaults.

I managed to get the output from one of the segfaults and it looks like the relevant part is

File "/opt/coiled/env/lib/python3.11/site-packages/ddtrace/profiling/collector/__init__.py", line 42 in periodic

The only other thread not just sitting in threading.wait was our code that checks memory usage; it was on a line that opens a file handle to /sys/fs/cgroup/memory.current so we can get a quick look at what the cgroup says our memory usage is. Not sure if y'all interact with that or not.

shughes-uk avatar Nov 20 '23 18:11 shughes-uk
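
For context, that kind of memory check is typically just a small synchronous file read; a sketch of what it usually looks like (cgroup v2 path shown, function name is illustrative):

def cgroup_memory_current_bytes(path="/sys/fs/cgroup/memory.current"):
    """Return this cgroup's current memory usage in bytes (cgroup v2)."""
    with open(path) as f:
        return int(f.read().strip())

print(cgroup_memory_current_bytes())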

@kravnaut @mLupine @pgrzesik @WarpRat May I check with you whether the profiler is also enabled? We are aware of similar issues that might be caused by the profiler on CPython 3.11. So if you have the profiler enabled, would disabling it solve your problem? We expect this to be the case, but would love your confirmation 🙏 .

Sorry for the late reply here. We are (were) using profiling and can confirm that disabling it resolves the segfault issue. It has been an acceptable workaround, but we're definitely eager to re-enable profiling when we can. We're also using fastapi/uvicorn for all the services where we see this. This looked promising, but we're on 3.11, not 3.12 - will that fix potentially address this issue on 3.11 too?

WarpRat avatar Dec 05 '23 19:12 WarpRat

@WarpRat many thanks for your feedback and patience. The PR you mentioned, which we recently merged, affects CPython 3.11 too, so it is definitely worth a shot. We currently have work in progress to further improve our support for the latest CPython releases (3.11 and 3.12), so we expect to reach a point where this and similar issues are resolved with confidence in the coming months.

P403n1x87 avatar Dec 05 '23 19:12 P403n1x87

This issue has been automatically closed after a period of inactivity. If it's a feature request, it has been added to the maintainers' internal backlog and will be included in an upcoming round of feature prioritization. Please comment or reopen if you think this issue was closed in error.

github-actions[bot] avatar May 08 '24 00:05 github-actions[bot]