
Intermittent `RuntimeError: the memalloc module was not started` error

Open · cyface opened this issue 2 years ago

Which version of dd-trace-py are you using?

ddtrace==0.57.0

What is the result that you get?

RuntimeError: the memalloc module was not started


What is the result that you expected?

No errors.

This seems to be happening a few times a day.

We have tried setting DD_PROFILING_HEAP_ENABLED=False and DD_PROFILING_MEMALLOC=0 in the environment, but the errors continue to appear.

Configuration in Django:

import os
from ddtrace import config, tracer

# DataDog Setup
tracer.configure(hostname=os.environ.get("HOST_IP"))
tracer.configure(enabled=True)
tracer.set_tags(
    {"env": os.environ.get("ENVIRONMENT"), "namespace": os.environ.get("NAMESPACE")}
)
config.django["analytics_enabled"] = True
config.django["cache_service_name"] = "xxx-cache"
config.django["database_service_name_prefix"] = "xxx"
config.django["distributed_tracing_enabled"] = True
config.django["instrument_middleware"] = True
config.django["service_name"] = "xxx"

cyface avatar Jan 10 '22 18:01 cyface
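
A quick sanity check for reports like the one above (a minimal sketch; the variable names are the ones mentioned in the comment, everything else is an assumption): log the profiler-related environment variables from inside the worker, e.g. at the top of wsgi.py, to confirm they actually reach the process.

import logging
import os

# Minimal sketch: log the profiler-related env vars as seen by the worker
# process itself, to rule out the settings simply not being propagated.
for var in ("DD_PROFILING_ENABLED", "DD_PROFILING_HEAP_ENABLED", "DD_PROFILING_MEMALLOC"):
    logging.getLogger(__name__).info("%s=%r", var, os.environ.get(var))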

We are facing the same issue (same ddtrace version) with a FastAPI app. I'd be happy to share any logs/config you need.

arthurio avatar Jan 12 '22 06:01 arthurio

Could you share:

  1. Which web server you're running?
  2. Its configuration?
  3. Whether you're using gevent?

jd avatar Jan 17 '22 12:01 jd

I am using nginx feeding into gunicorn/gevent with a regular proxy setup. I'm not sure which configuration details would be helpful to you?

cyface avatar Jan 17 '22 16:01 cyface

We have 2 different servers (not behind nginx).

flask

  1. Gunicorn
  2. --worker-class=gthread
  3. No

fastapi

  1. Gunicorn
  2. --worker-class=uvicorn.workers.UvicornWorker
  3. No

arthurio avatar Jan 18 '22 00:01 arthurio
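
For reference, a rough gunicorn.conf.py sketch covering the worker classes mentioned so far (gevent, gthread, and uvicorn); the file name, bind address, and worker counts are assumptions, only the worker_class values come from the comments above.

# gunicorn.conf.py -- sketch only; pick one worker_class per service.
bind = "0.0.0.0:8000"
workers = 4

# cyface: gevent workers behind an nginx proxy
# worker_class = "gevent"

# arthurio, flask service: threaded workers
worker_class = "gthread"
threads = 4

# arthurio, fastapi service: ASGI via uvicorn workers
# worker_class = "uvicorn.workers.UvicornWorker"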

+1 We have the same issue with Gunicorn + Django

roniemartinez avatar Jan 24 '22 14:01 roniemartinez

+1, same in 0.59.1. We are using Sanic with profiling enabled: DD_PROFILING_ENABLED=true

datadog~=0.44.0
ddtrace~=0.59.1
requests~=2.27.1
sanic~=19.12.2

visaals avatar Mar 18 '22 20:03 visaals

Currently having the same issue.

FlavioAlexander avatar Jun 28 '22 16:06 FlavioAlexander

Same issue, with Flask/gunicorn/gevent, using ddtrace 1.2.1:


RuntimeError: the memalloc module was not started
  File "ddtrace/internal/periodic.py", line 70, in run
    self._target()
  File "ddtrace/profiling/collector/__init__.py", line 42, in periodic
    for events in self.collect():
  File "ddtrace/profiling/collector/memalloc.py", line 145, in collect
    events, count, alloc_count = _memalloc.iter_events()

luislew avatar Oct 20 '22 23:10 luislew

Same issue with Flask/gunicorn/gevent as well, using ddtrace 1.4.4.

rgilkey avatar Nov 14 '22 15:11 rgilkey

This might be due to gevent monkey patching not being done properly. Is everyone using DD_GEVENT_PATCH_ALL=1?

jd avatar Nov 15 '22 14:11 jd
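
If I understand the flag correctly, DD_GEVENT_PATCH_ALL=1 makes ddtrace-run apply gevent monkey patching before anything else is loaded. A rough sketch of the equivalent manual ordering, purely to illustrate the concern (not a confirmed fix for this issue):

# Sketch: gevent's monkey patching should happen before ddtrace starts any
# background threads (such as the profiler's periodic collector).
from gevent import monkey

monkey.patch_all()  # patch the stdlib first

import ddtrace.profiling.auto  # noqa: E402  # then start the profiler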

@jd we're working on adding DD_GEVENT_PATCH_ALL as recommended by Datadog support. We also need to upgrade some other deps along the way to make it work. I'll post back if that gets us resolved.

rgilkey avatar Nov 15 '22 15:11 rgilkey

@rgilkey did adding DD_GEVENT_PATCH_ALL resolve the issue? I'm facing the same problem with ddtrace v1.7.5

jorgemo-fever avatar Feb 20 '23 08:02 jorgemo-fever

We are currently facing the same issue, and it seems to be a bit random; it only appears sporadically... We use Flask + Gunicorn and ddtrace==1.0.1

LPauzies avatar Feb 22 '23 09:02 LPauzies

'Me too'ing this for Gunicorn, Uvicorn, FastAPI with ddtrace 1.10.2

swingingsimian avatar Apr 06 '23 14:04 swingingsimian

@swingingsimian can I please check with you whether you are seeing

RuntimeError: the memalloc module was not started

or

RuntimeError: the memalloc module is already started

or both?

P403n1x87 avatar Apr 17 '23 09:04 P403n1x87

I'm facing this issue too; weirdly, I'm only seeing it in our staging cluster and not in production. The only difference is more replicas and higher CPU limits in our production pods.

FastAPI, Gunicorn

gunicorn = "^20.1.0"
uvicorn = "^0.17.0"

[2023-04-18 21:41:23 +0000] [29] [INFO] Booting worker with pid: 29
Failed to start collector MemoryCollector(status=<ServiceStatus.STOPPED: 'stopped'>, recorder=Recorder(default_max_events=16384, max_events={<class 'ddtrace.profiling.collector.stack_event.StackSampleEvent'>: 30000, <class 'ddtrace.profiling.collector.stack_event.StackExceptionSampleEvent'>: 15000, <class 'ddtrace.profiling.collector.memalloc.MemoryAllocSampleEvent'>: 1920, <class 'ddtrace.profiling.collector.memalloc.MemoryHeapSampleEvent'>: None}), _max_events=16, max_nframe=64, heap_sample_size=1048576, ignore_profiler=False), disabling.
Traceback (most recent call last):
  File "/opt/pysetup/.venv/lib/python3.11/site-packages/ddtrace/profiling/profiler.py", line 266, in _start_service
    col.start()
  File "/opt/pysetup/.venv/lib/python3.11/site-packages/ddtrace/internal/service.py", line 58, in start
    self._start_service(*args, **kwargs)
  File "/opt/pysetup/.venv/lib/python3.11/site-packages/ddtrace/profiling/collector/memalloc.py", line 108, in _start_service
    _memalloc.start(self.max_nframe, self._max_events, self.heap_sample_size)
RuntimeError: the memalloc module is already started

lfvarela avatar Apr 18 '23 22:04 lfvarela

We are also seeing this behavior as of yesterday in our development cluster:

Traceback (most recent call last):
  File "/app/python/<app>/wsgi_image.binary.runfiles/pypi_ddtrace/site-packages/ddtrace/profiling/profiler.py", line 266, in _start_service
    col.start()
  File "/app/python/<app>/wsgi_image.binary.runfiles/pypi_ddtrace/site-packages/ddtrace/internal/service.py", line 58, in start
    self._start_service(*args, **kwargs)
  File "/app/python/<app>/wsgi_image.binary.runfiles/pypi_ddtrace/site-packages/ddtrace/profiling/collector/memalloc.py", line 108, in _start_service
    _memalloc.start(self.max_nframe, self._max_events, self.heap_sample_size)
RuntimeError: the memalloc module is already started
[2023-04-18 22:45:49 +0000] [72] [INFO] Booting worker with pid: 72

I believe it was triggered after our node groups cycled out in dev; the containers don't want to come back online. In this case many of them are stuck in CrashLoopBackOff for a period of time. It does eventually stabilize... maybe this is an API issue on Datadog's end? We are running ddtrace 1.12.0

joesteffee avatar Apr 18 '23 22:04 joesteffee

Thanks for the reports @lfvarela @joesteffee. This exception is handled and so the only issue caused is that memory allocations won't be profiled. I think I have found the source of the problem, but can I please confirm with you that your services work and that this is just undesirable noise in your logs?

P403n1x87 avatar Apr 19 '23 09:04 P403n1x87
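
Until a fix lands, one way to quiet the noise while keeping the rest of the profiler would presumably be to disable just the memory collector via the DD_PROFILING_MEMALLOC variable mentioned earlier in the thread. A sketch, assuming the variable is read when ddtrace is imported:

import os

# Sketch: turn off only the memory allocation collector, keeping the other
# profiling collectors enabled. Set this before ddtrace is imported, or export
# it in the process environment instead.
os.environ.setdefault("DD_PROFILING_MEMALLOC", "false")

import ddtrace.profiling.auto  # noqa: E402  # profiler starts without memalloc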

The new reports are a different issue that is a manifestation of a regression. The fix is in #5586.

P403n1x87 avatar Apr 19 '23 10:04 P403n1x87

Thanks! To confirm: our servers couldn't even start. Weirdly, by increasing the CPU limits on our servers we stopped seeing the issue.

lfvarela avatar Apr 20 '23 06:04 lfvarela