High CPU utilization causing Kubernetes pod scaling with ddtrace > 2.3.0
Summary of problem
We have noticed that upgrading ddtrace to any version above 2.3.0 results in a significant increase in CPU utilization, which leads to the maximum number of replicas being deployed.
For instance, our Kubernetes application is configured with an auto-scaling maximum of 36 replicas. Prior to the upgrade, our stage environment would typically use only 6-8 pods while idle. However, post-upgrade, we are reaching the upper limit of 36 replicas.
This unexpected behavior suggests that there may be a spike in resource usage introduced in versions above 2.3.0. We would like to understand the cause of this increased resource consumption and seek a solution to optimize it.
Additionally, we updated to datadog_lambda==5.83.0 to be compatible with ddtrace==2.3.0.
(Possibly a red herring: we also noticed that calls to POST /telemetry/proxy/api/v2/apmtelemetry increase on versions above 2.3.0.)
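If the telemetry traffic needs to be ruled out, one option is to turn off instrumentation telemetry before ddtrace loads. This is a minimal sketch, assuming the application imports ddtrace itself rather than being launched via ddtrace-run (in which case the variable belongs in the pod spec instead):

```python
import os

# Assumption: the app imports ddtrace itself (e.g. `import ddtrace.auto`).
# With ddtrace-run the tracer is loaded before this code executes, so set the
# variable in the deployment/pod spec instead.
os.environ.setdefault("DD_INSTRUMENTATION_TELEMETRY_ENABLED", "false")

import ddtrace  # noqa: E402  (must be imported after the env var is set)
```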
Datadog screenshots (Kubernetes pods in an idle state); the panels showed:

On ddtrace 2.7.5:
sum:kubernetes_state.deployment.replicas_available{env:... ,service:...}
APM: POST /telemetry/proxy/api/v2/apmtelemetry

On ddtrace 2.3.0:
sum:kubernetes_state.deployment.replicas_available{env:... ,service:...}
APM: POST /telemetry/proxy/api/v2/apmtelemetry
Which version of dd-trace-py are you using?
We originally bumped to 2.7.5, but have since downgraded to 2.3.0. We have also tried the latest, 2.8.5.
Which version of pip are you using?
pip 24.0
Spike with:
Any version above ddtrace 2.3.0
pip freeze
aioboto3==9.5.0
aiobotocore==2.2.0
aiodns==3.0.0
aiohttp==3.9.5
aiohttp-retry==2.4.5
aioitertools==0.8.0
aioredis==1.3.1
aioredis-cluster==1.5.2
aiosignal==1.2.0
ansible==9.1.0
ansible-core==2.16.4
asgiref==3.8.0
asn1crypto==1.5.1
async-kinesis==1.1.5
async-timeout==4.0.2
asyncio-throttle==1.0.2
atomicwrites==1.4.0
attrs==20.3.0
aws-kinesis-agg==1.1.3
aws-xray-sdk==2.6.0
awscli==1.22.76
bcrypt==3.2.0
black==24.4.2
blinker==1.7.0
boto==2.45.0
boto3==1.21.21
botocore==1.24.21
Brotli==1.0.9
brotlipy==0.7.0
bytecode==0.15.1
CacheControl==0.12.6
cachetools==4.1.1
cattrs==22.2.0
certifi==2023.7.22
cffi==1.16.0
chardet==3.0.4
charset-normalizer==2.0.8
cityhash==0.4.7
click==8.1.7
colorama==0.4.1
coverage==7.0.4
cryptography==42.0.5
dal-admin-filters==1.1.0
datadog==0.41.0
datadog_lambda==5.91.0
ddsketch==2.0.4
ddtrace==2.7.4
decorator==4.4.2
defusedxml==0.7.1
Deprecated==1.2.14
deprecation==2.1.0
Django==4.2.11
django-auditlog==3.0.0
django-autocomplete-light==3.11.0
django-cleanup==6.0.0
django-cors-headers==3.7.0
django-csp==3.7
django-discover-runner==1.0
django-extensions==3.1.5
django-filter==2.4.0
django-health-check==3.18.1
django-hosts==5.1
django-json-widget==2.0.1
django-nested-admin==3.4.0
django-redis==4.11.0
django-rest-serializer-field-permissions==4.1.0
django-role-permissions==2.2.0
django-rq==2.10.2
django-ses==3.5.0
django-snowflake==4.2.2
django-storages==1.12.3
django-webpack-loader==0.5.0
django_reverse_admin==2.9.6
djangorestframework==3.14.0
djangorestframework-csv==2.1.0
djangorestframework-gis==0.18
dnspython==2.6.1
docutils==0.15.2
dogslow==1.2
drf-flex-fields==0.9.8
drf-jwt==1.19.2
elementpath==2.2.3
envier==0.5.1
et-xmlfile==1.1.0
execnet==1.9.0
fakeredis==2.7.1
filelock==3.12.2
frozenlist==1.4.1
future==0.18.3
geojson==2.4.1
googleapis-common-protos==1.53.0
grpcio==1.62.0
grpcio-health-checking==1.62.0
grpcio-reflection==1.62.0
grpcio-status==1.62.0
gunicorn==22.0.0
hiredis==2.3.2
httplib2==0.19.0
idna==3.7
importlib-metadata==6.11.0
importlib-resources==5.8.0
iniconfig==2.0.0
intervaltree==3.1.0
isort==5.13.2
Jinja2==3.1.3
jmespath==0.10.0
json-stream==2.3.2
json-stream-rs-tokenizer==0.4.25
jsonpickle==3.0.3
jsonschema==4.5.1
magicattr==0.1.5
MarkupSafe==2.1.1
more-itertools==8.6.0
msgpack==1.0.0
multidict==5.1.0
mypy-extensions==1.0.0
nplusone==1.0.0
openpyxl==3.0.7
opentelemetry-api==1.23.0
orjson==3.9.15
packaging==24.0
paramiko==3.4.0
pathspec==0.12.1
pillow==10.3.0
platformdirs==3.8.1
pluggy==1.0.0
protobuf==4.21.7
psycopg2==2.9.9
psycopg2-binary==2.9.9
py-dateutil==2.2
pyasn1==0.4.8
pycares==4.2.0
pycodestyle==2.5.0
pycountry==22.3.5
pycparser==2.20
PyJWT==2.4.0
PyNaCl==1.5.0
pyOpenSSL==24.0.0
pyparsing==2.4.7
pyrsistent==0.18.1
pytest==7.2.0
pytest-cov==4.0.0
pytest-django==4.5.2
pytest-shard==0.1.2
pytest-xdist==3.1.0
python-dateutil==2.8.0
python-json-logger==0.1.8
python-memcached==1.59
python-monkey-business==1.0.0
pytz==2020.4
PyYAML==5.3.1
redis==3.5.3
redis-py-cluster==2.1.3
requests==2.31.0
resolvelib==0.5.4
rq==1.14.0
rsa==4.7
s3transfer==0.5.0
setproctitle==1.1.10
Shapely==1.6.4
simplejson==3.14.0
six==1.16.0
snowflake-connector-python==3.7.1
sortedcontainers==2.4.0
splunk-handler==2.0.7
sqlparse==0.5.0
tenacity==6.2.0
tomlkit==0.12.1
typing_extensions==4.7.1
unicodecsv==0.14.1
urllib3==1.26.18
Werkzeug==3.0.1
whitenoise==6.0.0
wrapt==1.14.0
xmlschema==1.2.5
xmltodict==0.13.0
yarl==1.9.4
zipp==3.18.1
How can we reproduce your problem?
I'm not sure how you can replicate the issue on your end. We are using Datadog tooling, and we have established metrics that continuously monitor the service and report results whether it is idle or handling traffic.
What is the result that you get?
High CPU utilization causes Kubernetes pods to scale up to the maximum number of replicas even when idle, on ddtrace > 2.3.0.
What is the result that you expected?
CPU utilization and Kubernetes pod scaling should be only as much as required, on ddtrace > 2.3.0.
Thank you for reporting this, @hemantgir. Could you share all relevant environment variables set in the app environment? This will help us understand what bits of Datadog functionality are enabled and disabled in this case.
Thank you for your response. Please find the list of environment variables below:
DD_DBM_PROPAGATION_MODE: disabled
DD_DJANGO_USE_HANDLER_RESOURCE_FORMAT: True
DD_ENV: stage
DD_LOGS_INJECTION: True
DD_SERVICE: Django
DD_TRACE_SAMPLE_RATE: 1
DD_TRACE_SAMPLING_RULES: [{"sample_rate": 1}]
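For completeness, a small diagnostic sketch (not specific to this setup) to confirm which DD_* variables the worker process actually sees inside a pod, so the runtime configuration can be compared with the deployment spec:

```python
import os

# Print every DD_* variable visible to this process; run inside the container
# (e.g. via `kubectl exec`) to verify what the tracer is actually configured with.
for key in sorted(os.environ):
    if key.startswith("DD_"):
        print(f"{key}={os.environ[key]}")
```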
Did you ever figure this out? @hemantgir
Hi there,
I am impacted by this issue as well, with a Python service running on Kubernetes. We are upgrading from 2.7.2 and were able to go up to 2.8.0 without the CPU spike hitting us.
We tried 2.14.2, 2.10.0, and 2.9.2; all of these versions caused the initial CPU spike.
Any updates on this? It pretty much blocks us from upgrading ddtrace any further.
I accidentally closed this issue and don't have permission to reopen it. Can someone please reopen it? @emmettbutler @DataDog @Kyle-Verhoog
What Python version do you use?
We are using Python 3.10.14.
We were seeing very minor CPU spikes until we upgraded from 2.7.2 to 2.14.2, 2.10.0, or 2.9.2, after which the spike was much bigger and lasted much longer.
Going back to 2.8.0 or 2.8.1 sent it back to 2.7.2 levels.
Also seeing this after going from 2.7.4 to 2.21.0.
In case it's connected, I filed this bug the other day: https://github.com/DataDog/dd-trace-py/issues/12370
Any updates? We see this issue as well.
We ran into a similar issue with high memory and CPU usage on application startup after enabling ddtrace.
To isolate the cause, we disabled all optional product plugins using the following env vars:
export DD_APPSEC_ENABLED=false
export DD_IAST_ENABLED=false
export DD_EXCEPTION_REPLAY_ENABLED=false
export DD_DYNAMIC_INSTRUMENTATION_ENABLED=false
export DD_INSTRUMENTATION_TELEMETRY_ENABLED=false
export DD_ERROR_TRACKING_ENABLED=false
export DD_LIVE_DEBUGGING_ENABLED=false
export DD_REMOTE_CONFIGURATION_ENABLED=false
export DD_CODE_ORIGIN_ENABLED=false
export DD_SYMBOL_DATABASE_ENABLED=false
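As a hypothetical way to centralize these toggles, a gunicorn.conf.py sketch is shown below. It assumes ddtrace is imported by the application inside the workers; with ddtrace-run the tracer is loaded before this file runs, so the pod spec remains the right place for the variables:

```python
# gunicorn.conf.py
import os

# Gunicorn executes this file in the master process before forking workers,
# so the variables are in place before the app (and ddtrace) is imported.
_DD_PRODUCT_TOGGLES = {
    "DD_APPSEC_ENABLED": "false",
    "DD_IAST_ENABLED": "false",
    "DD_EXCEPTION_REPLAY_ENABLED": "false",
    "DD_DYNAMIC_INSTRUMENTATION_ENABLED": "false",
    "DD_INSTRUMENTATION_TELEMETRY_ENABLED": "false",
    "DD_ERROR_TRACKING_ENABLED": "false",
    "DD_LIVE_DEBUGGING_ENABLED": "false",
    "DD_REMOTE_CONFIGURATION_ENABLED": "false",
    "DD_CODE_ORIGIN_ENABLED": "false",
    "DD_SYMBOL_DATABASE_ENABLED": "false",
}

for _name, _value in _DD_PRODUCT_TOGGLES.items():
    # setdefault keeps any value already set in the pod spec authoritative
    os.environ.setdefault(_name, _value)
```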
However, the main issue appeared to be a CPU spike during startup, caused by Gunicorn, Flask, or our own application code (still under investigation).
Our pods had no CPU limits, and during deployments multiple pods would roll out in parallel, spiking CPU and causing the HPA to initiate a scale-out, which led to further CPU spikes and overwhelmed the nodes.
We fixed the issue by:
• Setting appropriate CPU requests and limits
• Adjusting our rolling update strategy
• Tuning HPA scaling behavior
After these changes, the service is running stably with ddtrace enabled.
We are experiencing the same issue in Kubernetes with uvicorn; the problem gets worse when enabling more features. Any update on this?
So interestingly, we've been experiencing the same issue.
Python 3.12, ddtrace version 2.9.2
I found a profile related to an incident we recently had where requests were timing out. We have both tracing and, more recently, CPU profiling configured. See the screenshot: it looked like most of the CPU time was spent in the on_span_start and on_span_finish methods of the ddtrace library. Can someone add a bit more color on how those methods are used and whether something there might be causing the CPU spikes?
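As far as I understand, on_span_start and on_span_finish are span-processor hooks that run once for every span, so what shows up there is roughly the per-span overhead. A rough micro-benchmark sketch using the public tracer.trace() API to compare that overhead across ddtrace versions locally before rolling them out:

```python
import time

from ddtrace import tracer

# Create and finish N spans in a tight loop; on_span_start/on_span_finish run
# once per span, so per-span cost is approximately what this exercises.
N = 100_000
start = time.perf_counter()
for _ in range(N):
    with tracer.trace("bench.span"):
        pass
elapsed = time.perf_counter() - start
print(f"{N} spans in {elapsed:.2f}s ({N / elapsed:,.0f} spans/s)")
```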
This issue has been automatically closed after a period of inactivity. If it's a feature request, it has been added to the maintainers' internal backlog and will be included in an upcoming round of feature prioritization. Please comment or reopen if you think this issue was closed in error.