[BUG]: Memory Leak for temporal worker processes
Tracer Version(s)
5.51.0
Node.js Version(s)
22.13.1
Bug Report
Related: https://github.com/DataDog/dd-trace-js/issues/5554
We've been experiencing memory leaks in our Temporal worker processes. The last known "good" version of dd-trace for us is v5.28.0, though we haven't exhaustively tried every patch version since then.
We're using the https://www.npmjs.com/package/@temporalio/interceptors-opentelemetry package to expose metrics from Temporal. Example annotation:
```yaml
podAnnotations:
  ad.datadoghq.com/energy-device-service.checks: |
    {
      "openmetrics": {
        "init_configs": {},
        "instances": [
          {
            "openmetrics_endpoint": "http://%%host%%:9464/metrics",
            "metrics": [
              "temporal_workflow_failed",
              "temporal_workflow_completed",
              "temporal_workflow_endtoend_latency"
            ]
          }
        ]
      }
    }
```
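The check scrapes http://%%host%%:9464/metrics, i.e. the Prometheus endpoint served from the worker process. A simplified sketch of that exporter side (not our exact code; shown with the @opentelemetry/exporter-prometheus defaults, and the Temporal interceptor wiring omitted):

```ts
import { metrics } from '@opentelemetry/api';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { MeterProvider } from '@opentelemetry/sdk-metrics';

// Serves an OpenMetrics/Prometheus endpoint on the port the Datadog
// openmetrics check scrapes (9464 and /metrics are the exporter defaults).
const exporter = new PrometheusExporter({ port: 9464 });

// Register the exporter as a metric reader. On older @opentelemetry/sdk-metrics
// versions this is meterProvider.addMetricReader(exporter) instead of `readers`.
const meterProvider = new MeterProvider({ readers: [exporter] });
metrics.setGlobalMeterProvider(meterProvider);
```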
The process also runs a Koa server exposing a couple of HTTP health-check endpoints, and uses the Prisma ORM with the regular node-postgres driver.
We don't observe memory leaks in our pure REST (Koa) deployments or GCP Pub/Sub deployments, leading me to believe that the leak is specifically related to the use of the Temporal SDK.
I've answered "Webpack" for bundling because I believe Temporal bundles the workflow code with webpack internally; we don't configure or invoke any bundling ourselves.
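For reference, this is the standard pattern where the SDK does that bundling itself (a simplified sketch, not our exact code; the module paths and task queue name are placeholders):

```ts
import { Worker } from '@temporalio/worker';
import * as activities from './activities'; // placeholder module

async function run() {
  // Worker.create bundles the workflow code found at workflowsPath with the
  // SDK's built-in webpack configuration; we never invoke webpack directly.
  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows'), // placeholder module
    activities,
    taskQueue: 'example-task-queue', // placeholder
  });

  await worker.run();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});
```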
Reproduction Code
No response
Error Logs
No response
Tracer Config
No response
Operating System
No response
Bundling
Webpack
@mnahkies thank you for your report! We are going to look into this with high priority!
@BridgeAR, it's been over a month since this was marked as high priority. Has there been any update?
We're still facing persistent memory leaks, and it's seriously affecting our systems. It's frustrating to see no progress or communication after this long.
We are currently trying to gather more information about each individual case of this memory leak issue. Some of what we need can be shared publicly on GitHub, but some would require a private channel, so ideally I would recommend opening a support ticket. Please feel free to share the ticket number in this issue or send it directly to me on our public Slack so that I can expedite the escalation process.
In the support ticket, please provide the following information:
- If the issue appeared after an upgrade, what version did it first appear in?
  - Please be as precise as possible about the exact version in which the issue first appeared. This will allow us to isolate the code change responsible. For example, reporting that 5.0.0 works but 5.50.0 doesn't is not as helpful as knowing that 5.1.2 works but 5.1.3 doesn't.
  - Since we had a different issue with runtime metrics in 5.41.1, please make sure to disable them with `DD_RUNTIME_METRICS_ENABLED=false` before any bisecting, to avoid false positives.
  - If disabling runtime metrics resolves the issue, let us know as well, as that would mean the leak is in runtime metrics.
- If the issue happens with all other products disabled except tracing, the issue is likely in one of our integrations. I would recommend disabling individual integrations to isolate the issue to one of them. Integrations can be fully disabled with, for example, `DD_TRACE_INSTRUMENTATIONS_DISABLED=express,mysql` and `DD_TRACE_PLUGINS_DISABLED=express,mysql`. You can find the full list of integrations enabled for the service in the startup logs (which can be enabled as described below).
- Do you have any other services that do or don't have the issue?
  - If yes, are there any obvious differences between the ones that do and the ones that don't?
- Please provide the following if possible:
  - Your `package.json`
  - Startup logs, which can be output by starting the service with `DD_TRACE_STARTUP_LOGS=true`
  - [optional] Debug logs, which can be output by starting the service with `DD_TRACE_DEBUG=true`.
    - Note: this is extremely verbose, so enable it with caution, ideally in a dev or staging environment.
  - [optional] Two heap dumps, one taken 1h after starting the service and another 2h after.
    - If you can provide even more heap dumps, for example after waiting another hour and calling `gc()` a few times, that's even better. The `gc` function can be exposed by starting the service with `NODE_OPTIONS='--expose-gc'`, and it needs to be called more than once for a full GC to happen. (See the sketch after this list for one way to capture snapshots.)
  - Any other information you deem relevant about your environment or the application itself.
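For the heap dumps, one option is a temporary signal handler in the worker process using Node's built-in `v8.writeHeapSnapshot` (a minimal sketch, assuming the process is started with `NODE_OPTIONS='--expose-gc'`; snapshots can equally be captured via the inspector):

```ts
import { writeHeapSnapshot } from 'node:v8';

// Trigger a snapshot with: kill -USR2 <pid>
process.on('SIGUSR2', () => {
  // global.gc is only present when the process runs with --expose-gc.
  const gc = (globalThis as unknown as { gc?: () => void }).gc;

  // Call gc() a few times so a full GC happens before the snapshot is taken.
  if (gc) {
    for (let i = 0; i < 3; i++) gc();
  }

  // writeHeapSnapshot() auto-generates a *.heapsnapshot file name when none
  // is passed and returns the path it wrote to.
  const file = writeHeapSnapshot();
  console.log(`Heap snapshot written to ${file}`);
});
```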
If you know of a version that works for you and doesn't have the memory leak, please keep using it for now until we update this issue with a resolution.
Thank you for your patience and understanding as we're investigating this issue.