[BUG]: Memory Leak for temporal worker processes
Tracer Version(s)
5.51.0
Node.js Version(s)
22.13.1
Bug Report
Related: https://github.com/DataDog/dd-trace-js/issues/5554
We've been experiencing memory leaks in our Temporal worker processes. The last known "good" version of dd-trace for us is v5.28.0, though we haven't exhaustively tried every patch version since then.
We're using the https://www.npmjs.com/package/@temporalio/interceptors-opentelemetry package to expose metrics from Temporal. Example annotation:
```yaml
podAnnotations:
  ad.datadoghq.com/energy-device-service.checks: |
    {
      "openmetrics": {
        "init_configs": {},
        "instances": [
          {
            "openmetrics_endpoint": "http://%%host%%:9464/metrics",
            "metrics": [
              "temporal_workflow_failed",
              "temporal_workflow_completed",
              "temporal_workflow_endtoend_latency"
            ]
          }
        ]
      }
    }
```
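The check scrapes http://%%host%%:9464/metrics, i.e. the Prometheus endpoint served from the worker process. A simplified sketch of that exporter side (not our exact code; shown with the @opentelemetry/exporter-prometheus defaults, and the Temporal interceptor wiring omitted):

```ts
import { metrics } from '@opentelemetry/api';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { MeterProvider } from '@opentelemetry/sdk-metrics';

// Serves an OpenMetrics/Prometheus endpoint on the port the Datadog
// openmetrics check scrapes (9464 and /metrics are the exporter defaults).
const exporter = new PrometheusExporter({ port: 9464 });

// Register the exporter as a metric reader. On older @opentelemetry/sdk-metrics
// versions this is meterProvider.addMetricReader(exporter) instead of `readers`.
const meterProvider = new MeterProvider({ readers: [exporter] });
metrics.setGlobalMeterProvider(meterProvider);
```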
The process also runs a Koa server exposing a couple of HTTP health-check endpoints, and uses the Prisma ORM with the regular node-postgres driver.
We don't observe memory leaks in our pure REST (Koa) deployments or GCP Pub/Sub deployments, leading me to believe that the leak is specifically related to the use of the Temporal SDK.
I've answered "Webpack" for bundling because I believe Temporal bundles the workflow code with webpack internally; we don't configure or invoke any bundling ourselves.
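For reference, this is the standard pattern where the SDK does that bundling itself (a simplified sketch, not our exact code; the module paths and task queue name are placeholders):

```ts
import { Worker } from '@temporalio/worker';
import * as activities from './activities'; // placeholder module

async function run() {
  // Worker.create bundles the workflow code found at workflowsPath with the
  // SDK's built-in webpack configuration; we never invoke webpack directly.
  const worker = await Worker.create({
    workflowsPath: require.resolve('./workflows'), // placeholder module
    activities,
    taskQueue: 'example-task-queue', // placeholder
  });

  await worker.run();
}

run().catch((err) => {
  console.error(err);
  process.exit(1);
});
```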
Reproduction Code
No response
Error Logs
No response
Tracer Config
No response
Operating System
No response
Bundling
Webpack
@mnahkies thank you for your report! We are going to look into this with high priority!
@BridgeAR, it's been over a month since this was marked as high priority. Has there been any update?
We're still facing persistent memory leaks, and it's seriously affecting our systems. It's frustrating to see no progress or communication after this long.
We are currently trying to gather more information about each individual case of this memory leak issue. Some of what we need can be shared publicly on GitHub, but some would require a private channel, so ideally I would recommend opening a support ticket. Please feel free to share the ticket number in this issue or send it directly to me on our public Slack so that I can expedite the escalation process.
In the support ticket, please provide the following information:
- If the issue appeared after an upgrade, what version did it first appear in?
  - Please be as precise as possible about the exact version in which the issue first appeared. This will allow us to isolate the code change responsible. For example, reporting that 5.0.0 works but 5.50.0 doesn't is not as helpful as knowing that 5.1.2 works but 5.1.3 doesn't.
  - Since we had a different issue with runtime metrics in 5.41.1, please make sure to disable them with `DD_RUNTIME_METRICS_ENABLED=false` before any bisecting, to avoid false positives.
  - If disabling runtime metrics resolves the issue, let us know as well, as that would mean the leak is in runtime metrics.
- If the issue happens with all other products disabled except tracing, the issue is likely in one of our integrations. I would recommend disabling individual integrations to isolate the issue to one of them. Integrations can be fully disabled with, for example, `DD_TRACE_INSTRUMENTATIONS_DISABLED=express,mysql` and `DD_TRACE_PLUGINS_DISABLED=express,mysql`. You can find the full list of integrations enabled for the service in the startup logs (which can be enabled as described below).
- Do you have any other services that do or don't have the issue?
  - If yes, are there any obvious differences between the ones that do and the ones that don't?
- Please provide the following if possible:
  - Your `package.json`
  - Startup logs, which can be output by starting the service with `DD_TRACE_STARTUP_LOGS=true`
  - [optional] Debug logs, which can be output by starting the service with `DD_TRACE_DEBUG=true`.
    - Note: this is extremely verbose, so enable it with caution, ideally in a dev or staging environment.
  - [optional] Two heap dumps, one taken 1h after starting the service and another 2h after.
    - If you can provide even more heap dumps, for example after waiting another hour and calling `gc()` a few times, that's even better. The `gc` function can be exposed by starting the service with `NODE_OPTIONS='--expose-gc'`, and it needs to be called more than once for a full GC to happen. (See the sketch after this list for one way to capture snapshots.)
  - Any other information you deem relevant about your environment or the application itself.
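For the heap dumps, one option is a temporary signal handler in the worker process using Node's built-in `v8.writeHeapSnapshot` (a minimal sketch, assuming the process is started with `NODE_OPTIONS='--expose-gc'`; snapshots can equally be captured via the inspector):

```ts
import { writeHeapSnapshot } from 'node:v8';

// Trigger a snapshot with: kill -USR2 <pid>
process.on('SIGUSR2', () => {
  // global.gc is only present when the process runs with --expose-gc.
  const gc = (globalThis as unknown as { gc?: () => void }).gc;

  // Call gc() a few times so a full GC happens before the snapshot is taken.
  if (gc) {
    for (let i = 0; i < 3; i++) gc();
  }

  // writeHeapSnapshot() auto-generates a *.heapsnapshot file name when none
  // is passed and returns the path it wrote to.
  const file = writeHeapSnapshot();
  console.log(`Heap snapshot written to ${file}`);
});
```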
If you know of a version that works for you and doesn't have the memory leak, please keep using it for now until we update this issue with a resolution.
Thank you for your patience and understanding as we're investigating this issue.