dd-trace-java

Add long running traces to flare report, allow flare files to be downloaded with JMX

Open deejgregor opened this issue 2 months ago • 4 comments

What Does This Do

This does two main things:

  1. Adds long running traces to the flare report.
  2. Allows flare dumps and individual files from flares to be downloaded via JMX.

There are some other small additions, as well, each in its own commit. If some of this isn't desirable and should be rebased out or should be split into a separate PR, I'm happy to do so--just let me know. I would really like to at least get the long running traces added to the flare report.

Motivation

While adding custom instrumentation to a complex, asynchronous application, we found it challenging to validate that all spans were end()ed during tests. dd.trace.debug=true and dd.trace.experimental.long-running.enabled=true could be used with some post-processing of debug logs, but this didn't work for our needs because the application breaks with that level of logging. When dd.trace.experimental.long-running.enabled=true is used, the long running traces are sent to Datadog's backend, but they are not searchable until they are finished, so we didn't have a good way to find them. This change gives us two ways to access the list of long running traces: a flare report or JMX.

I initially started by adding JMX MBeans to retrieve just the pending and long running traces and counters. Once I added the long running traces to the flare report for parity with pending traces, I realized that a more generic mechanism for getting flare details over JMX might be useful. After adding a TracerFlare MBean, this seemed like a far more valuable route, so I removed the code I had added for the pending/long running trace MBeans.

Additional Notes

An easy way to enable this for testing is to add these arguments to a JVM with the APM tracer:

    -Ddd.telemetry.jmx.enabled=true
    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.host=127.0.0.1
    -Dcom.sun.management.jmxremote.port=9010
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false

You can use this with jmxterm as shown in the examples below.

Example output:

$ echo "run -b datadog.flare:type=TracerFlare getFlareFile datadog.trace.agent.core.LongRunningTracesTracker long_running_traces.txt" |  \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
         -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent
[{"service":"pending-traces-test","name":"step-3","resource":"step-3","trace_id":1110088093037488208,"span_id":3740396906142869284,"parent_id":6982939151275616389,"start":1761670337688000209,"duration":0,"error":0,"metrics":{"step.number":3,"trace.number":1,"thread.id":30},"meta":{"thread.name":"Trace-1"}},{"service":"pending-traces-test","name":"step-2","resource":"step-2","trace_id":1110088093037488208,"span_id":6468860803773086654,"parent_id":6982939151275616389,"start":1761670337582715042,"duration":0,"error":0,"metrics":{"step.number":2,"trace.number":1,"thread.id":30},"meta":{"thread.name":"Trace-1"}},{"service":"pending-traces-test","name":"step-1","resource":"step-1","trace_id":1110088093037488208,"span_id":1210573307183346962,"parent_id":6982939151275616389,"start":1761670337477268167,"duration":0,"error":0,"metrics":{"step.number":1,"trace.number":1,"thread.id":30},"meta":{"thread.name":"Trace-1"}}]
$ echo "run -b datadog.flare:type=TracerFlare generateFullFlareZip" | \
    java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
        -jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent | \
    base64 -d > /tmp/flare.zip && \
    unzip -v /tmp/flare.zip
Archive:  /tmp/flare.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
      71  Defl:N       46  35% 10-28-2025 09:54 8963e853  flare_info.txt
      26  Defl:N       26   0% 10-28-2025 09:54 39f97d4e  tracer_version.txt
    9229  Defl:N     3316  64% 10-28-2025 09:54 f4c7920b  initial_config.txt
     487  Defl:N      231  53% 10-28-2025 09:54 f0284361  jvm_args.txt
      75  Defl:N       66  12% 10-28-2025 09:54 886a98a0  classpath.txt
     144  Defl:N       73  49% 10-28-2025 09:54 433c143d  library_path.txt
     307  Defl:N      170  45% 10-28-2025 09:54 773992bb  dynamic_config.txt
    1196  Defl:N      374  69% 10-28-2025 09:54 7396b38c  tracer_health.txt
      47  Defl:N       42  11% 10-28-2025 09:54 700f06af  span_metrics.txt
       0  Defl:N        2   0% 10-28-2025 09:54 00000000  pending_traces.txt
    2448  Defl:N      500  80% 10-28-2025 09:54 8b69071d  instrumenter_state.txt
      71  Defl:N       70   1% 10-28-2025 09:54 c84166ad  instrumenter_metrics.txt
     923  Defl:N      272  71% 10-28-2025 09:54 1f7f39aa  long_running_traces.txt
     213  Defl:N      130  39% 10-28-2025 09:54 eed91e78  dynamic_instrumentation.txt
       0  Defl:N        2   0% 10-28-2025 09:54 00000000  tracer.log
       0  Defl:N        2   0% 10-28-2025 09:54 00000000  jmxfetch.txt
--------          -------  ---                            -------
   15237             5322  65%                            16 files
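
For anyone who would rather script this than go through jmxterm, here is a minimal Java sketch of the same two calls. The MBean name, operation names, and arguments come from the commands above. The rest is an assumption on my part: that both operations take only String parameters and return a String, that generateFullFlareZip returns the zip Base64-encoded (as the `base64 -d` step suggests), and the class and helper names (FlareJmxClient, invoke, listEntries) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper class; not part of the PR.
final class FlareJmxClient {
  // MBean name taken from the jmxterm commands above.
  static final String FLARE_MBEAN = "datadog.flare:type=TracerFlare";

  /** Invokes an operation, assuming every parameter is a String and the result prints as text. */
  static String invoke(MBeanServerConnection conn, String operation, String... args)
      throws JMException, IOException {
    String[] signature = new String[args.length];
    Arrays.fill(signature, String.class.getName());
    Object result = conn.invoke(new ObjectName(FLARE_MBEAN), operation, args, signature);
    return String.valueOf(result);
  }

  /** Decodes a Base64-encoded zip and returns its entry names, roughly `base64 -d | unzip -v`. */
  static List<String> listEntries(String base64Zip) throws IOException {
    // The MIME decoder tolerates line breaks in the encoded text.
    byte[] zipBytes = Base64.getMimeDecoder().decode(base64Zip.trim());
    List<String> names = new ArrayList<>();
    try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
      for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
        names.add(e.getName());
      }
    }
    return names;
  }

  public static void main(String[] args) throws Exception {
    // Port 9010 matches -Dcom.sun.management.jmxremote.port above (no auth, no SSL).
    JMXServiceURL url =
        new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi");
    try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
      MBeanServerConnection conn = connector.getMBeanServerConnection();
      System.out.println(invoke(conn, "getFlareFile",
          "datadog.trace.agent.core.LongRunningTracesTracker", "long_running_traces.txt"));
      listEntries(invoke(conn, "generateFullFlareZip")).forEach(System.out::println);
    }
  }
}
```

Running main requires a JVM started with the JMX arguments listed earlier. listEntries can be tried on its own against any Base64-encoded zip.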

Outstanding items

  • [x] Add an integration test that exercises JMX functionality when dd.telemetry.jmx.enabled=true.
  • [x] Limit the number of long running traces added to the flare report, as is already done for the pending trace buffer (MAX_DUMPED_TRACES = 50).
  • [ ] Other updates from the list below?
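
To illustrate the cap in the second item, here is a rough sketch of limiting a dump to MAX_DUMPED_TRACES entries. Only the constant name and its value of 50 come from the note above. The DumpLimiter class, the limit method, and the choice to keep the first 50 items in iteration order are assumptions for illustration, not necessarily how the tracker selects traces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the actual tracker code may select traces differently.
final class DumpLimiter {
  // Same cap mentioned above for the pending trace buffer.
  static final int MAX_DUMPED_TRACES = 50;

  /** Returns at most MAX_DUMPED_TRACES items, keeping the first ones in iteration order. */
  static <T> List<T> limit(Iterable<T> traces) {
    List<T> dumped = new ArrayList<>();
    for (T trace : traces) {
      if (dumped.size() >= MAX_DUMPED_TRACES) {
        break; // stop early so a large backlog cannot bloat the flare
      }
      dumped.add(trace);
    }
    return dumped;
  }
}
```

Stopping at a fixed count keeps the flare size bounded even when many spans were never finished.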

This PR has a number of commits and I suggest reviewing commit-by-commit, paying special attention to the notes in bold below:

Note: I had a few fixups that I've merged into the above commits.

Contributor Checklist

Jira ticket: [PROJ-IDENT]

deejgregor Oct 28 '25 17:10

Jira card for context: APMS-17557

aw-dd Oct 29 '25 20:10

Hi DJ 👋 Thanks for your patience! Your notes and commit organization were really great for understanding this PR - I found them especially useful. I left two nit comments, but otherwise it looks good. Since this PR introduces some changes (e.g. keeping long running traces tracked in memory), I've brought it up for more sets of eyes ;). I'm out all of next week but will get back to you after if others don't beat me to it. Thanks again for the contribution!

sarahchen6 Nov 21 '25 14:11

Thanks, @sarahchen6 and @manuel-alvarez-alvarez! I'll address the few tweaks suggested. Updates coming shortly.

deejgregor Nov 21 '25 19:11

Fixups made, rebased on latest master, and force pushed.

deejgregor Nov 21 '25 22:11