Add long running traces to flare report, allow flare files to be downloaded with JMX
What Does This Do
This does two main things:
- Adds long running traces to the flare report.
- Allows flare dumps and individual files from flares to be downloaded with JMX.
There are some other small additions as well, each in its own commit. If any of this isn't desirable and should be rebased out or split into a separate PR, I'm happy to do so--just let me know. I would really like to at least get the long running traces added to the flare report.
Motivation
While adding custom instrumentation to a complex, asynchronous application, we found it challenging to validate during tests that every span was end()ed. dd.trace.debug=true and dd.trace.experimental.long-running.enabled=true can be combined with some post-processing of the debug logs, but that didn't work for our needs because the application breaks with that level of logging. When dd.trace.experimental.long-running.enabled=true is used, the long running traces are sent to Datadog's backend, but they are not searchable until they are finished, so we didn't have a good way to find them. This change gives us two ways to access the list of long running traces: a flare report or JMX.
I started by adding JMX MBeans to retrieve just the pending and long running traces and counters. Once I added the long running traces to the flare report for parity with pending traces, I realized that a more generic mechanism for getting flare details over JMX might be useful. After adding a TracerFlare MBean, this seemed like a far more valuable route, so I removed the code I had added for the pending/long running trace MBeans.
Additional Notes
An easy way to enable this for testing is to add these arguments to a JVM with the APM tracer:
-Ddd.telemetry.jmx.enabled=true
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.host=127.0.0.1
-Dcom.sun.management.jmxremote.port=9010
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
You can use this with jmxterm as shown in the examples below.
Example output:
$ echo "run -b datadog.flare:type=TracerFlare getFlareFile datadog.trace.agent.core.LongRunningTracesTracker long_running_traces.txt" | \
java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
-jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent
[{"service":"pending-traces-test","name":"step-3","resource":"step-3","trace_id":1110088093037488208,"span_id":3740396906142869284,"parent_id":6982939151275616389,"start":1761670337688000209,"duration":0,"error":0,"metrics":{"step.number":3,"trace.number":1,"thread.id":30},"meta":{"thread.name":"Trace-1"}},{"service":"pending-traces-test","name":"step-2","resource":"step-2","trace_id":1110088093037488208,"span_id":6468860803773086654,"parent_id":6982939151275616389,"start":1761670337582715042,"duration":0,"error":0,"metrics":{"step.number":2,"trace.number":1,"thread.id":30},"meta":{"thread.name":"Trace-1"}},{"service":"pending-traces-test","name":"step-1","resource":"step-1","trace_id":1110088093037488208,"span_id":1210573307183346962,"parent_id":6982939151275616389,"start":1761670337477268167,"duration":0,"error":0,"metrics":{"step.number":1,"trace.number":1,"thread.id":30},"meta":{"thread.name":"Trace-1"}}]
$ echo "run -b datadog.flare:type=TracerFlare generateFullFlareZip" | \
java --add-exports jdk.jconsole/sun.tools.jconsole=ALL-UNNAMED \
-jar jmxterm-1.0.4-uber.jar -l localhost:9010 -n -v silent | \
base64 -d > /tmp/flare.zip && \
unzip -v /tmp/flare.zip
Archive: /tmp/flare.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
71 Defl:N 46 35% 10-28-2025 09:54 8963e853 flare_info.txt
26 Defl:N 26 0% 10-28-2025 09:54 39f97d4e tracer_version.txt
9229 Defl:N 3316 64% 10-28-2025 09:54 f4c7920b initial_config.txt
487 Defl:N 231 53% 10-28-2025 09:54 f0284361 jvm_args.txt
75 Defl:N 66 12% 10-28-2025 09:54 886a98a0 classpath.txt
144 Defl:N 73 49% 10-28-2025 09:54 433c143d library_path.txt
307 Defl:N 170 45% 10-28-2025 09:54 773992bb dynamic_config.txt
1196 Defl:N 374 69% 10-28-2025 09:54 7396b38c tracer_health.txt
47 Defl:N 42 11% 10-28-2025 09:54 700f06af span_metrics.txt
0 Defl:N 2 0% 10-28-2025 09:54 00000000 pending_traces.txt
2448 Defl:N 500 80% 10-28-2025 09:54 8b69071d instrumenter_state.txt
71 Defl:N 70 1% 10-28-2025 09:54 c84166ad instrumenter_metrics.txt
923 Defl:N 272 71% 10-28-2025 09:54 1f7f39aa long_running_traces.txt
213 Defl:N 130 39% 10-28-2025 09:54 eed91e78 dynamic_instrumentation.txt
0 Defl:N 2 0% 10-28-2025 09:54 00000000 tracer.log
0 Defl:N 2 0% 10-28-2025 09:54 00000000 jmxfetch.txt
-------- ------- --- -------
15237 5322 65% 16 files
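The same two operations can also be called from a small Java client instead of jmxterm. The following is a minimal sketch and not part of this PR: it assumes generateFullFlareZip returns a base64-encoded String (which the base64 -d step above suggests) and that getFlareFile takes two String arguments. The FlareJmxClient class and its helper names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical client; only the ObjectName and operation names come from the PR description.
final class FlareJmxClient {
  static final String OBJECT_NAME = "datadog.flare:type=TracerFlare";

  // Decodes the base64 text into raw zip bytes (the MIME decoder tolerates line breaks).
  static byte[] decodeFlare(String base64) {
    return Base64.getMimeDecoder().decode(base64.trim());
  }

  // Lists the entry names of an in-memory zip, similar to `unzip -v`.
  static List<String> listEntries(byte[] zip) throws IOException {
    List<String> names = new ArrayList<>();
    try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
      for (ZipEntry e = in.getNextEntry(); e != null; e = in.getNextEntry()) {
        names.add(e.getName());
      }
    }
    return names;
  }

  // Assumption: the operation takes no arguments and returns a base64 String.
  static byte[] fetchFlareZip(MBeanServerConnection conn) throws Exception {
    Object result =
        conn.invoke(new ObjectName(OBJECT_NAME), "generateFullFlareZip", new Object[0], new String[0]);
    return decodeFlare((String) result);
  }

  // Assumption: both parameters (reporter class name, file name) are Strings.
  static String fetchFlareFile(MBeanServerConnection conn, String reporter, String file)
      throws Exception {
    String[] signature = {String.class.getName(), String.class.getName()};
    Object result =
        conn.invoke(new ObjectName(OBJECT_NAME), "getFlareFile", new Object[] {reporter, file}, signature);
    return (String) result;
  }

  // Connects to a JVM started with the jmxremote flags above, e.g. hostPort = "localhost:9010".
  static void printFlare(String hostPort) throws Exception {
    JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://" + hostPort + "/jmxrmi");
    try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
      MBeanServerConnection conn = connector.getMBeanServerConnection();
      System.out.println(
          fetchFlareFile(
              conn, "datadog.trace.agent.core.LongRunningTracesTracker", "long_running_traces.txt"));
      System.out.println(listEntries(fetchFlareZip(conn)));
    }
  }
}
```

Calling FlareJmxClient.printFlare("localhost:9010") against a JVM started with the arguments above should print the long running traces file followed by the zip entry names, provided the assumptions about the operation signatures hold.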
Outstanding items
- [x] Add an integration test that exercises JMX functionality when dd.telemetry.jmx.enabled=true.
- [x] Limit the number of long running traces added to the flare report, as is already done for the pending trace buffer (MAX_DUMPED_TRACES = 50).
- [ ] Other updates from the list below?
This PR has a number of commits and I suggest reviewing commit-by-commit, paying special attention to the notes in bold below:
- Trace dump refactor in preparation for adding long running traces -- This doesn't need to be kept in its own commit. I kept it separate for now to make review a little easier.
- Add long_running_traces.json to flare report -- Note: this adds synchronized to a few methods (see commit comment for details).
- Track long running traces when agent does not support long running feature -- This could be dropped, but if so, I'd highly suggest keeping the warning message (it would need some rewording). Note: if features.supportsLongRunning() is false, the traces are kept in the TRACKED state, compared to the NOT_TRACKED state previously.
- Add JMX MBean for getting tracer flare files -- Note: see if the JMX MBean ObjectName and operation names sound good. I kept the existing add* methods as-is, but this could be simplified by refactoring the add* methods into Reporter instances (with a new signature that passes a few more arguments to addReportToFlare). I think this refactoring would be a good change to make--let me know and I'll happily do that. I also considered not making the zip file an intermediary, and if you like, I could look at what that change might be, as well.
- LongRunningTracesTracker: add metric for traces dropped due to sampling priority -- This could be dropped. I'm not sure if this is an important metric to track.
- PendingTraceBuffer: Keep track of how often we write around the buffer -- This does seem like a valuable metric to track.
Note: I had a few fixups that I've merged into the above commits.
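Two of the points above, the synchronized methods and the cap on the number of dumped long running traces, can be illustrated together. This is a simplified sketch and not the code in this PR: CappedTraceDump is a hypothetical stand-in for the tracker, traces are represented as pre-serialized strings, and the one-trace-per-line output format is an assumption. Only the MAX_DUMPED_TRACES = 50 value comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tracker whose contents can be dumped into a flare.
final class CappedTraceDump {
  static final int MAX_DUMPED_TRACES = 50; // value quoted for the pending trace buffer

  // Each entry stands for one tracked trace, already serialized to text.
  private final List<String> traces = new ArrayList<>();

  // Mutators and the dump share the instance lock, so a dump taken from another
  // thread (e.g. a flare or JMX request) never sees a half-updated list.
  synchronized void add(String trace) {
    traces.add(trace);
  }

  synchronized boolean remove(String trace) {
    return traces.remove(trace);
  }

  // Returns at most MAX_DUMPED_TRACES traces, one per line.
  synchronized String dump() {
    int limit = Math.min(traces.size(), MAX_DUMPED_TRACES);
    return String.join("\n", traces.subList(0, limit));
  }

  public static void main(String[] args) {
    CappedTraceDump tracker = new CappedTraceDump();
    for (int i = 0; i < 60; i++) {
      tracker.add("trace-" + i);
    }
    // Only the first 50 of the 60 tracked traces end up in the dump.
    System.out.println(tracker.dump().split("\n").length);
  }
}
```

Locking the whole dump keeps the sketch simple; copying the list under the lock and serializing outside it would be an alternative if dump time matters.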
Contributor Checklist
- Format the title according to the contribution guidelines
- Assign the type: and (comp: or inst:) labels in addition to any useful labels
- Don't use close, fix or any linking keywords when referencing an issue. Use solves instead, and assign the PR milestone to the issue
- Update the CODEOWNERS file on source file addition, move, or deletion
- Update the public documentation in case of new configuration flag or behavior
Jira ticket: [PROJ-IDENT]
Jira card for context: APMS-17557
Hi DJ 👋 Thanks for your patience! Your notes and commit organization were really great for understanding this PR - I found them especially useful. I left two nit comments, but otherwise it looks good. Since this PR introduces some changes (e.g. keeping long running traces tracked in memory), I've brought it up for more sets of eyes ;). I'm out all of next week but will get back to you after if others don't beat me to it. Thanks again for the contribution!
Thanks, @sarahchen6 and @manuel-alvarez-alvarez! I'll address the few tweaks suggested. Updates coming shortly.
Fixups made, rebased on latest master, and force pushed.