Add traceparent/traceId support for search
Description
This change adds support for the traceparent header, which a user can send with a request, and thereby supports traceId as well.
It was written and tested specifically for search requests, and it also adds trace-id support to the search slow log.
Sample search slow log captured while the change was manually tested:
[2025-10-28T03:57:33,764][TRACE][i.s.s.query ] [runTask-0] [products][0] took[21.7ms], took_millis[21], total_hits[2 hits], stats[], search_type[QUERY_THEN_FETCH], total_shards[1], source[{"query":{"match":{"category":{"query":"electronics","operator":"OR","prefix_length":0,"max_expansions":50,"fuzzy_transpositions":true,"lenient":false,"zero_terms_query":"NONE","auto_generate_synonyms_phrase_query":true,"boost":1.0}}}}], id[], trace-id[4bf92f3577b34da6a3ce929d0e0e4736]
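For correlating such lines with a caller's trace, a reader could pull the trailing `trace-id[...]` field out of a slow-log line. The sketch below is only an illustration that assumes the plaintext layout shown in the sample above (field last on the line, 32 lowercase hex characters); the class and method names are hypothetical and not part of this PR.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical log-side helper; assumes the slow-log layout from the sample line above. */
final class SlowLogTraceIdSketch {
    // Anchored at the end of the line so that text inside source[...] cannot match by accident.
    private static final Pattern TRACE_ID_FIELD =
        Pattern.compile("trace-id\\[([0-9a-f]{32})\\]\\s*$");

    /** Returns the trace-id recorded at the end of a slow-log line, if one is present. */
    static Optional<String> traceIdOf(String slowLogLine) {
        Matcher m = TRACE_ID_FIELD.matcher(slowLogLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String line = "[products][0] took[21.7ms], id[], trace-id[4bf92f3577b34da6a3ce929d0e0e4736]";
        System.out.println(traceIdOf(line).orElse("<no trace-id>"));
    }
}
```

A line that carries an empty or missing field yields an empty `Optional`, so callers can skip requests that were sent without a traceparent.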
Related Issues
https://github.com/opensearch-project/OpenSearch/issues/18512
Check List
- [x] Functionality includes testing.
- ~[] API changes companion pull request created, if applicable.~
- [ ] Public documentation issue/PR created, if applicable.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Summary by CodeRabbit
- New Features
  - W3C Trace Context (traceparent/trace-id) support for end-to-end trace propagation across search, transport, REST, and task handling.
- Observability
  - Trace-id propagated to responses, request/task headers, and recorded in slow logs for improved request/span correlation and debugging.
- Tests
  - New/updated tests and validators covering traceparent parsing, trace-id extraction, header propagation, and end-to-end trace validation.
:x: Gradle check result for 7dc64ff4e3f9744796cdd36183ba3e6bf10b3e60: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
:x: Gradle check result for c994206baffb6d36ee89428b2e5cb6301198f13b: null
:x: Gradle check result for 5f7eeb10b88f61fe0b18ab1817885083216968cc: null
@reta While the OpenTelemetry plugin does support picking up HTTP headers and capturing traces, it is optional and not part of OpenSearch core, so it is not available by default.
With traceId accessible at the Task level as part of core, it can be logged in slow logs (which is the intention here). This lets users correlate slow or failing requests directly with the trace ID they passed via traceparent, without explicitly installing the OpenTelemetry plugin.
In addition, once traceId is available at the Task level, it can be reused by other components (e.g., metrics, profiling, custom plugins).

> There is no need to introduce any custom headers here

These are not custom headers: traceparent is a standardized HTTP header defined by the W3C Trace Context specification, and OpenTelemetry supports it when the user passes it.
> @reta While OpenTelemetry plugin does provide support to pick http headers and capture traces, but its optional and not part of OpenSearch core. So it does not come by default.

@sgup432 this is by design: augmenting OpenSearch with tracing capabilities is done by the plugin, which is supposed to be installed when users intend to enable tracing. It goes way beyond a simple header (significant effort went into instrumenting all the different flows and propagating context all over the place).
If you need a quick win, pass the traceparent as X-Opaque-Id and it will become available in tasks out of the box. Otherwise I would suggest taking the approach that is consistent and aligned with the existing design. Any suggestions to improve and evolve it are certainly welcome.
> If you need quick win - pass the traceparent as X-Opaque-Id and it will become available in tasks out of the box. Otherwise I would suggest to take the approach that is consistent and aligned with the existing design. Any suggestions to improve and evolve it are certainly welcomed.

We can't use X-Opaque-Id to pass traceparent, because X-Opaque-Id is not supposed to be unique per request, and per-request uniqueness is the intention behind adding traceparent support.
I need a way for users to pass a traceId as a header per request and to log it in the slow log or elsewhere without having to install the OpenTelemetry plugin, since enabling OpenTelemetry or tracing per request is unnecessary and expensive (performance-wise). I don't think supporting this standard header and the OpenTelemetry plugin design are related; they are independent, and in my opinion we should support both. Tracing is just an extension of this.
> We can't use `X-Opaque-id` to pass traceparent as `X-Opaque-Id` is not supposed to be unique per request which is the intention behind adding traceparent support.

Fair point, I would suggest the following options:
- do not abuse `traceparent` but introduce an `X-Request-Id` header, for example, to serve your specific needs
- use the OpenTelemetry plugin with a tailored configuration (we have many ways here to reduce overhead)
- introduce another telemetry plugin with a lightweight implementation
> do not abuse traceparent but introduce X-Request-Id header fe to serve your specific needs

I don't think I agree with the "abusing traceparent" part. traceparent is a standard (and vendor-neutral) header that can be used as a plain header and is not tied to any framework, so it does not have to be used with a tracing framework. If tracing/OpenTelemetry is enabled, the tracing framework can simply use it; otherwise it can be propagated through the system for logging purposes.
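To make the "plain header" usage concrete, here is a minimal client-side sketch that generates a fresh traceparent per request and attaches it to a search call. It is an illustration only: the class and method names are hypothetical, the base URL is a placeholder, and the `01` (sampled) flag is an arbitrary choice. The field layout (version, 16-byte trace-id, 8-byte parent-id, flags, all lowercase hex) follows the W3C Trace Context specification.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.security.SecureRandom;

/** Hypothetical client-side helper; not part of this PR. */
final class TraceparentClientSketch {
    private static final SecureRandom RANDOM = new SecureRandom();
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    /** Random lowercase hex of numBytes bytes; retries on all-zero, which W3C treats as invalid. */
    static String randomHex(int numBytes) {
        byte[] bytes = new byte[numBytes];
        boolean allZero = true;
        while (allZero) {
            RANDOM.nextBytes(bytes);
            for (byte b : bytes) {
                if (b != 0) { allZero = false; break; }
            }
        }
        StringBuilder sb = new StringBuilder(numBytes * 2);
        for (byte b : bytes) {
            sb.append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
        }
        return sb.toString();
    }

    /** version 00, 16-byte trace-id, 8-byte parent-id, flags 01 (assumed choice). */
    static String newTraceparent() {
        return "00-" + randomHex(16) + "-" + randomHex(8) + "-01";
    }

    /** Builds (but does not send) a search request that carries the header. */
    static HttpRequest searchRequest(String baseUrl, String index, String body, String traceparent) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_search"))
            .header("Content-Type", "application/json")
            .header("traceparent", traceparent)
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        String traceparent = newTraceparent();
        // "http://localhost:9200" is a placeholder endpoint for illustration.
        HttpRequest request = searchRequest("http://localhost:9200", "products",
            "{\"query\":{\"match\":{\"category\":\"electronics\"}}}", traceparent);
        System.out.println(request.uri() + " traceparent=" + traceparent);
    }
}
```

Because each call to `newTraceparent()` yields a new trace-id, the value is unique per request, which is the property discussed above as missing from X-Opaque-Id.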
> > do not abuse traceparent but introduce X-Request-Id header fe to serve your specific needs
>
> I don't think I agree with the abusing `traceparent` part. traceparent is a standard(and vendor neutral) header which can be used as a plain header and not tied to any framework. So one does not require using it necessarily with tracing framework. If tracing/OpenTelemetry is enabled, tracing framework can simply use it, otherwise it can be propagated in the system for logging purpose.

+1. As a standard header, including traceparent in the core appears reasonable.
Walkthrough
Adds W3C Trace Context handling: validates/extracts traceparent, exposes trace-id, propagates traceparent/trace-id through REST, thread context, transport, tasks, and slow-logs; updates tracer propagation and tests; introduces test feature-flag overrides to disable telemetry.
Changes
| Cohort / File(s) | Change Summary |
|---|---|
| Trace header utility & tests: `server/src/main/java/org/opensearch/common/util/TraceUtil.java`, `server/src/test/java/org/opensearch/common/util/TraceUtilTests.java` | New TraceUtil validates W3C traceparent and extracts trace-id; comprehensive unit tests for validation and edge cases. |
| Task header constants: `server/src/main/java/org/opensearch/tasks/Task.java` | Added TRACE_PARENT, TRACE_ID, and public REQUEST_HEADERS set including X-Opaque-Id, traceparent, and trace-id. |
| ThreadContext / Node header handling: `server/src/main/java/org/opensearch/common/util/concurrent/ThreadContext.java`, `server/src/main/java/org/opensearch/node/Node.java` | Generalized stash/propagation to iterate Task.REQUEST_HEADERS (removed special-case X-OPAQUE-ID); node taskHeaders use REQUEST_HEADERS. |
| REST, controller, and response propagation: `server/src/main/java/org/opensearch/action/ActionModule.java`, `server/src/main/java/org/opensearch/rest/RestController.java`, `server/src/main/java/org/opensearch/http/DefaultRestChannel.java` | Add TRACE_PARENT to initial REST headers; extract trace-id into thread context during header processing; response copies traceparent when present. |
| Transport / tracer integration: `server/src/main/java/org/opensearch/transport/TransportService.java`, `libs/telemetry/src/main/java/org/opensearch/telemetry/tracing/DefaultTracer.java` | Pass thread-context headers into tracer/span creation to propagate traceparent; DefaultTracer short-circuits header extraction when an explicit parent exists. |
| Slow log enrichment & tests: `server/src/main/java/org/opensearch/action/search/SearchRequestSlowLog.java`, `server/src/main/java/org/opensearch/index/SearchSlowLog.java`, `server/src/test/java/org/opensearch/index/SearchSlowLogTests.java` | Add trace-id (from Task.TRACE_ID) to structured slow-log map and plaintext log line; tests updated to include TRACE_ID. |
| Telemetry test helpers & validators: `test/telemetry/src/main/java/org/opensearch/test/telemetry/tracing/MockTracingContextPropagator.java`, `test/telemetry/src/main/java/org/opensearch/test/telemetry/tracing/validators/AllSpansHaveCorrectTraceId.java`, `plugins/telemetry-otel/.../TelemetryTracerEnabledSanityIT.java` | Mock propagator emits W3C-style traceparent and trace-id; new validator ensures spans carry expected trace-id; sanity IT sends headers and validates spans. |
| Task tests and task header assertions: `server/src/internalClusterTest/.../TasksIT.java` | Tests updated to inject and assert TRACE_PARENT and TRACE_ID headers; expected header counts and conditional checks adjusted. |
| Test feature-flag overrides: `modules/transport-netty4/...OpenSearchNetty4IntegTestCase.java`, `plugins/transport-reactor-netty4/...OpenSearchReactorNetty4IntegTestCase.java`, `qa/smoke-test-http/src/test/java/org/opensearch/http/HttpSmokeTestCase.java` | Added featureFlagSettings() overrides to disable telemetry (FeatureFlags.TELEMETRY_SETTING=false) for test base classes. |
| Action module & changelog: `server/src/main/java/org/opensearch/action/ActionModule.java`, `CHANGELOG.md` | TRACE_PARENT included in initial REST headers; changelog entry added for traceparent/traceId support under Unreleased 3.x. |
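The validation and trace-id extraction summarized for TraceUtil above could look roughly like the following. This is a sketch under stated assumptions, not the PR's TraceUtil: the class name is hypothetical, it returns an empty `Optional` instead of throwing (the walkthrough mentions exact exception messages), and it accepts only the four-field layout of traceparent version 00. The rules it checks (lowercase hex field widths, version `ff` invalid, all-zero trace-id or parent-id invalid) come from the W3C Trace Context specification.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not the TraceUtil class added by this PR. */
final class TraceparentSketch {
    // version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.
    private static final Pattern TRACEPARENT =
        Pattern.compile("^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");
    private static final String ZERO_TRACE_ID = "0".repeat(32);
    private static final String ZERO_PARENT_ID = "0".repeat(16);

    /** Returns the 32-character trace-id if the header is well formed, otherwise empty. */
    static Optional<String> extractTraceId(String traceparent) {
        if (traceparent == null) {
            return Optional.empty();
        }
        Matcher m = TRACEPARENT.matcher(traceparent.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        if ("ff".equals(m.group(1))) {
            return Optional.empty(); // version ff is reserved as invalid
        }
        if (ZERO_TRACE_ID.equals(m.group(2)) || ZERO_PARENT_ID.equals(m.group(3))) {
            return Optional.empty(); // all-zero ids are invalid
        }
        return Optional.of(m.group(2));
    }

    public static void main(String[] args) {
        String header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
        System.out.println(extractTraceId(header).orElse("<invalid>"));
    }
}
```

Returning `Optional` keeps the sketch short; a stricter variant could throw with a descriptive message per failed rule so that callers can report why a header was rejected.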
Sequence Diagram(s)
sequenceDiagram
participant Client
participant REST as RestController
participant ThreadCtx as ThreadContext
participant Tracer
participant Transport
participant SlowLog
Client->>REST: HTTP request with `traceparent` header
REST->>REST: de-duplicate header values
REST->>REST: call TraceUtil.extractTraceId(traceparent)
REST->>ThreadCtx: put `trace-id` into thread context
REST->>Tracer: startSpan(..., headers=ThreadCtx.getHeaders())
Tracer->>Transport: propagate headers on outgoing request
Transport->>Tracer: receiver starts span with propagated headers
Tracer->>SlowLog: attach `trace-id` to slow-log entries
REST->>Client: HTTP response (copies `traceparent` header)
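The first steps of the diagram (REST extracts the trace-id, then an allow-list of request headers travels with the task) can be sketched with plain maps. Everything here is a simplified stand-in: the class and method names are hypothetical, a `Map` replaces ThreadContext, and the header strings `traceparent`, `trace-id`, and `X-Opaque-Id` are taken from the walkthrough text rather than from the actual constants in Task.java.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Simplified stand-in for the REST -> thread context -> task header flow; not the PR's code. */
final class HeaderPropagationSketch {
    static final String X_OPAQUE_ID = "X-Opaque-Id";
    static final String TRACE_PARENT = "traceparent";
    static final String TRACE_ID = "trace-id";
    // Mirrors the idea of Task.REQUEST_HEADERS: the only headers copied onto a task.
    static final Set<String> REQUEST_HEADERS = Set.of(X_OPAQUE_ID, TRACE_PARENT, TRACE_ID);

    /** REST step: copy request headers into a per-request context and derive trace-id. */
    static Map<String, String> onRestRequest(Map<String, String> httpHeaders) {
        // Note: lookups here are case-sensitive; real HTTP header handling is not.
        Map<String, String> context = new HashMap<>(httpHeaders);
        String traceparent = httpHeaders.get(TRACE_PARENT);
        if (traceparent != null) {
            // Full W3C validation is elided; only the field count and width are checked.
            String[] fields = traceparent.split("-");
            if (fields.length == 4 && fields[1].length() == 32) {
                context.put(TRACE_ID, fields[1]);
            }
        }
        return context;
    }

    /** Task step: keep only allow-listed headers so they follow the request. */
    static Map<String, String> taskHeaders(Map<String, String> context) {
        Map<String, String> headers = new HashMap<>();
        for (String name : REQUEST_HEADERS) {
            String value = context.get(name);
            if (value != null) {
                headers.put(name, value);
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> incoming = Map.of(
            TRACE_PARENT, "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
            "Accept", "application/json");
        System.out.println(taskHeaders(onRestRequest(incoming)));
    }
}
```

Iterating over one allow-list, rather than special-casing each header, is the design choice the walkthrough attributes to the ThreadContext change; adding another propagated header then only means extending the set.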
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~30 minutes
- Files needing extra attention:
  - `TraceUtil.java`: strict W3C validation logic and exact exception messages.
  - `RestController.java` / `DefaultRestChannel.java`: header de-duplication, trace-id extraction, and response copying.
  - `DefaultTracer.java` / `TransportService.java`: header propagation and short-circuit behavior.
  - Tests & mocks (`MockTracingContextPropagator`, `AllSpansHaveCorrectTraceId`, updated ITs): ensure expectations match new header formats and feature-flag overrides.
Suggested labels
enhancement, Search, Other
Suggested reviewers
- reta
- peternied
- andrross
- cwperks
- dbwiddis
- sachinpkale
- msfroh
- gbbafna
Pre-merge checks
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.66% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Add traceparent/traceId support for search' directly and clearly describes the main feature addition in the changeset: support for traceparent header and traceId extraction for search operations. |
| Description check | ✅ Passed | The description provides clear context about the change (traceparent header extraction and traceId support for search), includes a concrete sample slow-log output demonstrating the feature, links the related GitHub issue, and confirms testing was included. All critical sections from the template are addressed. |
@kkewwei Could you take another look, and help merge these changes if they look good?
:x: Gradle check result for 17c806080efa1a193fb9a5a486dc7daf8ed7c82e: FAILURE
:x: Gradle check result for c8582ccb1821012b75a1a14bc836e227525f932c: FAILURE
:x: Gradle check result for 30d058cb79605457617383cdfb5895c4d86e2502: FAILURE
:x: Gradle check result for e6e86afef407b42198cef716a83b1a65369a4e75: FAILURE
:x: Gradle check result for ceeba753edc8b5c54025a60e9d0195c22088c818: FAILURE
:x: Gradle check result for 9f95cc9ec2540bbbf3b4dfd4788827c557eca34b: FAILURE
:x: Gradle check result for e98610a9374a06361cd9020b0468285aff3d4d75: FAILURE
:x: Gradle check result for 24f9e616990500f4279eabe5d225275ed06acfe0: FAILURE
:x: Gradle check result for 1d294609d089114c93030cc0e508573c1d5d9024: FAILURE
:x: Gradle check result for e7373a4f404a7a13d8dbdd093d5389c1d5b7122d: FAILURE
:x: Gradle check result for 1f7b5d067e6914325170017b694a0c90f1f50e11: FAILURE
Hi all! We are currently doing some work for a Fortune 500 Company and using the AWS OpenSearch Managed Service. We have been looking for some functionality to trace requests through our downstream services and into OpenSearch and the functionality in this PR looks perfect to support our use case!
Just wanted to leave a comment because I've seen some conversation here about this PR replicating functionality which already exists via the OpenTelemetry Plugin. However, this Plugin is not supported in the AWS OpenSearch Managed Service. As such, including this functionality natively in OpenSearch would (eventually) allow users of the Managed Service access to tracing functionality, which they do not have now. This would be hugely useful for our use case, and I'm sure many others, so definitely in support of getting this PR merged in, since it does contain net new functionality, at least for users of the Managed Service.
@spencersolomon6 Good to know! @kkewwei I am going to retry the Gradle build since there are some flaky tests; I might need some help getting this merged after the tests pass.
:x: Gradle check result for e1b25017f61faaa84d7b3514206797f27580a4d3: ABORTED
:x: Gradle check result for 990fe37db42123bd08bae005fb05b48e9bc7d083: FAILURE
:grey_exclamation: Gradle check result for e683e51b79be71ddc65b872085788dac67f7964e: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Codecov Report
:x: Patch coverage is 78.94737% with 12 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 73.26%. Comparing base (84bc68e) to head (ce6ee9c).
Additional details and impacted files
@@ Coverage Diff @@
## main #19798 +/- ##
============================================
+ Coverage 73.21% 73.26% +0.05%
- Complexity 71776 71801 +25
============================================
Files 5795 5796 +1
Lines 328304 328354 +50
Branches 47281 47296 +15
============================================
+ Hits 240374 240575 +201
+ Misses 68684 68492 -192
- Partials 19246 19287 +41
@kkewwei Can you help with the merge if it looks okay to you?