Add OpenTelemetry instrumentation for business layer

Open hieheihei opened this issue 1 month ago • 2 comments

Self Checks

  • [x] I have read the Contributing Guide and Language Policy.
  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report, otherwise it will be closed.
  • [x] Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

We have integrated the OpenTelemetry SDK with automatic instrumentation for infrastructure components (HTTP, DB, Redis, Celery). However, the collected telemetry is limited: only infrastructure-level spans are captured.

The existing OpsTrace mechanism captures Workflow execution data; however, its asynchronous post-execution collection model differs from OpenTelemetry's real-time instrumentation, making direct trace correlation impossible.

We suggest adding OpenTelemetry instrumentation to the business layer. The proposed instrumentation targets include: AppGenerateService.generate() (entry point), various AppRunner.run() methods (e.g., WorkflowAppRunner, ChatAppRunner), Workflow and Node execution, LLM model invocations, and RAG retrieval calls.

2. Additional context or comments

No response

3. Can you help us with this feature?

  • [x] I am interested in contributing to this feature.

hieheihei • Nov 20 '25 07:11

Overview

We plan to add OpenTelemetry (OTel) instrumentation at two levels:

  1. Regular methods (e.g., AppGenerateService.generate)
  2. Workflow execution methods in the graph engine

Approach 1: Decorator-based Instrumentation for Regular Methods

For regular methods, we use a decorator-based approach:

@trace_span(AppGenerateHandler)
def generate(...):
    ...

This decorator wraps the method in an OTel span and makes that span current for the duration of the call, so that spans produced by automatic instrumentation of downstream calls (HTTP requests, database queries, etc.) nest correctly as its children.
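As a rough sketch, such a decorator could look like the following. The `span_name` and `attributes()` members of `AppGenerateHandler`, the `streaming` keyword, the optional `tracer` parameter, and `_default_tracer()` are assumptions made for illustration, not existing Dify code. Only `trace.get_tracer`, `start_as_current_span`, and `set_attribute` are actual OpenTelemetry API calls.

```python
import functools


class AppGenerateHandler:
    """Hypothetical handler: names the span and extracts its attributes."""

    span_name = "app.generate"

    @staticmethod
    def attributes(args, kwargs):
        # "streaming" is an assumed keyword argument; the real signature may differ.
        return {"app.streaming": bool(kwargs.get("streaming", False))}


def _default_tracer():
    # Imported lazily so this module still loads when OTel is not installed.
    from opentelemetry import trace

    return trace.get_tracer(__name__)


def trace_span(handler, tracer=None):
    """Wrap a function in a span described by `handler`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            active = tracer if tracer is not None else _default_tracer()
            # start_as_current_span makes the span current for the duration
            # of the call, so spans created downstream become its children.
            with active.start_as_current_span(handler.span_name) as span:
                for key, value in handler.attributes(args, kwargs).items():
                    span.set_attribute(key, value)
                return func(*args, **kwargs)

        return wrapper

    return decorator
```

One caveat with this shape: if the wrapped method returns a generator (for example, a streaming response), the span ends when the generator object is returned, not when it has been consumed, so streaming paths would likely need separate handling.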

Approach 2: Graph Engine Instrumentation Options

For workflow execution, we have two options:

Option 2a: Decorator-based Approach

  • Add the decorator at the base node execution entry point (e.g., Node.run)
  • The decorator delegates to Node-specific handlers that branch by node_type
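The delegation step in Option 2a could be a simple registry keyed by node_type. The sketch below is only illustrative: the handler classes, the attribute keys (`node.type`, `llm.model`), and the `model_name` attribute on the node are all hypothetical names, not part of the current graph engine.

```python
class NodeHandler:
    """Hypothetical base handler: turns a node into span attributes."""

    def attributes(self, node):
        return {"node.type": node.node_type}


class LLMNodeHandler(NodeHandler):
    """Hypothetical handler adding LLM-specific attributes."""

    def attributes(self, node):
        attrs = super().attributes(node)
        # "model_name" is an assumed attribute used only for illustration.
        attrs["llm.model"] = getattr(node, "model_name", "unknown")
        return attrs


# Registry of node-specific handlers; unknown types fall back to the base one.
_HANDLERS = {"llm": LLMNodeHandler()}
_DEFAULT_HANDLER = NodeHandler()


def handler_for(node):
    """Pick the handler for a node based on its node_type."""
    return _HANDLERS.get(node.node_type, _DEFAULT_HANDLER)
```

A decorator on the node entry point would then call `handler_for(node).attributes(node)` and copy the result onto the span, keeping node-specific logic out of the decorator itself.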

Option 2b: Layer-based Approach

  • Leverage the graph engine's existing Layer system

Challenge with Current Layer Mechanism

The Layer system's on_event hook is called after node execution completes. This creates a problem:

  • If we rely only on on_event, we cannot inject OTel context during node execution
  • The node's span (Span A) would only be created once on_event fires, so an HTTP request made during execution (Span B) is created without Span A in its OTel context
  • As a result, the trace cannot establish Span A as the parent of Span B

Proposed Solution: Extend Graph Engine with Node Execution Hooks

To solve this, we propose extending the graph engine to add two new Layer hooks:

  1. on_node_run_start: Called immediately before node execution begins
  2. on_node_run_end: Called immediately after node execution completes

Implementation Details

Flow

  1. In on_node_run_start: Create the node span and attach it as the current OTel context before node execution
  2. During node execution: Automatic instrumentation (HTTP, DB, etc.) will correctly inherit the parent span context
  3. In on_event: Parse execution results and set span attributes based on event data
  4. In on_node_run_end: Clean up OTel context

Implementation Considerations

Several subtle issues need attention:

  • Span Caching: Cache active spans in memory using node._node_execution_id as the key
  • Concurrency: Handle concurrent access to cached spans across multiple worker threads
  • Event Timing: on_event may arrive asynchronously; ensure span attributes can be set even if events arrive after on_node_run_end
  • Context Management: Manually manage OTel context tokens for proper propagation and cleanup
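One possible way to handle the span caching, concurrency, and event timing points together is a lock-protected cache that only releases a span once both on_node_run_end and the final on_event have been seen, in either order. `SpanCache` and its method names are hypothetical, and the sketch treats the span as an opaque object.

```python
import threading
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class _Entry:
    span: Any
    run_ended: bool = False   # on_node_run_end has fired
    event_seen: bool = False  # the final on_event has been applied


class SpanCache:
    """Thread-safe map from node execution ID to its open span."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: Dict[str, _Entry] = {}

    def open(self, execution_id: str, span: Any) -> None:
        with self._lock:
            self._entries[execution_id] = _Entry(span)

    def get(self, execution_id: str) -> Optional[Any]:
        with self._lock:
            entry = self._entries.get(execution_id)
            return entry.span if entry is not None else None

    def mark_run_ended(self, execution_id: str) -> Optional[Any]:
        return self._mark(execution_id, "run_ended")

    def mark_event_seen(self, execution_id: str) -> Optional[Any]:
        return self._mark(execution_id, "event_seen")

    def _mark(self, execution_id: str, flag: str) -> Optional[Any]:
        """Set one flag; return the span once both are set, else None."""
        with self._lock:
            entry = self._entries.get(execution_id)
            if entry is None:
                return None
            setattr(entry, flag, True)
            if entry.run_ended and entry.event_seen:
                # Both sides are done: hand the span to the caller to end it.
                del self._entries[execution_id]
                return entry.span
            return None
```

Whichever hook gets a span back from `mark_run_ended` or `mark_event_seen` would be the one to end it; because the check and the removal happen under the same lock, only one caller can receive it. The sketch does not cover nodes that never emit a final event, which would need some eviction or timeout policy to avoid leaking entries.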

hieheihei • Nov 20 '25 13:11

Your analysis is spot on: the current Layer system in Dify’s graph engine only provides post-execution hooks like on_event, so it can’t inject OpenTelemetry context during node execution. This means downstream spans (e.g., HTTP, DB) won’t be properly parented unless you add pre- and post-execution hooks as you proposed. The engine has no built-in per-node hooks equivalent to on_node_run_start and on_node_run_end, so you’d need to extend it to support them for real-time context propagation and span management (source).

For span caching, the node execution ID (used in the persistence layer) is unique and stable for the duration of node execution, so it’s suitable as a cache key even with concurrency (source). The event manager guards its state with a lock, but you’ll need to ensure your span cache is also thread-safe.

Event timing is a real concern: on_event is always post-execution, and events may arrive after your proposed on_node_run_end. You’ll need to design your span lifecycle and attribute-setting logic to handle this, possibly by keeping spans open until all relevant events are processed or by updating span attributes after the fact if your tracing backend supports it.

The decorator-based approach for regular business methods is straightforward and aligns with Python best practices.

In summary: your plan to add pre/post node execution hooks is necessary for correct OTel context propagation in Dify’s workflow engine. Pay close attention to concurrency and event timing when managing span lifecycles. Let me know if you want a code-level breakdown or have questions about specific extension points!

dosubot[bot] • Nov 20 '25 13:11