Add OpenTelemetry instrumentation for business layer
Self Checks
- [x] I have read the Contributing Guide and Language Policy.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report, otherwise it will be closed.
- [x] Please do not modify this template :) and fill in all the required fields.
1. Is this request related to a challenge you're experiencing? Tell me about your story.
We have integrated the OpenTelemetry SDK with automatic instrumentation for infrastructure components (HTTP, DB, Redis, Celery). However, the collected telemetry is limited: only infrastructure-level spans are captured, with no spans for business operations.
The existing OpsTrace mechanism captures Workflow execution data; however, its asynchronous post-execution collection model differs from OpenTelemetry's real-time instrumentation, making direct trace correlation impossible.
We suggest adding OpenTelemetry instrumentation to the business layer. The proposed instrumentation targets include: AppGenerateService.generate() (entry point), various AppRunner.run() methods (e.g., WorkflowAppRunner, ChatAppRunner), Workflow and Node execution, LLM model invocations, and RAG retrieval calls.
2. Additional context or comments
No response
3. Can you help us with this feature?
- [x] I am interested in contributing to this feature.
Overview
We plan to add OpenTelemetry (OTel) instrumentation at two levels:
- Regular methods (e.g., AppGenerateService.generate)
- Workflow execution methods in the graph engine
Approach 1: Decorator-based Instrumentation for Regular Methods
For regular methods, we use a decorator-based approach:
    @trace_span(AppGenerateHandler)
    def generate(...):
        ...
This decorator wraps the method with OTel span creation and context propagation, allowing automatic instrumentation of downstream calls (HTTP requests, database queries, etc.) to be correctly nested as child spans.
Approach 2: Graph Engine Instrumentation Options
For workflow execution, we have two options:
Option 2a: Decorator-based Approach
- Add the decorator at the base node execution entry point (e.g., Node.run)
- The decorator delegates to Node-specific handlers that branch by node_type
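A rough sketch of what Option 2a could look like, under several assumptions: the registry `_HANDLERS`, the `register_handler` and `trace_node` helpers, and the `Node`/`LLMNode`/`CodeNode` classes below are all hypothetical and simplified, not the engine's real classes. The sketch only shows the dispatch by `node_type`; span creation itself is left as a comment.

```python
import functools
from typing import Callable, Dict

# Hypothetical registry: node_type -> function that extracts span attributes.
_HANDLERS: Dict[str, Callable[[object], dict]] = {}


def register_handler(node_type: str):
    def decorator(func):
        _HANDLERS[node_type] = func
        return func

    return decorator


@register_handler("llm")
def llm_attributes(node) -> dict:
    return {"node.type": "llm", "llm.model": node.model}


def default_attributes(node) -> dict:
    return {"node.type": node.node_type}


def trace_node(run):
    """Hypothetical decorator for the base run method; picks a handler by node_type."""

    @functools.wraps(run)
    def wrapper(self, *args, **kwargs):
        handler = _HANDLERS.get(self.node_type, default_attributes)
        # A real implementation would start a span here and set these on it.
        self.span_attributes = handler(self)
        return run(self, *args, **kwargs)

    return wrapper


class Node:
    node_type = "base"

    @trace_node
    def run(self):
        return self._run()

    def _run(self):
        return "ok"


class LLMNode(Node):
    node_type = "llm"
    model = "example-model"  # placeholder value


class CodeNode(Node):
    node_type = "code"


llm_node, code_node = LLMNode(), CodeNode()
llm_result = llm_node.run()
code_node.run()
```

Because the decorator sits on the base class, subclasses are traced without any change of their own; the trade-off is that tracing logic becomes coupled to the node class hierarchy.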
Option 2b: Layer-based Approach
- Leverage the graph engine's existing Layer system
Challenge with Current Layer Mechanism
The Layer system's on_event hook is called after node execution completes. This creates a problem:
- If we only rely on on_event, we cannot inject OTel context during node execution
- When a node execution creates Span A, and that execution makes an HTTP request (Span B), Span B is created without Span A's OTel context
- As a result, the trace cannot correctly establish Span A as the parent of Span B
Proposed Solution: Extend Graph Engine with Node Execution Hooks
To solve this, we propose extending the graph engine to add two new Layer hooks:
- on_node_run_start: Called immediately before node execution begins
- on_node_run_end: Called immediately after node execution completes
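As a sketch of how the two hooks could be wired in, assuming a simplified `Layer` base class and a hypothetical `run_node_with_layers` wrapper on the worker side (neither is the engine's actual code, and the `error` parameter on `on_node_run_end` is an assumption):

```python
class Layer:
    """Simplified stand-in for the graph engine's layer base class."""

    def on_node_run_start(self, node) -> None:
        pass  # default: no-op, so existing layers keep working

    def on_node_run_end(self, node, error=None) -> None:
        pass

    def on_event(self, event) -> None:
        pass


def run_node_with_layers(node, layers):
    """Hypothetical worker-side wrapper around a single node execution."""
    for layer in layers:
        layer.on_node_run_start(node)
    error = None
    try:
        return node.run()
    except Exception as exc:
        error = exc
        raise
    finally:
        # End hooks run in reverse order so layers unwind like nested contexts.
        for layer in reversed(layers):
            layer.on_node_run_end(node, error)


class RecordingLayer(Layer):
    def __init__(self, name, log):
        self.name, self.log = name, log

    def on_node_run_start(self, node):
        self.log.append(f"{self.name}:start")

    def on_node_run_end(self, node, error=None):
        self.log.append(f"{self.name}:end")


class EchoNode:
    def __init__(self, log):
        self.log = log

    def run(self):
        self.log.append("run")
        return "done"


log = []
layers = [RecordingLayer("a", log), RecordingLayer("b", log)]
result = run_node_with_layers(EchoNode(log), layers)
print(log)
```

Calling the end hook from a `finally` block means cleanup still happens when the node raises, which matters once the hook is responsible for detaching tracing context.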
Implementation Details
Flow
- In on_node_run_start: Create and inject OTel context (span) before node execution
- During node execution: Automatic instrumentation (HTTP, DB, etc.) will correctly inherit the parent span context
- In on_event: Parse execution results and set span attributes based on event data
- In on_node_run_end: Clean up OTel context
Implementation Considerations
Several subtle issues need attention:
- Span Caching: Cache active spans in memory using node._node_execution_id as the key
- Concurrency: Handle concurrent access to cached spans across multiple worker threads
- Event Timing: on_event may arrive asynchronously; ensure span attributes can be set even if events arrive after on_node_run_end
- Context Management: Manually manage OTel context tokens for proper propagation and cleanup
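One possible way to handle the caching, concurrency, and event-timing points together is to guard the cache with a lock and finalize an entry only once both the end hook and the result event have been seen. The sketch below is stdlib-only and hypothetical (`SpanCache` and its method names are not part of the proposal); plain dicts stand in for spans. The deferral matters because, in the OpenTelemetry SDK, attributes set after a span has ended are dropped.

```python
import threading


class SpanCache:
    """Hypothetical thread-safe cache keyed by node execution id."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # execution_id -> pending state
        self.exported = []  # (execution_id, attributes) pairs, in finish order

    def start(self, execution_id):
        with self._lock:
            self._entries[execution_id] = {
                "attributes": {},
                "run_ended": False,
                "event_seen": False,
            }

    def mark_run_ended(self, execution_id):
        self._finish(execution_id, "run_ended")

    def record_event(self, execution_id, attributes):
        self._finish(execution_id, "event_seen", attributes)

    def _finish(self, execution_id, flag, attributes=None):
        with self._lock:
            entry = self._entries.get(execution_id)
            if entry is None:
                return  # unknown or already exported
            entry["attributes"].update(attributes or {})
            entry[flag] = True
            if entry["run_ended"] and entry["event_seen"]:
                # Only now would the real span be ended and exported.
                done = self._entries.pop(execution_id)
                self.exported.append((execution_id, done["attributes"]))


cache = SpanCache()


def worker(i):
    execution_id = f"exec-{i}"
    cache.start(execution_id)
    cache.mark_run_ended(execution_id)  # node finished first
    cache.record_event(execution_id, {"i": i})  # result event arrives late


threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A production version would likely also need an eviction or timeout policy, since an entry whose event never arrives would otherwise stay in the cache indefinitely.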
Your analysis is spot on: the current Layer system in Dify’s graph engine only provides post-execution hooks like on_event, so it can’t inject OpenTelemetry context during node execution. This means downstream spans (e.g., HTTP, DB) won’t be properly parented unless you add pre- and post-execution hooks as you proposed. There are no built-in node-level start/end hooks equivalent to the proposed on_node_run_start and on_node_run_end, so you’d need to extend the engine to support these for real-time context propagation and span management (source).
For span caching, the node execution ID (used in the persistence layer) is unique and stable for the duration of node execution, so it’s suitable as a cache key even with concurrency (source). The event manager uses a thread-safe lock, but you’ll need to ensure your span cache is also thread-safe.
Event timing is a real concern: on_event is always post-execution, and events may arrive after your proposed on_node_run_end. You’ll need to design your span lifecycle and attribute-setting logic to handle this, possibly by keeping spans open until all relevant events are processed or by updating span attributes after the fact if your tracing backend supports it.
The decorator-based approach for regular business methods is straightforward and aligns with Python best practices.
In summary: your plan to add pre/post node execution hooks is necessary for correct OTel context propagation in Dify’s workflow engine. Pay close attention to concurrency and event timing when managing span lifecycles. Let me know if you want a code-level breakdown or have questions about specific extension points!