julep Add tiered telemetry

User description

Summary

add OpenTelemetry-based telemetry module for agents-api
init telemetry in web app startup
capture workflow metrics in Temporal client
increment query metrics via telemetry counters
include OTel dependencies

Testing

ruff format .
ruff check . (auto-fixed)
pyright agents_api (fails: many missing import errors)
ward (fails: command not found)

PR Type

Enhancement

Description

Introduce OpenTelemetry-based telemetry module for agents-api
- Add telemetry.py for tiered telemetry setup and management
- Support configuration via environment and YAML config
- Integrate crash, metrics, and latency monitoring
Integrate telemetry into workflow and query metrics
- Increment telemetry counters in Temporal workflow and query handlers
Initialize telemetry during FastAPI app startup
Add OpenTelemetry dependencies to project requirements

Changes walkthrough 📝

Relevant files

Enhancement

telemetry.py `Add OpenTelemetry-based telemetry module with tiered levels` agents-api/agents_api/telemetry.py Add new module for OpenTelemetry-based telemetry management Support tiered telemetry levels and persistent instance ID Integrate crash, metrics, and latency monitoring middleware Provide initialization function for FastAPI app	+155/-0
web.py `Initialize telemetry in FastAPI app startup` agents-api/agents_api/web.py Import and initialize telemetry during FastAPI app startup	+3/-0
temporal.py `Integrate telemetry workflow counter in Temporal client` agents-api/agents_api/clients/temporal.py Import telemetry module Increment workflow counter after starting workflow if telemetry enabled	+7/-1
counters.py `Integrate telemetry queries counter in metrics decorators` agents-api/agents_api/metrics/counters.py Import telemetry module Increment queries counter in metric decorators if telemetry enabled	+6/-0

Dependencies

pyproject.toml

Add OpenTelemetry dependencies to project requirements

agents-api/pyproject.toml

Add OpenTelemetry SDK, exporter, and FastAPI instrumentation to
dependencies

+3/-0

Need help?
Type /help how to ... in the comments thread for any questions about Qodo Merge usage.
Check out the documentation for more information.

[!IMPORTANT] Add OpenTelemetry-based telemetry module to agents-api, capturing workflow and query metrics, and update dependencies.

Telemetry Module:

Adds telemetry.py for OpenTelemetry-based telemetry in agents-api.

Initializes telemetry in web.py during app startup with init_telemetry().

Metrics:

Captures workflow metrics in run_task_execution_workflow() in temporal.py.

Increments query metrics in counters.py using telemetry counters.

Dependencies:

Adds opentelemetry-sdk, opentelemetry-exporter-otlp, and opentelemetry-instrumentation-fastapi to pyproject.toml.

^{This description was created by}^{for a63b63683cec2ee45bc1b38f461dda39704fb4b6. You can customize this summary. It will automatically update as commits are pushed.}

May 19 '25 11:05 creatorrr

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪

🧪 No relevant tests

🔒 No security concerns identified

⚡ Recommended focus areas for review

Error Handling

The _load_config function doesn't handle YAML parsing errors, which could cause the application to crash if the config file is malformed.

def _load_config() -> dict:
    if CONFIG_PATH.exists():
        with open(CONFIG_PATH) as f:
            return yaml.safe_load(f) or {}
    return {}

Resource Leak

The latency_middleware function creates a trace span but doesn't properly clean it up, potentially causing resource leaks in high-traffic scenarios.

@self.app.middleware("http")
async def latency_middleware(request: Request, call_next):
    start = trace.get_current_span().context.trace_id
    del start  # AIDEV-NOTE: trace id not needed; using for type check
    start_time = time.monotonic()
    response = await call_next(request)
    duration = time.monotonic() - start_time
    latency_hist.record(duration, {"route": request.url.path})
    return response

Configuration Validation

The telemetry level is cast to int without validation, which could cause crashes if non-integer values are provided in the environment variable or config file.

def _determine_level(self) -> tuple[int, str]:
    level_env = os.getenv("JULEP_TELEMETRY")
    config = _load_config()
    if level_env is not None:
        level = int(level_env)
    else:
        level = int(config.get("telemetry", DEFAULT_LEVEL))

May 19 '25 11:05 qodo-merge-for-open-source[bot]

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Add error handling The code attempts to convert potentially non-integer values to integers without error handling. If the environment variable or config value contains non-integer data, this will crash. Add error handling to gracefully handle invalid values. agents-api/agents_api/telemetry.py [56-62] `def _determine_level(self) -> tuple[int, str]: level_env = os.getenv("JULEP_TELEMETRY") config = _load_config() - if level_env is not None: - level = int(level_env) - else: - level = int(config.get("telemetry", DEFAULT_LEVEL)) + try: + if level_env is not None: + level = int(level_env) + else: + level = int(config.get("telemetry", DEFAULT_LEVEL)) + except (ValueError, TypeError): + logger.warning("Invalid telemetry level, using default") + level = DEFAULT_LEVEL` [ ] Apply / Chat Suggestion importance[1-10]: 8 __ Why: Adding error handling for parsing the telemetry level prevents potential crashes from invalid configuration, which is important for robustness and reliability.	Medium
General	Improve metrics documentation The counters are created without any attributes or description, which limits their usefulness for analysis. Add descriptions and default attributes to provide context for the metrics. agents-api/agents_api/telemetry.py [132-136] def _setup_metrics(self) -> None: meter = metrics.get_meter("julep.agents_api") - self.queries_counter = meter.create_counter("julep.queries.count") - self.workflows_counter = meter.create_counter("julep.workflows.count") - latency_hist = meter.create_histogram("julep.request_latency.seconds") + self.queries_counter = meter.create_counter( + "julep.queries.count", + description="Number of queries processed" + ) + self.workflows_counter = meter.create_counter( + "julep.workflows.count", + description="Number of workflows started" + ) + latency_hist = meter.create_histogram( + "julep.request_latency.seconds", + description="HTTP request latency in seconds" + ) [ ] Apply / Chat Suggestion importance[1-10]: 6 __ Why: Adding descriptions to metrics improves their clarity and usefulness for observability, but this is a moderate improvement rather than a critical fix.	Low
General	Remove confusing variable usage The code retrieves a trace ID but immediately deletes it with a comment indicating it's only for type checking. This is confusing and inefficient. Instead, directly check if tracing is active without creating unused variables. agents-api/agents_api/telemetry.py [140-141] `-start = trace.get_current_span().context.trace_id -del start # AIDEV-NOTE: trace id not needed; using for type check +# Check if tracing is active without creating unused variables +_ = trace.get_current_span().context.trace_id # Verify tracing is initialized` [ ] Apply / Chat Suggestion importance[1-10]: 4 __ Why: The suggestion removes an unnecessary variable assignment and deletion, making the code slightly clearer, but the impact is minor as this is a non-functional change.	Low
[ ] More

May 19 '25 11:05 qodo-merge-for-open-source[bot]

CI Feedback 🧐

(Feedback updated until commit https://github.com/julep-ai/julep/commit/14bcb2276bce2066f704be94b6a95c1dee5310da)

A test triggered by this PR failed. Here is an AI-generated analysis of the failure:

Action: Test

Failed stage: Run tests [❌]

Failure summary:

The action failed due to an import error in the Python code. Specifically, in the file
/home/runner/work/julep/julep/agents-api/agents_api/telemetry.py at line 17, the code is trying to
import OTLPMetricExporter from opentelemetry.sdk.metrics.export, but this name doesn't exist in that
module. The error suggests that perhaps MetricExporter was the intended import instead.

Relevant error logs:

1:  ##[group]Operating System
2:  Ubuntu
...

965:  prune-cache: true
966:  ignore-nothing-to-cache: false
967:  ##[endgroup]
968:  Downloading uv from "https://github.com/astral-sh/uv/releases/download/0.7.5/uv-x86_64-unknown-linux-gnu.tar.gz" ...
969:  [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /home/runner/work/_temp/9cb436d6-652d-4695-a027-0255c75f7e09 -f /home/runner/work/_temp/5bc7a123-9d6a-4f66-9949-bde67355889b
970:  Added /opt/hostedtoolcache/uv/0.7.5/x86_64 to the path
971:  Added /home/runner/.local/bin to the path
972:  Set UV_CACHE_DIR to /home/runner/work/_temp/setup-uv-cache
973:  Successfully installed uv version 0.7.5
974:  Searching files using cache dependency glob: **/uv.lock
975:  /home/runner/work/julep/julep/agents-api/uv.lock
976:  /home/runner/work/julep/julep/cli/uv.lock
977:  /home/runner/work/julep/julep/integrations-service/uv.lock
978:  Found 3 files to hash.
979:  Trying to restore uv cache from GitHub Actions cache with key: setup-uv-1-x86_64-unknown-linux-gnu-0.7.5-d92603d25acef1c08e643c37cc2475e5e190deb9690356b084828d60043a591f
980:  ##[warning]Failed to restore: Cache service responded with 422
981:  No GitHub Actions cache found for key: setup-uv-1-x86_64-unknown-linux-gnu-0.7.5-d92603d25acef1c08e643c37cc2475e5e190deb9690356b084828d60043a591f
...

1460:  from .protocol.tasks import ExecutionInput, StepContext, StepOutcome
1461:  File "/home/runner/work/julep/julep/agents-api/agents_api/common/protocol/tasks.py", line 27, in <module>
1462:  from ...activities.pg_query_step import pg_query_step
1463:  File "/home/runner/work/julep/julep/agents-api/agents_api/activities/pg_query_step.py", line 6, in <module>
1464:  from .. import queries
1465:  File "/home/runner/work/julep/julep/agents-api/agents_api/queries/__init__.py", line 9, in <module>
1466:  from . import agents as agents
1467:  File "/home/runner/work/julep/julep/agents-api/agents_api/queries/agents/__init__.py", line 13, in <module>
1468:  from .create_agent import create_agent
1469:  File "/home/runner/work/julep/julep/agents-api/agents_api/queries/agents/create_agent.py", line 15, in <module>
1470:  from ...metrics.counters import query_metrics
1471:  File "/home/runner/work/julep/julep/agents-api/agents_api/metrics/counters.py", line 8, in <module>
1472:  from ..telemetry import telemetry
1473:  File "/home/runner/work/julep/julep/agents-api/agents_api/telemetry.py", line 17, in <module>
1474:  from opentelemetry.sdk.metrics.export import (
1475:  ImportError: cannot import name 'OTLPMetricExporter' from 'opentelemetry.sdk.metrics.export' (/home/runner/work/julep/julep/agents-api/.venv/lib/python3.12/site-packages/opentelemetry/sdk/metrics/export/__init__.py). Did you mean: 'MetricExporter'?
1476:  ##[error]Process completed with exit code 1.
1477:  Post job cleanup.

May 19 '25 11:05 qodo-merge-for-open-source[bot]

julep julep copied to clipboard

Add tiered telemetry

User description

Summary

Testing

PR Type

Description

Changes walkthrough 📝

PR Reviewer Guide 🔍

PR Code Suggestions ✨

CI Feedback 🧐

(Feedback updated until commit https://github.com/julep-ai/julep/commit/14bcb2276bce2066f704be94b6a95c1dee5310da)

julep
julep copied to clipboard