julep icon indicating copy to clipboard operation
julep copied to clipboard

Add tiered telemetry

Open creatorrr opened this issue 6 months ago • 3 comments

User description

Summary

  • add OpenTelemetry-based telemetry module for agents-api
  • init telemetry in web app startup
  • capture workflow metrics in Temporal client
  • increment query metrics via telemetry counters
  • include OTel dependencies

Testing

  • ruff format .
  • ruff check . (auto-fixed)
  • pyright agents_api (fails: many missing import errors)
  • ward (fails: command not found)

PR Type

Enhancement


Description

  • Introduce OpenTelemetry-based telemetry module for agents-api

    • Add telemetry.py for tiered telemetry setup and management
    • Support configuration via environment and YAML config
    • Integrate crash, metrics, and latency monitoring
  • Integrate telemetry into workflow and query metrics

    • Increment telemetry counters in Temporal workflow and query handlers
  • Initialize telemetry during FastAPI app startup

  • Add OpenTelemetry dependencies to project requirements


Changes walkthrough 📝

Relevant files
Enhancement
telemetry.py
Add OpenTelemetry-based telemetry module with tiered levels

agents-api/agents_api/telemetry.py

  • Add new module for OpenTelemetry-based telemetry management
  • Support tiered telemetry levels and persistent instance ID
  • Integrate crash, metrics, and latency monitoring middleware
  • Provide initialization function for FastAPI app
  • +155/-0 
    web.py
    Initialize telemetry in FastAPI app startup                           

    agents-api/agents_api/web.py

    • Import and initialize telemetry during FastAPI app startup
    +3/-0     
    temporal.py
    Integrate telemetry workflow counter in Temporal client   

    agents-api/agents_api/clients/temporal.py

  • Import telemetry module
  • Increment workflow counter after starting workflow if telemetry
    enabled
  • +7/-1     
    counters.py
    Integrate telemetry queries counter in metrics decorators

    agents-api/agents_api/metrics/counters.py

  • Import telemetry module
  • Increment queries counter in metric decorators if telemetry enabled
  • +6/-0     
    Dependencies
    pyproject.toml
    Add OpenTelemetry dependencies to project requirements     

    agents-api/pyproject.toml

  • Add OpenTelemetry SDK, exporter, and FastAPI instrumentation to
    dependencies
  • +3/-0     

    Need help?
  • Type /help how to ... in the comments thread for any questions about Qodo Merge usage.
  • Check out the documentation for more information.

  • [!IMPORTANT] Add OpenTelemetry-based telemetry module to agents-api, capturing workflow and query metrics, and update dependencies.

    • Telemetry Module:
      • Adds telemetry.py for OpenTelemetry-based telemetry in agents-api.
      • Initializes telemetry in web.py during app startup with init_telemetry().
    • Metrics:
      • Captures workflow metrics in run_task_execution_workflow() in temporal.py.
      • Increments query metrics in counters.py using telemetry counters.
    • Dependencies:
      • Adds opentelemetry-sdk, opentelemetry-exporter-otlp, and opentelemetry-instrumentation-fastapi to pyproject.toml.

    This description was created by Ellipsis for a63b63683cec2ee45bc1b38f461dda39704fb4b6. You can customize this summary. It will automatically update as commits are pushed.

    creatorrr avatar May 19 '25 11:05 creatorrr

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Error Handling

    The _load_config function doesn't handle YAML parsing errors, which could cause the application to crash if the config file is malformed.

    def _load_config() -> dict:
        if CONFIG_PATH.exists():
            with open(CONFIG_PATH) as f:
                return yaml.safe_load(f) or {}
        return {}
    
    Resource Leak

    The latency_middleware function creates a trace span but doesn't properly clean it up, potentially causing resource leaks in high-traffic scenarios.

    @self.app.middleware("http")
    async def latency_middleware(request: Request, call_next):
        start = trace.get_current_span().context.trace_id
        del start  # AIDEV-NOTE: trace id not needed; using for type check
        start_time = time.monotonic()
        response = await call_next(request)
        duration = time.monotonic() - start_time
        latency_hist.record(duration, {"route": request.url.path})
        return response
    
    Configuration Validation

    The telemetry level is cast to int without validation, which could cause crashes if non-integer values are provided in the environment variable or config file.

    def _determine_level(self) -> tuple[int, str]:
        level_env = os.getenv("JULEP_TELEMETRY")
        config = _load_config()
        if level_env is not None:
            level = int(level_env)
        else:
            level = int(config.get("telemetry", DEFAULT_LEVEL))
    

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Add error handling

    The code attempts to convert potentially non-integer values to integers without
    error handling. If the environment variable or config value contains non-integer
    data, this will crash. Add error handling to gracefully handle invalid values.

    agents-api/agents_api/telemetry.py [56-62]

     def _determine_level(self) -> tuple[int, str]:
         level_env = os.getenv("JULEP_TELEMETRY")
         config = _load_config()
    -    if level_env is not None:
    -        level = int(level_env)
    -    else:
    -        level = int(config.get("telemetry", DEFAULT_LEVEL))
    +    try:
    +        if level_env is not None:
    +            level = int(level_env)
    +        else:
    +            level = int(config.get("telemetry", DEFAULT_LEVEL))
    +    except (ValueError, TypeError):
    +        logger.warning("Invalid telemetry level, using default")
    +        level = DEFAULT_LEVEL
    
    • [ ] Apply / Chat
    Suggestion importance[1-10]: 8

    __

    Why: Adding error handling for parsing the telemetry level prevents potential crashes from invalid configuration, which is important for robustness and reliability.

    Medium
    General
    Improve metrics documentation

    The counters are created without any attributes or description, which limits
    their usefulness for analysis. Add descriptions and default attributes to
    provide context for the metrics.

    agents-api/agents_api/telemetry.py [132-136]

     def _setup_metrics(self) -> None:
         meter = metrics.get_meter("julep.agents_api")
    -    self.queries_counter = meter.create_counter("julep.queries.count")
    -    self.workflows_counter = meter.create_counter("julep.workflows.count")
    -    latency_hist = meter.create_histogram("julep.request_latency.seconds")
    +    self.queries_counter = meter.create_counter(
    +        "julep.queries.count", 
    +        description="Number of queries processed"
    +    )
    +    self.workflows_counter = meter.create_counter(
    +        "julep.workflows.count", 
    +        description="Number of workflows started"
    +    )
    +    latency_hist = meter.create_histogram(
    +        "julep.request_latency.seconds",
    +        description="HTTP request latency in seconds"
    +    )
    
    • [ ] Apply / Chat
    Suggestion importance[1-10]: 6

    __

    Why: Adding descriptions to metrics improves their clarity and usefulness for observability, but this is a moderate improvement rather than a critical fix.

    Low
    Remove confusing variable usage

    The code retrieves a trace ID but immediately deletes it with a comment
    indicating it's only for type checking. This is confusing and inefficient.
    Instead, directly check if tracing is active without creating unused variables.

    agents-api/agents_api/telemetry.py [140-141]

    -start = trace.get_current_span().context.trace_id
    -del start  # AIDEV-NOTE: trace id not needed; using for type check
    +# Check if tracing is active without creating unused variables
    +_ = trace.get_current_span().context.trace_id  # Verify tracing is initialized
    
    • [ ] Apply / Chat
    Suggestion importance[1-10]: 4

    __

    Why: The suggestion removes an unnecessary variable assignment and deletion, making the code slightly clearer, but the impact is minor as this is a non-functional change.

    Low
    • [ ] More

    CI Feedback 🧐

    (Feedback updated until commit https://github.com/julep-ai/julep/commit/14bcb2276bce2066f704be94b6a95c1dee5310da)

    A test triggered by this PR failed. Here is an AI-generated analysis of the failure:

    Action: Test

    Failed stage: Run tests [❌]

    Failure summary:

    The action failed due to an import error in the Python code. Specifically, in the file
    /home/runner/work/julep/julep/agents-api/agents_api/telemetry.py at line 17, the code is trying to
    import OTLPMetricExporter from opentelemetry.sdk.metrics.export, but this name doesn't exist in that
    module. The error suggests that perhaps MetricExporter was the intended import instead.

    Relevant error logs:
    1:  ##[group]Operating System
    2:  Ubuntu
    ...
    
    965:  prune-cache: true
    966:  ignore-nothing-to-cache: false
    967:  ##[endgroup]
    968:  Downloading uv from "https://github.com/astral-sh/uv/releases/download/0.7.5/uv-x86_64-unknown-linux-gnu.tar.gz" ...
    969:  [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /home/runner/work/_temp/9cb436d6-652d-4695-a027-0255c75f7e09 -f /home/runner/work/_temp/5bc7a123-9d6a-4f66-9949-bde67355889b
    970:  Added /opt/hostedtoolcache/uv/0.7.5/x86_64 to the path
    971:  Added /home/runner/.local/bin to the path
    972:  Set UV_CACHE_DIR to /home/runner/work/_temp/setup-uv-cache
    973:  Successfully installed uv version 0.7.5
    974:  Searching files using cache dependency glob: **/uv.lock
    975:  /home/runner/work/julep/julep/agents-api/uv.lock
    976:  /home/runner/work/julep/julep/cli/uv.lock
    977:  /home/runner/work/julep/julep/integrations-service/uv.lock
    978:  Found 3 files to hash.
    979:  Trying to restore uv cache from GitHub Actions cache with key: setup-uv-1-x86_64-unknown-linux-gnu-0.7.5-d92603d25acef1c08e643c37cc2475e5e190deb9690356b084828d60043a591f
    980:  ##[warning]Failed to restore: Cache service responded with 422
    981:  No GitHub Actions cache found for key: setup-uv-1-x86_64-unknown-linux-gnu-0.7.5-d92603d25acef1c08e643c37cc2475e5e190deb9690356b084828d60043a591f
    ...
    
    1460:  from .protocol.tasks import ExecutionInput, StepContext, StepOutcome
    1461:  File "/home/runner/work/julep/julep/agents-api/agents_api/common/protocol/tasks.py", line 27, in <module>
    1462:  from ...activities.pg_query_step import pg_query_step
    1463:  File "/home/runner/work/julep/julep/agents-api/agents_api/activities/pg_query_step.py", line 6, in <module>
    1464:  from .. import queries
    1465:  File "/home/runner/work/julep/julep/agents-api/agents_api/queries/__init__.py", line 9, in <module>
    1466:  from . import agents as agents
    1467:  File "/home/runner/work/julep/julep/agents-api/agents_api/queries/agents/__init__.py", line 13, in <module>
    1468:  from .create_agent import create_agent
    1469:  File "/home/runner/work/julep/julep/agents-api/agents_api/queries/agents/create_agent.py", line 15, in <module>
    1470:  from ...metrics.counters import query_metrics
    1471:  File "/home/runner/work/julep/julep/agents-api/agents_api/metrics/counters.py", line 8, in <module>
    1472:  from ..telemetry import telemetry
    1473:  File "/home/runner/work/julep/julep/agents-api/agents_api/telemetry.py", line 17, in <module>
    1474:  from opentelemetry.sdk.metrics.export import (
    1475:  ImportError: cannot import name 'OTLPMetricExporter' from 'opentelemetry.sdk.metrics.export' (/home/runner/work/julep/julep/agents-api/.venv/lib/python3.12/site-packages/opentelemetry/sdk/metrics/export/__init__.py). Did you mean: 'MetricExporter'?
    1476:  ##[error]Process completed with exit code 1.
    1477:  Post job cleanup.