dd-trace-py

feat(llmobs): add boolean eval

Open jpgarc97 opened this issue 5 months ago • 3 comments

Add boolean metric type support and standardize error handling in LLMObs

This PR adds support for boolean metric types in LLMObs evaluation metrics and standardizes error handling across all metric types:

New Features:

  1. Added boolean metric type support:
    • Users can now submit boolean values alongside existing categorical and score metrics
    • Added tests to verify boolean metric submission works correctly
    • Added tests to verify type validation for boolean values

Consistency Improvements:

  2. Standardized error handling across all metric types:
    • All metric type mismatches now consistently raise TypeError (instead of a mix of warnings and errors)
    • Categorical metrics: value must be a string, otherwise TypeError is raised
    • Score metrics: value must be an integer or float, otherwise TypeError is raised
    • Boolean metrics: value must be a boolean, otherwise TypeError is raised
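
The validation rules listed above can be sketched roughly as follows. This is an illustrative, standalone helper, not the code in ddtrace/llmobs/_llmobs.py: the function name `validate_metric_value`, the error messages, and the `ValueError` for unknown metric types are all assumptions. The PR also does not say whether a `bool` is accepted as a score; because `bool` is a subclass of `int` in Python, this sketch assumes it is rejected.

```python
from typing import Union

MetricValue = Union[str, int, float, bool]


def validate_metric_value(metric_type: str, value: MetricValue) -> None:
    """Raise TypeError when value does not match metric_type.

    Hypothetical helper for illustration only; not the ddtrace implementation.
    """
    if metric_type == "categorical":
        if not isinstance(value, str):
            raise TypeError("value must be a string for a categorical metric.")
    elif metric_type == "score":
        # bool is a subclass of int, so it is excluded explicitly.
        # Assumption: the PR does not state how booleans are treated here.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError("value must be an integer or float for a score metric.")
    elif metric_type == "boolean":
        if not isinstance(value, bool):
            raise TypeError("value must be a boolean for a boolean metric.")
    else:
        # Assumption: unknown metric types are a separate kind of error.
        raise ValueError("metric_type must be 'categorical', 'score', or 'boolean'.")
```

With a helper like this, every mismatch surfaces the same way to the caller (an exception), rather than some paths logging a warning and silently dropping the metric.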

This addresses the consistency concern raised in review comments and provides a better developer experience with clear error messages.

Note: Some existing tests may need updating: they expected warnings for validation errors, but will now see TypeError exceptions.
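
A minimal sketch of what such a test update might look like, using the standard library's unittest. The function `submit_boolean_metric` and the `boolean_value` key are hypothetical stand-ins, not the actual LLMObs API or the tests in tests/llmobs/test_llmobs_service.py.

```python
import unittest


def submit_boolean_metric(value):
    """Hypothetical stand-in for the SDK call; rejects non-boolean values."""
    if not isinstance(value, bool):
        raise TypeError("value must be a boolean for a boolean metric.")
    # Hypothetical payload shape, for illustration only.
    return {"metric_type": "boolean", "boolean_value": value}


class BooleanMetricTest(unittest.TestCase):
    def test_valid_value_is_accepted(self):
        self.assertEqual(submit_boolean_metric(True)["boolean_value"], True)

    def test_type_mismatch_raises(self):
        # A test written against the old behavior might have asserted on a
        # logged warning; after this change it asserts on the exception.
        with self.assertRaises(TypeError):
            submit_boolean_metric("true")
```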

Checklist

  • [x] PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • [x] Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

jpgarc97 • Jun 18 '25 19:06

CODEOWNERS have been resolved as:

releasenotes/notes/add-boolean-metric-type-llmobs-4f7a9b2c1d3e.yaml     @DataDog/apm-python
ddtrace/llmobs/_llmobs.py                                               @DataDog/ml-observability
ddtrace/llmobs/_telemetry.py                                            @DataDog/ml-observability
ddtrace/llmobs/_writer.py                                               @DataDog/ml-observability
tests/llmobs/_utils.py                                                  @DataDog/ml-observability
tests/llmobs/test_llmobs_service.py                                     @DataDog/ml-observability

github-actions[bot] • Jun 18 '25 20:06

Bootstrap import analysis

Comparison of import times between this PR and base.

Summary

The average import time from this PR is: 272 ± 2 ms.

The average import time from base is: 272 ± 3 ms.

The import time difference between this PR and base is: -0.6 ± 0.1 ms.

Import time breakdown

The following import paths have shrunk:

ddtrace.auto 1.829 ms (0.67%)
ddtrace.bootstrap.sitecustomize 1.167 ms (0.43%)
ddtrace.bootstrap.preload 1.167 ms (0.43%)
ddtrace.internal.remoteconfig.client 0.610 ms (0.22%)
ddtrace 0.661 ms (0.24%)
ddtrace.internal._unpatched 0.028 ms (0.01%)
json 0.028 ms (0.01%)
json.decoder 0.028 ms (0.01%)
re 0.028 ms (0.01%)
enum 0.028 ms (0.01%)
types 0.028 ms (0.01%)

github-actions[bot] • Jun 18 '25 20:06

Benchmarks

Benchmark execution time: 2025-07-03 14:47:59

Comparing candidate commit 773707dae3cbd3d0c6e7d4f6db6f32b788a2fc59 in PR branch juan.garcia/MLOB-2715/boolean-value with baseline commit 427c82b807ad5e3e4857577df2675808e101420d in branch main.

Found 0 performance improvements and 2 performance regressions! Performance is the same for 544 metrics, 4 unstable metrics.

scenario:iastaspectsospath-ospathbasename_aspect

  • 🟥 execution_time [+445.191ns; +681.858ns] or [+10.489%; +16.065%]

scenario:iastaspectsospath-ospathjoin_aspect

  • 🟥 execution_time [+0.955µs; +1.054µs] or [+15.560%; +17.168%]

pr-commenter[bot] • Jun 18 '25 20:06