
Add GenAI evaluation to GenAI visualizer

Copilot opened this issue 2 months ago • 6 comments

Description

Implements parsing of gen_ai.evaluation.result events per OpenTelemetry GenAI semantic conventions. Evaluations are now captured from telemetry and exposed via GenAIVisualizerDialogViewModel.Evaluations.

Changes:

  • Model classes

    • EvaluationResultViewModel: UI model exposing Name, ScoreLabel, ScoreValue, Explanation, ResponseId, ErrorType
  • Parsing logic in GenAIVisualizerDialogViewModel

    • Parses evaluations from both log entries and span events
    • Reads evaluation fields directly from attributes (no JSON deserialization)
    • Added helper method ParseDouble for attribute value parsing
    • Graceful error handling prevents parsing failures from breaking view model creation (a parsing sketch follows the example telemetry below)
  • Constants in GenAIHelpers

    • Added attribute name constants for evaluation fields per the OpenTelemetry specification (sketched after the supported-attributes list below)
  • Tests: 5 new tests covering various evaluation scenarios (log entries, span events, minimal data, mixed sources)

  • Stress Playground Testing

    • Added /genai-evaluations API endpoint in Stress.ApiService that creates a GenAI activity with 5 evaluation events (a sketch follows this list)
    • Demonstrates various evaluation scenarios: Relevance (high score), IntentResolution (correct), Coherence (pass), Factuality (fail with error.type), and Fluency (minimal data)
    • Added "Gen AI evaluations" command in Stress.AppHost to invoke the API from the dashboard

Example telemetry:

{
  "name": "gen_ai.evaluation.result",
  "attributes": {
    "gen_ai.evaluation.name": "Relevance",
    "gen_ai.evaluation.score.label": "relevant",
    "gen_ai.evaluation.score.value": "0.85",
    "gen_ai.evaluation.explanation": "The response is factually accurate and addresses the question.",
    "gen_ai.response.id": "chatcmpl-123"
  }
}
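
A minimal sketch of how an event like this could be mapped onto the view model. The attribute names and the ParseDouble idea come from the description above; the dictionary shape, method names, and everything else are assumptions:

using System.Collections.Generic;
using System.Globalization;

// Sketch of the UI model described above.
public sealed class EvaluationResultViewModel
{
    public required string Name { get; init; }
    public string? ScoreLabel { get; init; }
    public double? ScoreValue { get; init; }
    public string? Explanation { get; init; }
    public string? ResponseId { get; init; }
    public string? ErrorType { get; init; }
}

public static class EvaluationParser
{
    // Returns null when the required gen_ai.evaluation.name attribute is
    // missing, so a malformed event never breaks view model creation.
    public static EvaluationResultViewModel? TryParse(
        IReadOnlyDictionary<string, string> attributes)
    {
        if (!attributes.TryGetValue("gen_ai.evaluation.name", out var name))
        {
            return null;
        }

        return new EvaluationResultViewModel
        {
            Name = name,
            ScoreLabel = attributes.GetValueOrDefault("gen_ai.evaluation.score.label"),
            ScoreValue = ParseDouble(attributes.GetValueOrDefault("gen_ai.evaluation.score.value")),
            Explanation = attributes.GetValueOrDefault("gen_ai.evaluation.explanation"),
            ResponseId = attributes.GetValueOrDefault("gen_ai.response.id"),
            ErrorType = attributes.GetValueOrDefault("error.type")
        };
    }

    // Invariant-culture parse that tolerates missing or malformed values.
    private static double? ParseDouble(string? value) =>
        double.TryParse(value, NumberStyles.Float, CultureInfo.InvariantCulture, out var result)
            ? result
            : null;
}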

Supported attributes per OpenTelemetry specification:

  • gen_ai.evaluation.name (required)
  • gen_ai.evaluation.score.label (conditionally required)
  • gen_ai.evaluation.score.value (conditionally required)
  • gen_ai.evaluation.explanation (recommended)
  • gen_ai.response.id (recommended)
  • error.type (conditionally required)

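The constants added to GenAIHelpers presumably mirror this list. A sketch of that shape (the member names are guesses; the string values are the spec names):

internal static class GenAIHelpers
{
    // Attribute names from the OpenTelemetry GenAI semantic conventions.
    public const string GenAIEvaluationName = "gen_ai.evaluation.name";
    public const string GenAIEvaluationScoreLabel = "gen_ai.evaluation.score.label";
    public const string GenAIEvaluationScoreValue = "gen_ai.evaluation.score.value";
    public const string GenAIEvaluationExplanation = "gen_ai.evaluation.explanation";
    public const string GenAIResponseId = "gen_ai.response.id";
    public const string ErrorType = "error.type";
}
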
Checklist

  • Is this feature complete?
    • [x] Yes. Ready to ship.
    • [ ] No. Follow-up changes expected.
  • Are you including unit tests for the changes and scenario tests if relevant?
    • [x] Yes
    • [ ] No
  • Did you add public API?
    • [x] Yes
      • If yes, did you have an API Review for it?
        • [ ] Yes
        • [x] No (internal model classes only)
      • Did you add <remarks /> and <code /> elements on your triple slash comments?
        • [ ] Yes
        • [x] No (minimal XML docs for internal classes)
    • [ ] No
  • Does the change make any security assumptions or guarantees?
    • [ ] Yes
    • [x] No
  • Does the change require an update in our Aspire docs?
    • [ ] Yes
    • [x] No (backend-only change, UI integration pending)
Original prompt

Parse genai evaluations. Information is available here: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-events/#event-eventgen_aievaluationresult

Should use System.Text.Json and source generation in a new context.

Add evaluations to a collection on GenAIVisualizerDialogViewModel



Copilot · Nov 14 '25 04:11

@copilot Remove new usage of System.Text.Json for parsing evaluations. The evaluation fields all come from attributes.

JamesNK · Nov 14 '25 05:11

@copilot Remove new usage of System.Text.Json for parsing evaluations. The evaluation fields all come from attributes.

Done. Removed EvaluationResultEvent class and JSON deserialization. Evaluation fields are now parsed directly from attributes (gen_ai.evaluation.result.name, gen_ai.evaluation.result.value, etc.) following the OpenTelemetry specification. All tests updated and passing. (b421b43)

Copilot · Nov 14 '25 05:11

@copilot Update unit tests to use the new properties with the correct attributes according to the specification. Here are the attributes:

Attributes:

| Key | Stability | Requirement Level | Value Type | Description | Example Values |
|-----|-----------|-------------------|------------|-------------|----------------|
| gen_ai.evaluation.name | Development | Required | string | The name of the evaluation metric used for the GenAI response. | Relevance; IntentResolution |
| error.type | Stable | Conditionally Required: if the operation ended in an error | string | Describes a class of error the operation ended with. [1] | timeout; java.net.UnknownHostException; server_certificate_invalid; 500 |
| gen_ai.evaluation.score.label | Development | Conditionally Required: if applicable | string | Human readable label for evaluation. [2] | relevant; not_relevant; correct; incorrect; pass; fail |
| gen_ai.evaluation.score.value | Development | Conditionally Required: if applicable | double | The evaluation score returned by the evaluator. | 4.0 |
| gen_ai.evaluation.explanation | Development | Recommended | string | A free-form explanation for the assigned score provided by the evaluator. | The response is factually accurate but lacks sufficient detail to fully address the question. |
| gen_ai.response.id | Development | Recommended: when available | string | The unique identifier for the completion. [3] | chatcmpl-123 |

JamesNK · Nov 14 '25 06:11

@copilot Update unit tests to use the new properties with the correct attributes according to the specification. Here are the attributes:

Attributes:...

Updated to use the correct OpenTelemetry attribute names per specification:

  • gen_ai.evaluation.name (required)
  • gen_ai.evaluation.score.label (conditionally required)
  • gen_ai.evaluation.score.value (conditionally required)
  • gen_ai.evaluation.explanation (recommended)
  • gen_ai.response.id (recommended)
  • error.type (conditionally required)

Updated EvaluationResultViewModel properties and all tests to match. All 19 GenAI tests passing. (fa8b07e)

Copilot · Nov 14 '25 06:11

@copilot In the stress playground app, add a /genai-evaluations API call to Stress.ApiService. The API should add a new GenAI activity with evaluations.

In Stress.AppHost, add a command to call the new API

JamesNK · Nov 14 '25 06:11

@copilot In the stress playground app, add a /genai-evaluations API call to Stress.ApiService. The API should add a new GenAI activity with evaluations.

In Stress.AppHost, add a command to call the n...

Added /genai-evaluations API endpoint in Stress.ApiService that creates a GenAI activity with 5 evaluation events demonstrating various scenarios (Relevance, IntentResolution, Coherence, Factuality with error.type, and minimal Fluency evaluation).

Also added "Gen AI evaluations" command in Stress.AppHost to call the new API. (f1e608e)

Copilot · Nov 14 '25 06:11

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 12962

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 12962"

github-actions[bot] · Nov 17 '25 03:11

@copilot Move hardcoded text added in this PR in Aspire.Dashboard to resource files.

JamesNK · Nov 17 '25 03:11

@copilot Move hardcoded text added in this PR in Aspire.Dashboard to resource files.

Moved hardcoded text to resource files:

  • Added GenAIEvaluationsTabText resource for "Evaluations" tab label
  • Replaced hardcoded "n/a" strings with existing GenAINAText resource
  • All translation files (.xlf) automatically updated

(aba0368)

Copilot · Nov 17 '25 04:11