# Add GenAI evaluation to GenAI visualizer

## Description

Implements parsing of `gen_ai.evaluation.result` events per the OpenTelemetry GenAI semantic conventions. Evaluations are now captured from telemetry and exposed via `GenAIVisualizerDialogViewModel.Evaluations`.
Changes:

- **Model classes**
  - `EvaluationResultViewModel`: UI model exposing `Name`, `ScoreLabel`, `ScoreValue`, `Explanation`, `ResponseId`, and `ErrorType`
- **Parsing logic in `GenAIVisualizerDialogViewModel`**
  - Parses evaluations from both log entries and span events
  - Reads evaluation fields directly from attributes (no JSON deserialization)
  - Added helper method `ParseDouble` for attribute value parsing (see the parsing sketch after the example telemetry below)
  - Graceful error handling prevents parsing failures from breaking view model creation
- **Constants in `GenAIHelpers`**
  - Added attribute name constants for evaluation fields per the OpenTelemetry specification
- **Tests**: 5 new tests covering various evaluation scenarios (log entries, span events, minimal data, mixed sources)
- **Stress playground testing**
  - Added a `/genai-evaluations` API endpoint in `Stress.ApiService` that creates a GenAI activity with 5 evaluation events (a producer-side sketch follows the supported-attributes list below)
  - Demonstrates various evaluation scenarios: Relevance (high score), IntentResolution (correct), Coherence (pass), Factuality (fail with `error.type`), and Fluency (minimal data)
  - Added a "Gen AI evaluations" command in `Stress.AppHost` to invoke the API from the dashboard
Example telemetry:
```json
{
  "name": "gen_ai.evaluation.result",
  "attributes": {
    "gen_ai.evaluation.name": "Relevance",
    "gen_ai.evaluation.score.label": "relevant",
    "gen_ai.evaluation.score.value": 0.85,
    "gen_ai.evaluation.explanation": "The response is factually accurate and addresses the question.",
    "gen_ai.response.id": "chatcmpl-123"
  }
}
```
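For illustration, here is a minimal, hypothetical sketch of the attribute-based parsing described above, assuming attributes arrive as a plain string dictionary. The real code lives in `GenAIVisualizerDialogViewModel` and reads the dashboard's telemetry models; everything below other than the OTel attribute keys and the `EvaluationResultViewModel`/`ParseDouble` names is an assumption, not the PR's actual implementation.

```csharp
using System.Collections.Generic;
using System.Globalization;

public sealed class EvaluationResultViewModel
{
    public required string Name { get; init; }
    public string? ScoreLabel { get; init; }
    public double? ScoreValue { get; init; }
    public string? Explanation { get; init; }
    public string? ResponseId { get; init; }
    public string? ErrorType { get; init; }
}

public static class EvaluationParser
{
    public static EvaluationResultViewModel? TryParse(IReadOnlyDictionary<string, string> attributes)
    {
        // gen_ai.evaluation.name is the only required attribute; skip the
        // event instead of failing so one bad event cannot break the dialog.
        if (!attributes.TryGetValue("gen_ai.evaluation.name", out var name))
        {
            return null;
        }

        return new EvaluationResultViewModel
        {
            Name = name,
            ScoreLabel = attributes.GetValueOrDefault("gen_ai.evaluation.score.label"),
            ScoreValue = ParseDouble(attributes.GetValueOrDefault("gen_ai.evaluation.score.value")),
            Explanation = attributes.GetValueOrDefault("gen_ai.evaluation.explanation"),
            ResponseId = attributes.GetValueOrDefault("gen_ai.response.id"),
            ErrorType = attributes.GetValueOrDefault("error.type")
        };
    }

    // Mirrors the PR's ParseDouble helper in spirit: tolerate malformed
    // values instead of throwing.
    private static double? ParseDouble(string? value) =>
        double.TryParse(value, NumberStyles.Float, CultureInfo.InvariantCulture, out var result)
            ? result
            : null;
}
```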
Supported attributes per the OpenTelemetry specification:

- `gen_ai.evaluation.name` (required)
- `gen_ai.evaluation.score.label` (conditionally required)
- `gen_ai.evaluation.score.value` (conditionally required)
- `gen_ai.evaluation.explanation` (recommended)
- `gen_ai.response.id` (recommended)
- `error.type` (conditionally required)
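For context on the producer side, here is a hedged sketch of emitting one such event with the standard `System.Diagnostics` APIs; the activity source and span names are made up, and the stress playground endpoint may construct its events differently.

```csharp
using System.Diagnostics;

// A listener (e.g. the OpenTelemetry SDK) must be registered for the
// activity to actually be created and exported; without one,
// StartActivity returns null and the event is silently skipped.
var source = new ActivitySource("Stress.ApiService.GenAI");

using var activity = source.StartActivity("chat gpt-4o");

activity?.AddEvent(new ActivityEvent(
    "gen_ai.evaluation.result",
    tags: new ActivityTagsCollection
    {
        ["gen_ai.evaluation.name"] = "Relevance",
        ["gen_ai.evaluation.score.label"] = "relevant",
        ["gen_ai.evaluation.score.value"] = 0.85, // double, per the spec
        ["gen_ai.evaluation.explanation"] = "The response addresses the question.",
        ["gen_ai.response.id"] = "chatcmpl-123"
    }));
```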
## Checklist

- Is this feature complete?
  - [x] Yes. Ready to ship.
  - [ ] No. Follow-up changes expected.
- Are you including unit tests for the changes and scenario tests if relevant?
  - [x] Yes
  - [ ] No
- Did you add public API?
  - [x] Yes
  - If yes, did you have an API Review for it?
    - [ ] Yes
    - [x] No (internal model classes only)
- Did you add `<remarks />` and `<code />` elements on your triple slash comments?
  - [ ] Yes
  - [x] No (minimal XML docs for internal classes)
  - If yes, did you have an API Review for it?
    - [ ] No
    - [x] Yes
- Does the change make any security assumptions or guarantees?
  - [ ] Yes
  - [x] No
- Does the change require an update in our Aspire docs?
  - [ ] Yes
  - [x] No (backend-only change, UI integration pending)
## Original prompt

Parse GenAI evaluations. Information is available here: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-events/#event-eventgen_aievaluationresult

Should use System.Text.Json and source generation in a new context.

Add evaluations to a collection on GenAIVisualizerDialogViewModel
@copilot Remove new usage of System.Text.Json for parsing evaluations. The evaluation fields all come from attributes.
Done. Removed the `EvaluationResultEvent` class and JSON deserialization. Evaluation fields are now parsed directly from attributes (`gen_ai.evaluation.result.name`, `gen_ai.evaluation.result.value`, etc.) following the OpenTelemetry specification. All tests updated and passing. (b421b43)
@copilot Update unit tests to use the new properties with the correct attributes according to the specification. Here are the attributes:

Attributes:

| Key | Requirement Level | Value Type | Description | Example Values |
|---|---|---|---|---|
| `gen_ai.evaluation.name` | Required | string | The name of the evaluation metric used for the GenAI response. | `Relevance`; `IntentResolution` |
| `error.type` | Conditionally Required if the operation ended in an error | string | Describes a class of error the operation ended with. [1] | `timeout`; `java.net.UnknownHostException`; `server_certificate_invalid`; `500` |
| `gen_ai.evaluation.score.label` | Conditionally Required if applicable | string | Human readable label for evaluation. [2] | `relevant`; `not_relevant`; `correct`; `incorrect`; `pass`; `fail` |
| `gen_ai.evaluation.score.value` | Conditionally Required if applicable | double | The evaluation score returned by the evaluator. | `4.0` |
| `gen_ai.evaluation.explanation` | Recommended | string | A free-form explanation for the assigned score provided by the evaluator. | The response is factually accurate but lacks sufficient detail to fully address the question. |
| `gen_ai.response.id` | Recommended when available | string | The unique identifier for the completion. [3] | `chatcmpl-123` |
Updated to use the correct OpenTelemetry attribute names per the specification:

- `gen_ai.evaluation.name` (required)
- `gen_ai.evaluation.score.label` (conditionally required)
- `gen_ai.evaluation.score.value` (conditionally required)
- `gen_ai.evaluation.explanation` (recommended)
- `gen_ai.response.id` (recommended)
- `error.type` (conditionally required)

Updated `EvaluationResultViewModel` properties and all tests to match. All 19 GenAI tests passing. (fa8b07e)
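As a rough illustration of what such a test can look like, here is a hypothetical xUnit test reusing the `EvaluationParser` sketch from the description above; the PR's real tests construct the dashboard's span-event and log-entry models rather than a plain dictionary.

```csharp
using System.Collections.Generic;
using Xunit;

public class EvaluationParserTests
{
    [Fact]
    public void TryParse_SpecAttributes_PopulatesViewModel()
    {
        var attributes = new Dictionary<string, string>
        {
            ["gen_ai.evaluation.name"] = "Relevance",
            ["gen_ai.evaluation.score.label"] = "relevant",
            ["gen_ai.evaluation.score.value"] = "0.85",
            ["gen_ai.evaluation.explanation"] = "Addresses the question.",
            ["gen_ai.response.id"] = "chatcmpl-123"
        };

        var vm = EvaluationParser.TryParse(attributes);

        Assert.NotNull(vm);
        Assert.Equal("Relevance", vm!.Name);
        Assert.Equal("relevant", vm.ScoreLabel);
        Assert.Equal(0.85, vm.ScoreValue);
        Assert.Equal("chatcmpl-123", vm.ResponseId);
        Assert.Null(vm.ErrorType);
    }
}
```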
@copilot In the stress playground app, add a `/genai-evaluations` API call to `Stress.ApiService`. The API should add a new GenAI activity with evaluations.

In `Stress.AppHost`, add a command to call the new API.
Added a `/genai-evaluations` API endpoint in `Stress.ApiService` that creates a GenAI activity with 5 evaluation events demonstrating various scenarios (Relevance, IntentResolution, Coherence, Factuality with `error.type`, and a minimal Fluency evaluation).

Also added a "Gen AI evaluations" command in `Stress.AppHost` to call the new API. (f1e608e)
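A hedged sketch of how such a command can be wired up in an AppHost, assuming Aspire's `WithCommand` resource-command API; the resource name, endpoint name, and URL handling below are illustrative and may not match the PR's code.

```csharp
using Aspire.Hosting; // DistributedApplication, WithCommand, CommandResults

var builder = DistributedApplication.CreateBuilder(args);

// Hypothetical project resource standing in for Stress.ApiService.
var api = builder.AddProject<Projects.Stress_ApiService>("stress-apiservice");

// Dashboard command that calls the new endpoint when invoked.
api.WithCommand(
    name: "genai-evaluations",
    displayName: "Gen AI evaluations",
    executeCommand: async context =>
    {
        using var client = new HttpClient();
        // Resolve the service's HTTP endpoint at run time and call the API.
        var url = $"{api.GetEndpoint("http").Url}/genai-evaluations";
        var response = await client.GetAsync(url, context.CancellationToken);
        return response.IsSuccessStatusCode
            ? CommandResults.Success()
            : CommandResults.Failure($"Request failed: {response.StatusCode}");
    });

builder.Build().Run();
```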
🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

```sh
curl -fsSL https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 12962
```

Or run remotely in PowerShell:

```powershell
iex "& { $(irm https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 12962"
```
@copilot Move hardcoded text in this PR in Aspire.Dashboard to resource files.
Moved hardcoded text to resource files:

- Added a `GenAIEvaluationsTabText` resource for the "Evaluations" tab label
- Replaced hardcoded "n/a" strings with the existing `GenAINAText` resource
- All translation files (.xlf) automatically updated

(aba0368)
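For context, consuming such a resource typically goes through `IStringLocalizer`, the standard ASP.NET Core localization API for .resx files; the class and resource-type names below are assumptions for illustration, not code from the PR.

```csharp
using Microsoft.Extensions.Localization;

// Illustrative consumer: resolves the localized tab label from a
// generated resource class (here assumed to be Resources.Dialogs).
public sealed class EvaluationsTabLabel(IStringLocalizer<Aspire.Dashboard.Resources.Dialogs> localizer)
{
    public string Text => localizer["GenAIEvaluationsTabText"];
}
```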