[AI Evaluation] Failed to parse score for 'Groundedness' from the following evaluation response:

Open henriqueholtz opened this issue 6 months ago • 4 comments

Description

Hi there! I'm hitting a similar error with Ollama (local model), Gemini, and Amazon Bedrock. It occurs both with CompositeEvaluator and when using GroundednessEvaluator (and other evaluators) directly.

Note: For Gemini and Amazon Bedrock I'm using Semantic Kernel's connectors.

The error appears in the Diagnostics list.

Ollama

Here is an example using only Microsoft.Extensions.AI.* with a local Ollama instance: Microsoft.Extensions.AI.Evaluation.Tests.Ollama

Ollama Error details

Expected evaluationMetric.Interpretation?.Rating to be one of {EvaluationRating.Good {value: 5}, EvaluationRating.Exceptional {value: 6}}
because -------------------------------------
Failed: False
Reason:
Interpretation Reason:
Interpretation Rating: Inconclusive
Diagnostics Count: 1: Failed to parse score for 'Groundedness' from the following evaluation response:
Let's think step by step:

1. The CONTEXT provides information about the order ID (123) and the tracking code (TKG_ABC).
2. The QUERY is a direct question about the tracking for the order 123.
3. The RESPONSE directly answers the query by providing the tracking information for the order 123.

Explanation: The response is completely relevant to the context and query, providing the exact information requested. Therefore, the score should be [Groundedness: 5].

Score: 5

-------------------------------------
Query: What is the tracking for the order 123?

-------------------------------------
ChatResponse: OrderId is 123, Tracking code is TKG_ABC.
, but found EvaluationRating.Inconclusive {value: 1}.
   at AwesomeAssertions.Execution.LateBoundTestFramework.Throw(String message)
   at AwesomeAssertions.Execution.DefaultAssertionStrategy.HandleFailure(String message)
   at AwesomeAssertions.Execution.AssertionScope.AddPreFormattedFailure(String formattedFailureMessage)
   at AwesomeAssertions.Execution.AssertionChain.FailWith(Func`1 getFailureReason)
   at AwesomeAssertions.Execution.AssertionChain.FailWith(Func`1 getFailureReason)
   at AwesomeAssertions.Execution.AssertionChain.FailWith(String message, Object[] args)
   at AwesomeAssertions.Primitives.EnumAssertions`2.BeOneOf(IEnumerable`1 validValues, String because, Object[] becauseArgs)
   at Microsoft.Extensions.AI.Evaluation.Tests.Ollama.CompositeEvaluatorTests.CompositeEvaluatorWithGroundednessEvaluatorTest() in D:\Repositories\Microsoft.Extensions.AI.Evaluation.Tests\Microsoft.Extensions.AI.Evaluation.Tests.Ollama\CompositeEvaluatorTests.cs:line 55
--- End of stack trace from previous location ---

Gemini

Here is an example using Microsoft.Extensions.AI.* + Microsoft.SemanticKernel.Connectors.Google (currently in alpha) with Gemini: Microsoft.Extensions.AI.Evaluation.Tests.Gemini, which uses gemini-2.5-pro by default.

Note: I'm not sure whether the problem comes from Semantic Kernel's connector or from Microsoft.Extensions.AI.Evaluation.*

Gemini Error Details

Expected evaluationMetric.Interpretation?.Rating to be one of {EvaluationRating.Good {value: 5}, EvaluationRating.Exceptional {value: 6}}
because -------------------------------------
Failed: False
Reason:
Interpretation Reason:
Interpretation Rating: Inconclusive
Diagnostics Count: 1: Failed to parse score for 'Groundedness' from the following evaluation response:
<S0>Let's think step by step:
1.  **Analyze the Query:** The user wants to know the tracking code for a specific order, "order 123".
2.  **Analyze the Context:** The context provides two pieces of information: "OrderId is 123"

-------------------------------------
Query: What is the tracking for the order 123?

-------------------------------------
ChatResponse: OrderId is 123, Tracking code is TKG_ABC.
, but found EvaluationRating.Inconclusive {value: 1}.
   at AwesomeAssertions.Execution.LateBoundTestFramework.Throw(String message)
   at AwesomeAssertions.Execution.DefaultAssertionStrategy.HandleFailure(String message)
   at AwesomeAssertions.Execution.AssertionScope.AddPreFormattedFailure(String formattedFailureMessage)
   at AwesomeAssertions.Execution.AssertionChain.FailWith(Func`1 getFailureReason)
   at AwesomeAssertions.Execution.AssertionChain.FailWith(Func`1 getFailureReason)
   at AwesomeAssertions.Execution.AssertionChain.FailWith(String message, Object[] args)
   at AwesomeAssertions.Primitives.EnumAssertions`2.BeOneOf(IEnumerable`1 validValues, String because, Object[] becauseArgs)
   at Microsoft.Extensions.AI.Evaluation.Tests.Gemini.CompositeEvaluatorTests.CompositeEvaluatorWithGroundednessEvaluatorTest() in D:\Repositories\Microsoft.Extensions.AI.Evaluation.Tests\Microsoft.Extensions.AI.Evaluation.Tests.Gemini\CompositeEvaluatorTests.cs:line 75
--- End of stack trace from previous location ---

Reproduction Steps

Ollama

As mentioned in the Ollama README:

1. Run Ollama through Docker: docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
2. Pull the llama2 model: docker exec -it ollama ollama pull llama2
3. Run the tests
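
Before running the tests, it can help to confirm that the container is serving and that llama2 was actually pulled. The following is a minimal sketch, assuming Ollama's GET /api/tags endpoint on the default port 11434 used in the docker command above. has_model is a hypothetical helper name, and it does a plain-text match rather than real JSON parsing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the JSON returned by Ollama's GET /api/tags on stdin
# and succeeds if a model with the given name (with any tag) is listed.
# This is a plain-text match, not a JSON parser.
has_model() {
  local name="$1"
  grep -Eq "\"name\":[[:space:]]*\"${name}(:[^\"]*)?\""
}

# Usage against the container started above (assumes port 11434):
#   curl -fsS http://localhost:11434/api/tags | has_model llama2 \
#     && echo "llama2 is available" || echo "llama2 is missing"
```

If the check reports the model as missing, the pull step probably did not complete, and the evaluation calls would fail for a different reason than the parsing error reported here.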

Gemini

As mentioned in the Gemini README:

1. Initialize user secrets: dotnet user-secrets init --project ./Microsoft.Extensions.AI.Evaluation.Tests.Gemini/Microsoft.Extensions.AI.Evaluation.Tests.Gemini.csproj
2. Set the Gemini API key: dotnet user-secrets set "GeminiApiKey" "<your_gemini_key>" --project ./Microsoft.Extensions.AI.Evaluation.Tests.Gemini/Microsoft.Extensions.AI.Evaluation.Tests.Gemini.csproj
3. Run the tests

Expected behavior

The score, interpretation, etc. should be parsed correctly.

Actual behavior

The score cannot be parsed. The metric's rating is Inconclusive, and a "Failed to parse score" entry is added to Diagnostics.
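
The two diagnostics quoted in this issue look different: the llama2 reply ends with a plain "Score: 5" line, while the Gemini reply opens <S0> and then stops mid-sentence. The sketch below is one way to tell these cases apart when triaging saved judge responses. It assumes, without confirmation from the library source, that the evaluator expects the score inside <S2>...</S2> tags on a single line. classify_response and its labels are hypothetical and not part of Microsoft.Extensions.AI.Evaluation.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper; not part of Microsoft.Extensions.AI.Evaluation.
# Assumption: the evaluator expects the score wrapped as <S2>N</S2> on one line.
classify_response() {
  local response="$1"
  if grep -Eq '<S2>[[:space:]]*[0-9]+[[:space:]]*</S2>' <<<"$response"; then
    echo "tagged-score"      # the form assumed to be parseable
  elif grep -q '<S0>' <<<"$response" && ! grep -q '</S0>' <<<"$response"; then
    echo "truncated"         # reasoning tag opened but never closed
  elif grep -Eq '^[[:space:]]*Score:[[:space:]]*[0-9]+' <<<"$response"; then
    echo "untagged-score"    # a score is present, but not inside tags
  else
    echo "no-score"
  fi
}

# Abbreviated from the Ollama diagnostic quoted in this issue.
ollama_reply=$'Let\'s think step by step:\n\nScore: 5'
# Abbreviated from the Gemini diagnostic quoted in this issue.
gemini_reply=$'<S0>Let\'s think step by step:\n1.  **Analyze the Query:**'

classify_response "$ollama_reply"   # prints: untagged-score
classify_response "$gemini_reply"   # prints: truncated
```

An "untagged-score" result would point at the model not following the requested output format, whereas "truncated" would point at the response being cut off before the score was emitted, for example by an output token limit. Neither cause is confirmed here.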

Regression?

No response

Known Workarounds

No response

Configuration

Windows
.NET 9 SDK

Other information

No response

Jul 02 '25 15:07 henriqueholtz