[AI Evaluation] Failed to parse score for 'Groundedness' from the following evaluation response:
Description
Hi there! I'm hitting the same error with Ollama (local model), Gemini, and Amazon Bedrock. It occurs both when using CompositeEvaluator and when using GroundednessEvaluator directly (and with other evaluators).
Note: For Gemini and Amazon Bedrock I'm using Semantic Kernel's connectors.
The error appears in the metric's Diagnostics list.
Ollama
Here is an example using only Microsoft.Extensions.AI.* with a local Ollama instance: Microsoft.Extensions.AI.Evaluation.Tests.Ollama
Ollama Error details
Expected evaluationMetric.Interpretation?.Rating to be one of {EvaluationRating.Good {value: 5}, EvaluationRating.Exceptional {value: 6}}
because -------------------------------------
Failed: False
Reason:
Interpretation Reason:
Interpretation Rating: Inconclusive
Diagnostics Count: 1: Failed to parse score for 'Groundedness' from the following evaluation response:
Let's think step by step:
1. The CONTEXT provides information about the order ID (123) and the tracking code (TKG_ABC).
2. The QUERY is a direct question about the tracking for the order 123.
3. The RESPONSE directly answers the query by providing the tracking information for the order 123.
Explanation: The response is completely relevant to the context and query, providing the exact information requested. Therefore, the score should be [Groundedness: 5].
Score: 5
-------------------------------------
Query: What is the tracking for the order 123?
-------------------------------------
ChatResponse: OrderId is 123, Tracking code is TKG_ABC.
, but found EvaluationRating.Inconclusive {value: 1}.
at AwesomeAssertions.Execution.LateBoundTestFramework.Throw(String message)
at AwesomeAssertions.Execution.DefaultAssertionStrategy.HandleFailure(String message)
at AwesomeAssertions.Execution.AssertionScope.AddPreFormattedFailure(String formattedFailureMessage)
at AwesomeAssertions.Execution.AssertionChain.FailWith(Func`1 getFailureReason)
at AwesomeAssertions.Execution.AssertionChain.FailWith(Func`1 getFailureReason)
at AwesomeAssertions.Execution.AssertionChain.FailWith(String message, Object[] args)
at AwesomeAssertions.Primitives.EnumAssertions`2.BeOneOf(IEnumerable`1 validValues, String because, Object[] becauseArgs)
at Microsoft.Extensions.AI.Evaluation.Tests.Ollama.CompositeEvaluatorTests.CompositeEvaluatorWithGroundednessEvaluatorTest() in D:\Repositories\Microsoft.Extensions.AI.Evaluation.Tests\Microsoft.Extensions.AI.Evaluation.Tests.Ollama\CompositeEvaluatorTests.cs:line 55
--- End of stack trace from previous location ---
Gemini
Here is an example using Microsoft.Extensions.AI.* + Microsoft.SemanticKernel.Connectors.Google (currently in alpha) with Gemini: Microsoft.Extensions.AI.Evaluation.Tests.Gemini, which uses gemini-2.5-pro by default.
Note: I'm not sure whether the problem comes from Semantic Kernel's connector or from Microsoft.Extensions.AI.Evaluation.*
Gemini Error Details
Expected evaluationMetric.Interpretation?.Rating to be one of {EvaluationRating.Good {value: 5}, EvaluationRating.Exceptional {value: 6}}
because -------------------------------------
Failed: False
Reason:
Interpretation Reason:
Interpretation Rating: Inconclusive
Diagnostics Count: 1: Failed to parse score for 'Groundedness' from the following evaluation response:
<S0>Let's think step by step:
1. **Analyze the Query:** The user wants to know the tracking code for a specific order, "order 123".
2. **Analyze the Context:** The context provides two pieces of information: "OrderId is 123"
-------------------------------------
Query: What is the tracking for the order 123?
-------------------------------------
ChatResponse: OrderId is 123, Tracking code is TKG_ABC.
, but found EvaluationRating.Inconclusive {value: 1}.
at AwesomeAssertions.Execution.LateBoundTestFramework.Throw(String message)
at AwesomeAssertions.Execution.DefaultAssertionStrategy.HandleFailure(String message)
at AwesomeAssertions.Execution.AssertionScope.AddPreFormattedFailure(String formattedFailureMessage)
at AwesomeAssertions.Execution.AssertionChain.FailWith(Func`1 getFailureReason)
at AwesomeAssertions.Execution.AssertionChain.FailWith(Func`1 getFailureReason)
at AwesomeAssertions.Execution.AssertionChain.FailWith(String message, Object[] args)
at AwesomeAssertions.Primitives.EnumAssertions`2.BeOneOf(IEnumerable`1 validValues, String because, Object[] becauseArgs)
at Microsoft.Extensions.AI.Evaluation.Tests.Gemini.CompositeEvaluatorTests.CompositeEvaluatorWithGroundednessEvaluatorTest() in D:\Repositories\Microsoft.Extensions.AI.Evaluation.Tests\Microsoft.Extensions.AI.Evaluation.Tests.Gemini\CompositeEvaluatorTests.cs:line 75
--- End of stack trace from previous location ---
Reproduction Steps
Ollama
As mentioned in the Ollama README:
- Run Ollama through Docker:
  docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
- Pull the llama2 model:
  docker exec -it ollama ollama pull llama2
- Run the tests
Gemini
As mentioned in the Gemini README:
  dotnet user-secrets init --project ./Microsoft.Extensions.AI.Evaluation.Tests.Gemini/Microsoft.Extensions.AI.Evaluation.Tests.Gemini.csproj
  dotnet user-secrets set "GeminiApiKey" "<your_gemini_key>" --project ./Microsoft.Extensions.AI.Evaluation.Tests.Gemini/Microsoft.Extensions.AI.Evaluation.Tests.Gemini.csproj
- Run the tests
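The final "Run the tests" step in both procedures could look roughly like the sketch below. It assumes the tests are run with `dotnet test` from the repository root in a POSIX shell. The test name is taken from the stack traces above; the helper name `run_groundedness_test` is hypothetical, and the Ollama `.csproj` path is assumed by analogy with the Gemini one.

```shell
# Hypothetical helper: runs only the failing groundedness test for one provider.
# Assumption: each test project follows the naming pattern of the Gemini project
# shown above (./<Project>/<Project>.csproj).
run_groundedness_test() {
  provider="$1"   # "Ollama" or "Gemini"
  project_name="Microsoft.Extensions.AI.Evaluation.Tests.${provider}"
  project="./${project_name}/${project_name}.csproj"

  # --filter restricts the run to tests whose fully qualified name
  # contains the given text.
  dotnet test "$project" \
    --filter "FullyQualifiedName~CompositeEvaluatorWithGroundednessEvaluatorTest"
}
```

For example, `run_groundedness_test Ollama` after the container is up and the model is pulled, or `run_groundedness_test Gemini` after the user secret is set.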
Expected behavior
The score, interpretation, etc. should be parsed correctly from the evaluation response.
Actual behavior
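The score cannot be parsed: the metric's rating is Inconclusive, and a "Failed to parse score for 'Groundedness'" entry is added to its Diagnostics.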
Regression?
No response
Known Workarounds
No response
Configuration
- Windows
- .NET 9 SDK
Other information
No response