[AI Evaluation] Failed to parse score for 'Groundedness' from the following evaluation response:

Open henriqueholtz opened this issue 6 months ago • 4 comments

Description

Hi there! I'm hitting a similar error with Ollama (local model), Gemini, and Amazon Bedrock. It occurs both with CompositeEvaluator and when using GroundednessEvaluator directly (and with other evaluators as well).

Note: Via Gemini and Amazon Bedrock I'm using Semantic Kernel's connectors.

The error appears in the Diagnostics list.

Ollama

Here is an example that uses only Microsoft.Extensions.AI.* with a local Ollama instance: Microsoft.Extensions.AI.Evaluation.Tests.Ollama

Ollama Error details

Expected evaluationMetric.Interpretation?.Rating to be one of {EvaluationRating.Good {value: 5}, EvaluationRating.Exceptional {value: 6}}
because -------------------------------------
Failed: False
Reason:
Interpretation Reason:
Interpretation Rating: Inconclusive
Diagnostics Count: 1: Failed to parse score for 'Groundedness' from the following evaluation response:
Let's think step by step:

1. The CONTEXT provides information about the order ID (123) and the tracking code (TKG_ABC).
2. The QUERY is a direct question about the tracking for the order 123.
3. The RESPONSE directly answers the query by providing the tracking information for the order 123.

Explanation: The response is completely relevant to the context and query, providing the exact information requested. Therefore, the score should be [Groundedness: 5].

Score: 5

-------------------------------------
Query: What is the tracking for the order 123?

-------------------------------------
ChatResponse: OrderId is 123, Tracking code is TKG_ABC.
, but found EvaluationRating.Inconclusive {value: 1}.
   at AwesomeAssertions.Execution.LateBoundTestFramework.Throw(String message)
   at AwesomeAssertions.Execution.DefaultAssertionStrategy.HandleFailure(String message)
   at AwesomeAssertions.Execution.AssertionScope.AddPreFormattedFailure(String formattedFailureMessage)
   at AwesomeAssertions.Execution.AssertionChain.FailWith(Func`1 getFailureReason)
   at AwesomeAssertions.Execution.AssertionChain.FailWith(Func`1 getFailureReason)
   at AwesomeAssertions.Execution.AssertionChain.FailWith(String message, Object[] args)
   at AwesomeAssertions.Primitives.EnumAssertions`2.BeOneOf(IEnumerable`1 validValues, String because, Object[] becauseArgs)
   at Microsoft.Extensions.AI.Evaluation.Tests.Ollama.CompositeEvaluatorTests.CompositeEvaluatorWithGroundednessEvaluatorTest() in D:\Repositories\Microsoft.Extensions.AI.Evaluation.Tests\Microsoft.Extensions.AI.Evaluation.Tests.Ollama\CompositeEvaluatorTests.cs:line 55
--- End of stack trace from previous location ---

Gemini

Here is an example using Microsoft.Extensions.AI.* together with Microsoft.SemanticKernel.Connectors.Google (currently an alpha release) and Gemini: Microsoft.Extensions.AI.Evaluation.Tests.Gemini, which uses gemini-2.5-pro by default.

Note: I'm not sure whether the problem comes from Semantic Kernel's connector or from Microsoft.Extensions.AI.Evaluation.*

Gemini Error Details

Expected evaluationMetric.Interpretation?.Rating to be one of {EvaluationRating.Good {value: 5}, EvaluationRating.Exceptional {value: 6}}
because -------------------------------------
Failed: False
Reason:
Interpretation Reason:
Interpretation Rating: Inconclusive
Diagnostics Count: 1: Failed to parse score for 'Groundedness' from the following evaluation response:
<S0>Let's think step by step:
1.  **Analyze the Query:** The user wants to know the tracking code for a specific order, "order 123".
2.  **Analyze the Context:** The context provides two pieces of information: "OrderId is 123"

-------------------------------------
Query: What is the tracking for the order 123?

-------------------------------------
ChatResponse: OrderId is 123, Tracking code is TKG_ABC.
, but found EvaluationRating.Inconclusive {value: 1}.
   at AwesomeAssertions.Execution.LateBoundTestFramework.Throw(String message)
   at AwesomeAssertions.Execution.DefaultAssertionStrategy.HandleFailure(String message)
   at AwesomeAssertions.Execution.AssertionScope.AddPreFormattedFailure(String formattedFailureMessage)
   at AwesomeAssertions.Execution.AssertionChain.FailWith(Func`1 getFailureReason)
   at AwesomeAssertions.Execution.AssertionChain.FailWith(Func`1 getFailureReason)
   at AwesomeAssertions.Execution.AssertionChain.FailWith(String message, Object[] args)
   at AwesomeAssertions.Primitives.EnumAssertions`2.BeOneOf(IEnumerable`1 validValues, String because, Object[] becauseArgs)
   at Microsoft.Extensions.AI.Evaluation.Tests.Gemini.CompositeEvaluatorTests.CompositeEvaluatorWithGroundednessEvaluatorTest() in D:\Repositories\Microsoft.Extensions.AI.Evaluation.Tests\Microsoft.Extensions.AI.Evaluation.Tests.Gemini\CompositeEvaluatorTests.cs:line 75
--- End of stack trace from previous location ---
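
One possible reading of the two responses above, offered as an assumption rather than a confirmed diagnosis: the evaluator may expect the score inside a tag such as `<S2>...</S2>` (suggested by the `<S0>` that Gemini emits), in which case neither response contains anything to parse. The sketch below only illustrates that idea. The tag format and the name `extract_tagged_score` are hypothetical, it is not the library's parser, and Python is used purely for brevity.

```python
import re
from typing import Optional

# Assumed tag format: <S2>score</S2>. Inferred from the <S0> seen in the Gemini
# output above; NOT taken from the library's source code.
SCORE_TAG = re.compile(r"<S2>\s*(\d+)\s*</S2>", re.IGNORECASE)


def extract_tagged_score(response: str) -> Optional[int]:
    """Return the integer inside <S2>...</S2>, or None when no such tag exists."""
    match = SCORE_TAG.search(response)
    return int(match.group(1)) if match else None


# Excerpts copied from the diagnostics quoted in this issue.
ollama_response = (
    "Explanation: The response is completely relevant to the context and query, "
    "providing the exact information requested. Therefore, the score should be "
    "[Groundedness: 5].\n\nScore: 5\n"
)
gemini_response = (
    "<S0>Let's think step by step:\n"
    "1.  **Analyze the Query:** The user wants to know the tracking code for a "
    'specific order, "order 123".\n'
)

print(extract_tagged_score(ollama_response))  # None: the score is there, but untagged
print(extract_tagged_score(gemini_response))  # None: response ends before any score
print(extract_tagged_score("<S0>...</S0><S1>...</S1><S2>5</S2>"))  # 5
```

If this reading is right, the two failures differ: llama2 ignores the expected output format, while the Gemini response appears to be cut off before it reaches a score.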


Reproduction Steps

Ollama

As mentioned in the Ollama README:

  • Run Ollama through Docker: docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
  • Pull the llama2 model: docker exec -it ollama ollama pull llama2
  • Run the tests

Gemini

As mentioned in the Gemini README:

  1. dotnet user-secrets init --project ./Microsoft.Extensions.AI.Evaluation.Tests.Gemini/Microsoft.Extensions.AI.Evaluation.Tests.Gemini.csproj
  2. dotnet user-secrets set "GeminiApiKey" "<your_gemini_key>" --project ./Microsoft.Extensions.AI.Evaluation.Tests.Gemini/Microsoft.Extensions.AI.Evaluation.Tests.Gemini.csproj
  3. Run the tests

Expected behavior

The score, interpretation, etc. should be parsed correctly.
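
To make "parsed correctly" concrete for the Ollama response, here is a rough sketch of a more lenient parser. It is an illustration under assumptions, not a proposal for the library's actual implementation: the `<S2>` tag format is assumed, the name `parse_score_leniently` is hypothetical, and Python is used only for brevity.

```python
import re
from typing import Optional

# Assumed primary format: <S2>score</S2> (not confirmed from the library's source).
TAGGED = re.compile(r"<S2>\s*(\d+)\s*</S2>", re.IGNORECASE)
# Fallback: a standalone "Score: N" line, as produced by llama2 in this issue.
PLAIN = re.compile(r"^\s*Score:\s*(\d+)\s*$", re.IGNORECASE | re.MULTILINE)


def parse_score_leniently(response: str) -> Optional[int]:
    """Try the tagged form first, then a plain 'Score: N' line; None if neither matches."""
    for pattern in (TAGGED, PLAIN):
        match = pattern.search(response)
        if match:
            return int(match.group(1))
    return None


# Tail of the Ollama response quoted in this issue.
ollama_tail = "Therefore, the score should be [Groundedness: 5].\n\nScore: 5\n"
# Start of the (truncated) Gemini response quoted in this issue.
gemini_head = "<S0>Let's think step by step:\n1.  **Analyze the Query:** ..."

print(parse_score_leniently(ollama_tail))  # 5
print(parse_score_leniently(gemini_head))  # None
```

A fallback like this would only help with the Ollama case. The Gemini response quoted above contains no score at all, so no parser could recover one from it.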

Actual behavior

The score cannot be parsed, and the rating is reported as Inconclusive.

Regression?

No response

Known Workarounds

No response

Configuration

  • Windows
  • .NET 9 SDK

Other information

No response

henriqueholtz, Jul 02 '25 15:07