vscode-ai-toolkit
KeyError: 'model' when running an evaluation with a custom LLM-based evaluator
I created a custom LLM-based evaluator and tried to run an evaluation in the Agent Builder using only that evaluator. I got the following error message:
```
[INFO] Could not import AIAgentConverter. Please install the dependency with `pip install azure-ai-projects`.
Traceback (most recent call last):
  File "c:\Users\apspeigh\.vscode\extensions\ms-windows-ai-studio.windows-ai-studio-0.26.3-win32-x64\resources\evaluation\run_eval.py", line 267, in <module>
    result = run_evaluation(args)
             ^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\apspeigh\.vscode\extensions\ms-windows-ai-studio.windows-ai-studio-0.26.3-win32-x64\resources\evaluation\run_eval.py", line 153, in run_evaluation
    evaluators = [get_evaluator(evaluator, default_model_config) for evaluator in config["evaluators"]]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\apspeigh\.vscode\extensions\ms-windows-ai-studio.windows-ai-studio-0.26.3-win32-x64\resources\evaluation\run_eval.py", line 153, in <listcomp>
    evaluators = [get_evaluator(evaluator, default_model_config) for evaluator in config["evaluators"]]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\apspeigh\.vscode\extensions\ms-windows-ai-studio.windows-ai-studio-0.26.3-win32-x64\resources\evaluation\run_eval.py", line 120, in get_evaluator
    custom_evaluator_models[evaluator_name] = model_config["model"] or model_config["azure_deployment"]
                                              ~~~~~~~~~~~~^^^^^^^^^
KeyError: 'model'
```
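For context, the failing line indexes the model config dict directly, so `model_config["model"]` raises `KeyError` before the `or` fallback to `azure_deployment` is ever evaluated. A minimal sketch of the failure mode and a defensive rewrite (the config keys mirror the traceback; the exact shape of the extension's model config is an assumption):

```python
# Hypothetical Azure model config: it carries "azure_deployment" but no "model" key.
azure_model_config = {"azure_deployment": "gpt-4o"}

# Direct indexing raises KeyError before the `or` fallback can run:
try:
    name = azure_model_config["model"] or azure_model_config["azure_deployment"]
except KeyError as exc:
    print(f"KeyError: {exc}")  # KeyError: 'model'

# dict.get() returns None for a missing key instead of raising,
# so the fallback to "azure_deployment" works as intended:
name = azure_model_config.get("model") or azure_model_config.get("azure_deployment")
print(name)  # gpt-4o
```

This is only an illustration of why Azure configs trip the lookup, not the actual fix shipped in the extension.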
Here's the information that I used to create the evaluator:
**Name:** Structure Compliance
**Description:** How well the response adheres to the required output format and section order specified in the agent’s system prompt.
**Type:** LLM-based
**Prompt example:** Coherence example
**Prompt:**
system:
# Definition
Structure Compliance refers to how well the response adheres to the required output format, section order, and greeting rules specified in the agent’s system prompt. The Sales Insights Assistant must begin the first response in a session with a one-sentence time-of-day greeting (e.g., “Good morning”, “Good afternoon”, “Good evening”, or “Hello”), followed—on a new line—by the required structured sections below. Subsequent responses in the same session omit the greeting.
{Greeting based on time of day — first message in a session only}
Account Activity Summary: {concise paraphrase of the tool output}
Recommendation: {1–3 clear, actionable next steps}
Optional Follow-up Draft: {only if draft_followup_messages was invoked}
**Score 1:** The response completely ignores the required structure. It does not include the expected sections or greeting when it should, or the sections are mislabeled or out of order, making the response hard to interpret.
**Score 3:** The response partially follows the structure but is inconsistent—for example, missing one section, using incorrect labels, including irrelevant sections, or greeting in the wrong place (e.g., repeating the greeting mid-session).
**Score 5:** The response fully complies with the required format. If it is the first response of a session, it includes a correct time-of-day greeting followed by properly labeled and ordered sections. In follow-up turns, it omits the greeting and maintains the correct section order.
The examples below show the Structure Compliance score with reasons for a question and a correct answer.
**Example 1**
question: What’s the latest on Acme Corp?
response: Acme Corp had a meeting last week. They seem interested. You should follow up next week.
output: {"score": 1, "reason": "The response ignores the required structure. It lacks the greeting and labeled sections and does not follow the prescribed format."}
**Example 2**
question: What’s the latest on Acme Corp?
response:
Good afternoon!
Account Activity Summary: Acme Corp met with our team last week to discuss pricing and requested a follow-up demo.
Recommendation: Schedule a demo early next week and confirm decision-makers’ availability.
Optional Follow-up Draft: Hi [Name], thanks again for your time last week. Following up to schedule a quick demo to review your feedback.
output: {"score": 5, "reason": "The response fully complies with the required format. It begins with the correct time-of-day greeting and includes all labeled sections in the proper order, aligning with structure guidelines."}
Here is the actual conversation to be scored:
question: {{query}}
predicted answer: {{response}}
output:
Hi @aprilgittens, this issue occurs when running an evaluation with an Azure model; we will fix it in the next release. For now, you can try using GitHub Models to run the evaluation.
@QinghuiMeng-M noted, thanks!