[Bug] New to DSPy: using Llama 3.2 with ReAct, 2 + 2 = 5
What happened?
I am new to DSPy. I have written a small DSPy program. The program calls evaluate_math five times and ends up with the result 5. I don't understand why a tool would be called multiple times and its result re-interpreted.
$ python llama_tool.py
2 + 2
2 + 3
2 + 3
2 + 3
2 + 3
5.0
Cheers, Andrew
Steps to reproduce
import dspy

def evaluate_math(expression: str) -> float:
    print(expression)  # log each expression the model asks the tool to evaluate
    return eval(expression)

lm = dspy.LM('ollama_chat/llama3.2:3b', api_base='http://localhost:11434')
dspy.configure(lm=lm)

math_tool = dspy.Tool(
    name="evaluate_math",
    desc="Evaluates a mathematical expression.",
    func=evaluate_math,
    args={
        "expression": {
            "type": "string",
            "description": "Mathematical expression to evaluate"
        }
    }
)

react_module = dspy.ReAct(
    "question -> answer: float",
    tools=[math_tool],
    # max_iters=1
)

# Execute the module with a question that requires the tool
response = react_module(question="What is 2 + 2?")
print(response.answer)  # Expected output: 4.0
DSPy version
2.6.13
@andrewfr Thanks for reporting the issue! You can quickly check whether this is a DSPy issue, a prompt issue, or an LM issue by pulling up the history:
dspy.inspect_history(n=5)
Running the command above will show you the prompts sent to the LM and its responses.
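For example, right after running the module (reusing react_module from your script):

response = react_module(question="What is 2 + 2?")
dspy.inspect_history(n=5)  # prints the prompts and raw responses of the last 5 LM calls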
Thanks for the advice. I think the immediate source of the problem is the inclusion of the question mark "?": the example works when it is omitted. I would consider this a bug. Perhaps the "?" is being misinterpreted?
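For reference, this is the minimal check I ran (reusing react_module from the script above):

response = react_module(question="What is 2 + 2")  # trailing "?" dropped
print(response.answer)  # now prints 4.0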
Here is the result of inspect_history:
2 + 2
2 + 3
2 + 3
2 + 3
2 + 3
5.0
[2025-03-28T12:00:56.357706]
System message:
Your input fields are:
1. `question` (str)
2. `trajectory` (str)
Your output fields are:
1. `next_thought` (str)
2. `next_tool_name` (Literal['evaluate_math', 'finish'])
3. `next_tool_args` (dict[str, Any])
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## question ## ]]
{question}
[[ ## trajectory ## ]]
{trajectory}
[[ ## next_thought ## ]]
{next_thought}
[[ ## next_tool_name ## ]]
{next_tool_name} # note: the value you produce must exactly match (no extra characters) one of: evaluate_math; finish
[[ ## next_tool_args ## ]]
{next_tool_args} # note: the value you produce must adhere to the JSON schema: {"type": "object"}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
Given the fields `question`, produce the fields `answer`.
You will be given `question` and your goal is to finish with `answer`.
To do this, you will interleave Thought, Tool Name, and Tool Args, and receive a resulting Observation.
Thought can reason about the current situation, and Tool Name can be the following types:
(1) evaluate_math, whose description is <desc>Evaluates a mathematical expression.</desc>. It takes arguments {'expression': {'type': 'string', 'description': 'Mathematical expression to evaluate'}} in JSON format.
(2) finish, whose description is <desc>Signals that the final outputs, i.e. `answer`, are now available and marks the task as complete.</desc>. It takes arguments {'kwargs': 'Any'} in JSON format.
User message:
[[ ## question ## ]]
What is 2 + 2?
[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
What is 2 + 2?
[[ ## tool_name_0 ## ]]
evaluate_math
[[ ## tool_args_0 ## ]]
{"expression": "2 + 2"}
[[ ## observation_0 ## ]]
4
Respond with the corresponding output fields, starting with the field `[[ ## next_thought ## ]]`, then `[[ ## next_tool_name ## ]]` (must be formatted as a valid Python Literal['evaluate_math', 'finish']), then `[[ ## next_tool_args ## ]]` (must be formatted as a valid Python dict[str, Any]), and then ending with the marker for `[[ ## completed ## ]]`.
Response:
[[ ## next_thought ## ]]
What is 4?
[[ ## next_tool_name ## ]]
evaluate_math
[[ ## next_tool_args ## ]]
{"expression": "2 + 3"}
[[ ## completed ## ]]
[2025-03-28T12:00:56.364382]
System message:
Your input fields are:
1. `question` (str)
2. `trajectory` (str)
Your output fields are:
1. `next_thought` (str)
2. `next_tool_name` (Literal['evaluate_math', 'finish'])
3. `next_tool_args` (dict[str, Any])
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## question ## ]]
{question}
[[ ## trajectory ## ]]
{trajectory}
[[ ## next_thought ## ]]
{next_thought}
[[ ## next_tool_name ## ]]
{next_tool_name} # note: the value you produce must exactly match (no extra characters) one of: evaluate_math; finish
[[ ## next_tool_args ## ]]
{next_tool_args} # note: the value you produce must adhere to the JSON schema: {"type": "object"}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
Given the fields `question`, produce the fields `answer`.
You will be given `question` and your goal is to finish with `answer`.
To do this, you will interleave Thought, Tool Name, and Tool Args, and receive a resulting Observation.
Thought can reason about the current situation, and Tool Name can be the following types:
(1) evaluate_math, whose description is <desc>Evaluates a mathematical expression.</desc>. It takes arguments {'expression': {'type': 'string', 'description': 'Mathematical expression to evaluate'}} in JSON format.
(2) finish, whose description is <desc>Signals that the final outputs, i.e. `answer`, are now available and marks the task as complete.</desc>. It takes arguments {'kwargs': 'Any'} in JSON format.
User message:
[[ ## question ## ]]
What is 2 + 2?
[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
What is 2 + 2?
[[ ## tool_name_0 ## ]]
evaluate_math
[[ ## tool_args_0 ## ]]
{"expression": "2 + 2"}
[[ ## observation_0 ## ]]
4
[[ ## thought_1 ## ]]
What is 4?
[[ ## tool_name_1 ## ]]
evaluate_math
[[ ## tool_args_1 ## ]]
{"expression": "2 + 3"}
[[ ## observation_1 ## ]]
5
Respond with the corresponding output fields, starting with the field `[[ ## next_thought ## ]]`, then `[[ ## next_tool_name ## ]]` (must be formatted as a valid Python Literal['evaluate_math', 'finish']), then `[[ ## next_tool_args ## ]]` (must be formatted as a valid Python dict[str, Any]), and then ending with the marker for `[[ ## completed ## ]]`.
Response:
[[ ## next_thought ## ]]
What is 4?
[[ ## next_tool_name ## ]]
evaluate_math
[[ ## next_tool_args ## ]]
{"expression": "2 + 3"}
[[ ## completed ## ]]
[2025-03-28T12:00:56.372805]
System message:
Your input fields are:
1. `question` (str)
2. `trajectory` (str)
Your output fields are:
1. `next_thought` (str)
2. `next_tool_name` (Literal['evaluate_math', 'finish'])
3. `next_tool_args` (dict[str, Any])
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## question ## ]]
{question}
[[ ## trajectory ## ]]
{trajectory}
[[ ## next_thought ## ]]
{next_thought}
[[ ## next_tool_name ## ]]
{next_tool_name} # note: the value you produce must exactly match (no extra characters) one of: evaluate_math; finish
[[ ## next_tool_args ## ]]
{next_tool_args} # note: the value you produce must adhere to the JSON schema: {"type": "object"}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
Given the fields `question`, produce the fields `answer`.
You will be given `question` and your goal is to finish with `answer`.
To do this, you will interleave Thought, Tool Name, and Tool Args, and receive a resulting Observation.
Thought can reason about the current situation, and Tool Name can be the following types:
(1) evaluate_math, whose description is <desc>Evaluates a mathematical expression.</desc>. It takes arguments {'expression': {'type': 'string', 'description': 'Mathematical expression to evaluate'}} in JSON format.
(2) finish, whose description is <desc>Signals that the final outputs, i.e. `answer`, are now available and marks the task as complete.</desc>. It takes arguments {'kwargs': 'Any'} in JSON format.
User message:
[[ ## question ## ]]
What is 2 + 2?
[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
What is 2 + 2?
[[ ## tool_name_0 ## ]]
evaluate_math
[[ ## tool_args_0 ## ]]
{"expression": "2 + 2"}
[[ ## observation_0 ## ]]
4
[[ ## thought_1 ## ]]
What is 4?
[[ ## tool_name_1 ## ]]
evaluate_math
[[ ## tool_args_1 ## ]]
{"expression": "2 + 3"}
[[ ## observation_1 ## ]]
5
[[ ## thought_2 ## ]]
What is 4?
[[ ## tool_name_2 ## ]]
evaluate_math
[[ ## tool_args_2 ## ]]
{"expression": "2 + 3"}
[[ ## observation_2 ## ]]
5
Respond with the corresponding output fields, starting with the field `[[ ## next_thought ## ]]`, then `[[ ## next_tool_name ## ]]` (must be formatted as a valid Python Literal['evaluate_math', 'finish']), then `[[ ## next_tool_args ## ]]` (must be formatted as a valid Python dict[str, Any]), and then ending with the marker for `[[ ## completed ## ]]`.
Response:
[[ ## next_thought ## ]]
What is 4?
[[ ## next_tool_name ## ]]
evaluate_math
[[ ## next_tool_args ## ]]
{"expression": "2 + 3"}
[[ ## completed ## ]]
[2025-03-28T12:00:56.383920]
System message:
Your input fields are:
1. `question` (str)
2. `trajectory` (str)
Your output fields are:
1. `next_thought` (str)
2. `next_tool_name` (Literal['evaluate_math', 'finish'])
3. `next_tool_args` (dict[str, Any])
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## question ## ]]
{question}
[[ ## trajectory ## ]]
{trajectory}
[[ ## next_thought ## ]]
{next_thought}
[[ ## next_tool_name ## ]]
{next_tool_name} # note: the value you produce must exactly match (no extra characters) one of: evaluate_math; finish
[[ ## next_tool_args ## ]]
{next_tool_args} # note: the value you produce must adhere to the JSON schema: {"type": "object"}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
Given the fields `question`, produce the fields `answer`.
You will be given `question` and your goal is to finish with `answer`.
To do this, you will interleave Thought, Tool Name, and Tool Args, and receive a resulting Observation.
Thought can reason about the current situation, and Tool Name can be the following types:
(1) evaluate_math, whose description is <desc>Evaluates a mathematical expression.</desc>. It takes arguments {'expression': {'type': 'string', 'description': 'Mathematical expression to evaluate'}} in JSON format.
(2) finish, whose description is <desc>Signals that the final outputs, i.e. `answer`, are now available and marks the task as complete.</desc>. It takes arguments {'kwargs': 'Any'} in JSON format.
User message:
[[ ## question ## ]]
What is 2 + 2?
[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
What is 2 + 2?
[[ ## tool_name_0 ## ]]
evaluate_math
[[ ## tool_args_0 ## ]]
{"expression": "2 + 2"}
[[ ## observation_0 ## ]]
4
[[ ## thought_1 ## ]]
What is 4?
[[ ## tool_name_1 ## ]]
evaluate_math
[[ ## tool_args_1 ## ]]
{"expression": "2 + 3"}
[[ ## observation_1 ## ]]
5
[[ ## thought_2 ## ]]
What is 4?
[[ ## tool_name_2 ## ]]
evaluate_math
[[ ## tool_args_2 ## ]]
{"expression": "2 + 3"}
[[ ## observation_2 ## ]]
5
[[ ## thought_3 ## ]]
What is 4?
[[ ## tool_name_3 ## ]]
evaluate_math
[[ ## tool_args_3 ## ]]
{"expression": "2 + 3"}
[[ ## observation_3 ## ]]
5
Respond with the corresponding output fields, starting with the field `[[ ## next_thought ## ]]`, then `[[ ## next_tool_name ## ]]` (must be formatted as a valid Python Literal['evaluate_math', 'finish']), then `[[ ## next_tool_args ## ]]` (must be formatted as a valid Python dict[str, Any]), and then ending with the marker for `[[ ## completed ## ]]`.
Response:
[[ ## next_thought ## ]]
What is 4?
[[ ## next_tool_name ## ]]
evaluate_math
[[ ## next_tool_args ## ]]
{"expression": "2 + 3"}
[[ ## completed ## ]]
[2025-03-28T12:00:56.396033]
System message:
Your input fields are:
1. `question` (str)
2. `trajectory` (str)
Your output fields are:
1. `reasoning` (str)
2. `answer` (float)
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## question ## ]]
{question}
[[ ## trajectory ## ]]
{trajectory}
[[ ## reasoning ## ]]
{reasoning}
[[ ## answer ## ]]
{answer} # note: the value you produce must be a single float value
[[ ## completed ## ]]
In adhering to this structure, your objective is:
Given the fields `question`, produce the fields `answer`.
User message:
[[ ## question ## ]]
What is 2 + 2?
[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
What is 2 + 2?
[[ ## tool_name_0 ## ]]
evaluate_math
[[ ## tool_args_0 ## ]]
{"expression": "2 + 2"}
[[ ## observation_0 ## ]]
4
[[ ## thought_1 ## ]]
What is 4?
[[ ## tool_name_1 ## ]]
evaluate_math
[[ ## tool_args_1 ## ]]
{"expression": "2 + 3"}
[[ ## observation_1 ## ]]
5
[[ ## thought_2 ## ]]
What is 4?
[[ ## tool_name_2 ## ]]
evaluate_math
[[ ## tool_args_2 ## ]]
{"expression": "2 + 3"}
[[ ## observation_2 ## ]]
5
[[ ## thought_3 ## ]]
What is 4?
[[ ## tool_name_3 ## ]]
evaluate_math
[[ ## tool_args_3 ## ]]
{"expression": "2 + 3"}
[[ ## observation_3 ## ]]
5
[[ ## thought_4 ## ]]
What is 4?
[[ ## tool_name_4 ## ]]
evaluate_math
[[ ## tool_args_4 ## ]]
{"expression": "2 + 3"}
[[ ## observation_4 ## ]]
5
Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## answer ## ]]` (must be formatted as a valid Python float), and then ending with the marker for `[[ ## completed ## ]]`.
Response:
[[ ## reasoning ## ]]
This is an example of the transitive property of equality, where if we know that 2 + 3 = 5, we can conclude that 4 = 5.
[[ ## answer ## ]]
5.0
[[ ## completed ## ]]
This looks like a common problem with a base model. Maybe try again with an instruct model?
I tried the query in the Ollama REPL with llama3.2:3b and it worked. When I have time, I can try different models and tool calling with different prompting techniques (ReAct, CoT). If that doesn't work, I'll dive into the code.
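In the meantime, capping the loop should at least stop the re-interpretation; this just uncomments the max_iters line from my repro (I haven't verified it on other questions):

react_module = dspy.ReAct(
    "question -> answer: float",
    tools=[math_tool],
    max_iters=1,  # one tool call, then the extraction step produces the answer
)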
Cheers, Andrew