[Bug] New to DSPy: using Llama 3.2 with ReAct, 2 + 2 = 5
What happened?
I am new to DSPy. I have written a small DSPy program. The program calls evaluate_math five times and ends up with the result 5. I don't understand why a tool would be called multiple times and its result re-interpreted.
$ python llama_tool.py
2 + 2
2 + 3
2 + 3
2 + 3
2 + 3
5.0
Cheers, Andrew
Steps to reproduce
import dspy

def evaluate_math(expression: str) -> float:
    print(expression)  # log each expression the model asks the tool to evaluate
    return eval(expression)

lm = dspy.LM('ollama_chat/llama3.2:3b', api_base='http://localhost:11434')
dspy.configure(lm=lm)

math_tool = dspy.Tool(
    name="evaluate_math",
    desc="Evaluates a mathematical expression.",
    func=evaluate_math,
    args={
        "expression": {
            "type": "string",
            "description": "Mathematical expression to evaluate"
        }
    }
)

react_module = dspy.ReAct(
    "question -> answer: float",
    tools=[math_tool],
    # max_iters=1
)

# Execute the module with a question that requires the tool
response = react_module(question="What is 2 + 2?")
print(response.answer)  # Expected output: 4.0
DSPy version
2.6.13
@andrewfr Thanks for reporting the issue! You can quickly check whether this is a DSPy issue, a prompt issue, or an LM issue by pulling up the history:
dspy.inspect_history(n=5)
Running the command above will show you the prompts sent to the LM and its responses.
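For example, right after running the module (reusing react_module from your script):

response = react_module(question="What is 2 + 2?")
dspy.inspect_history(n=5)  # prints the prompts and raw responses of the last 5 LM calls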
Thanks for the advice. I think the immediate source of the problem is the inclusion of the question mark "?": the example works when it is omitted. I would consider this a bug. Perhaps the "?" is being misinterpreted?
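For reference, this is the minimal check I ran (reusing react_module from the script above):

response = react_module(question="What is 2 + 2")  # trailing "?" dropped
print(response.answer)  # now prints 4.0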
Here is the result of inspect_history:
2 + 2
2 + 3
2 + 3
2 + 3
2 + 3
5.0
[2025-03-28T12:00:56.357706]
System message:
Your input fields are:
1. `question` (str)
2. `trajectory` (str)
Your output fields are:
1. `next_thought` (str)
2. `next_tool_name` (Literal['evaluate_math', 'finish'])
3. `next_tool_args` (dict[str, Any])
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## question ## ]]
{question}
[[ ## trajectory ## ]]
{trajectory}
[[ ## next_thought ## ]]
{next_thought}
[[ ## next_tool_name ## ]]
{next_tool_name} # note: the value you produce must exactly match (no extra characters) one of: evaluate_math; finish
[[ ## next_tool_args ## ]]
{next_tool_args} # note: the value you produce must adhere to the JSON schema: {"type": "object"}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
Given the fields `question`, produce the fields `answer`.
You will be given `question` and your goal is to finish with `answer`.
To do this, you will interleave Thought, Tool Name, and Tool Args, and receive a resulting Observation.
Thought can reason about the current situation, and Tool Name can be the following types:
(1) evaluate_math, whose description is <desc>Evaluates a mathematical expression.</desc>. It takes arguments {'expression': {'type': 'string', 'description': 'Mathematical expression to evaluate'}} in JSON format.
(2) finish, whose description is <desc>Signals that the final outputs, i.e. `answer`, are now available and marks the task as complete.</desc>. It takes arguments {'kwargs': 'Any'} in JSON format.
User message:
[[ ## question ## ]]
What is 2 + 2?
[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
What is 2 + 2?
[[ ## tool_name_0 ## ]]
evaluate_math
[[ ## tool_args_0 ## ]]
{"expression": "2 + 2"}
[[ ## observation_0 ## ]]
4
Respond with the corresponding output fields, starting with the field `[[ ## next_thought ## ]]`, then `[[ ## next_tool_name ## ]]` (must be formatted as a valid Python Literal['evaluate_math', 'finish']), then `[[ ## next_tool_args ## ]]` (must be formatted as a valid Python dict[str, Any]), and then ending with the marker for `[[ ## completed ## ]]`.
Response:
[[ ## next_thought ## ]]
What is 4?
[[ ## next_tool_name ## ]]
evaluate_math
[[ ## next_tool_args ## ]]
{"expression": "2 + 3"}
[[ ## completed ## ]]
[2025-03-28T12:00:56.364382]
System message:
Your input fields are:
1. `question` (str)
2. `trajectory` (str)
Your output fields are:
1. `next_thought` (str)
2. `next_tool_name` (Literal['evaluate_math', 'finish'])
3. `next_tool_args` (dict[str, Any])
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## question ## ]]
{question}
[[ ## trajectory ## ]]
{trajectory}
[[ ## next_thought ## ]]
{next_thought}
[[ ## next_tool_name ## ]]
{next_tool_name} # note: the value you produce must exactly match (no extra characters) one of: evaluate_math; finish
[[ ## next_tool_args ## ]]
{next_tool_args} # note: the value you produce must adhere to the JSON schema: {"type": "object"}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
Given the fields `question`, produce the fields `answer`.
You will be given `question` and your goal is to finish with `answer`.
To do this, you will interleave Thought, Tool Name, and Tool Args, and receive a resulting Observation.
Thought can reason about the current situation, and Tool Name can be the following types:
(1) evaluate_math, whose description is <desc>Evaluates a mathematical expression.</desc>. It takes arguments {'expression': {'type': 'string', 'description': 'Mathematical expression to evaluate'}} in JSON format.
(2) finish, whose description is <desc>Signals that the final outputs, i.e. `answer`, are now available and marks the task as complete.</desc>. It takes arguments {'kwargs': 'Any'} in JSON format.
User message:
[[ ## question ## ]]
What is 2 + 2?
[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
What is 2 + 2?
[[ ## tool_name_0 ## ]]
evaluate_math
[[ ## tool_args_0 ## ]]
{"expression": "2 + 2"}
[[ ## observation_0 ## ]]
4
[[ ## thought_1 ## ]]
What is 4?
[[ ## tool_name_1 ## ]]
evaluate_math
[[ ## tool_args_1 ## ]]
{"expression": "2 + 3"}
[[ ## observation_1 ## ]]
5
Respond with the corresponding output fields, starting with the field `[[ ## next_thought ## ]]`, then `[[ ## next_tool_name ## ]]` (must be formatted as a valid Python Literal['evaluate_math', 'finish']), then `[[ ## next_tool_args ## ]]` (must be formatted as a valid Python dict[str, Any]), and then ending with the marker for `[[ ## completed ## ]]`.
Response:
[[ ## next_thought ## ]]
What is 4?
[[ ## next_tool_name ## ]]
evaluate_math
[[ ## next_tool_args ## ]]
{"expression": "2 + 3"}
[[ ## completed ## ]]
[2025-03-28T12:00:56.372805]
System message:
Your input fields are:
1. `question` (str)
2. `trajectory` (str)
Your output fields are:
1. `next_thought` (str)
2. `next_tool_name` (Literal['evaluate_math', 'finish'])
3. `next_tool_args` (dict[str, Any])
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## question ## ]]
{question}
[[ ## trajectory ## ]]
{trajectory}
[[ ## next_thought ## ]]
{next_thought}
[[ ## next_tool_name ## ]]
{next_tool_name} # note: the value you produce must exactly match (no extra characters) one of: evaluate_math; finish
[[ ## next_tool_args ## ]]
{next_tool_args} # note: the value you produce must adhere to the JSON schema: {"type": "object"}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
Given the fields `question`, produce the fields `answer`.
You will be given `question` and your goal is to finish with `answer`.
To do this, you will interleave Thought, Tool Name, and Tool Args, and receive a resulting Observation.
Thought can reason about the current situation, and Tool Name can be the following types:
(1) evaluate_math, whose description is <desc>Evaluates a mathematical expression.</desc>. It takes arguments {'expression': {'type': 'string', 'description': 'Mathematical expression to evaluate'}} in JSON format.
(2) finish, whose description is <desc>Signals that the final outputs, i.e. `answer`, are now available and marks the task as complete.</desc>. It takes arguments {'kwargs': 'Any'} in JSON format.
User message:
[[ ## question ## ]]
What is 2 + 2?
[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
What is 2 + 2?
[[ ## tool_name_0 ## ]]
evaluate_math
[[ ## tool_args_0 ## ]]
{"expression": "2 + 2"}
[[ ## observation_0 ## ]]
4
[[ ## thought_1 ## ]]
What is 4?
[[ ## tool_name_1 ## ]]
evaluate_math
[[ ## tool_args_1 ## ]]
{"expression": "2 + 3"}
[[ ## observation_1 ## ]]
5
[[ ## thought_2 ## ]]
What is 4?
[[ ## tool_name_2 ## ]]
evaluate_math
[[ ## tool_args_2 ## ]]
{"expression": "2 + 3"}
[[ ## observation_2 ## ]]
5
Respond with the corresponding output fields, starting with the field `[[ ## next_thought ## ]]`, then `[[ ## next_tool_name ## ]]` (must be formatted as a valid Python Literal['evaluate_math', 'finish']), then `[[ ## next_tool_args ## ]]` (must be formatted as a valid Python dict[str, Any]), and then ending with the marker for `[[ ## completed ## ]]`.
Response:
[[ ## next_thought ## ]]
What is 4?
[[ ## next_tool_name ## ]]
evaluate_math
[[ ## next_tool_args ## ]]
{"expression": "2 + 3"}
[[ ## completed ## ]]
[2025-03-28T12:00:56.383920]
System message:
Your input fields are:
1. `question` (str)
2. `trajectory` (str)
Your output fields are:
1. `next_thought` (str)
2. `next_tool_name` (Literal['evaluate_math', 'finish'])
3. `next_tool_args` (dict[str, Any])
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## question ## ]]
{question}
[[ ## trajectory ## ]]
{trajectory}
[[ ## next_thought ## ]]
{next_thought}
[[ ## next_tool_name ## ]]
{next_tool_name} # note: the value you produce must exactly match (no extra characters) one of: evaluate_math; finish
[[ ## next_tool_args ## ]]
{next_tool_args} # note: the value you produce must adhere to the JSON schema: {"type": "object"}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
Given the fields `question`, produce the fields `answer`.
You will be given `question` and your goal is to finish with `answer`.
To do this, you will interleave Thought, Tool Name, and Tool Args, and receive a resulting Observation.
Thought can reason about the current situation, and Tool Name can be the following types:
(1) evaluate_math, whose description is <desc>Evaluates a mathematical expression.</desc>. It takes arguments {'expression': {'type': 'string', 'description': 'Mathematical expression to evaluate'}} in JSON format.
(2) finish, whose description is <desc>Signals that the final outputs, i.e. `answer`, are now available and marks the task as complete.</desc>. It takes arguments {'kwargs': 'Any'} in JSON format.
User message:
[[ ## question ## ]]
What is 2 + 2?
[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
What is 2 + 2?
[[ ## tool_name_0 ## ]]
evaluate_math
[[ ## tool_args_0 ## ]]
{"expression": "2 + 2"}
[[ ## observation_0 ## ]]
4
[[ ## thought_1 ## ]]
What is 4?
[[ ## tool_name_1 ## ]]
evaluate_math
[[ ## tool_args_1 ## ]]
{"expression": "2 + 3"}
[[ ## observation_1 ## ]]
5
[[ ## thought_2 ## ]]
What is 4?
[[ ## tool_name_2 ## ]]
evaluate_math
[[ ## tool_args_2 ## ]]
{"expression": "2 + 3"}
[[ ## observation_2 ## ]]
5
[[ ## thought_3 ## ]]
What is 4?
[[ ## tool_name_3 ## ]]
evaluate_math
[[ ## tool_args_3 ## ]]
{"expression": "2 + 3"}
[[ ## observation_3 ## ]]
5
Respond with the corresponding output fields, starting with the field `[[ ## next_thought ## ]]`, then `[[ ## next_tool_name ## ]]` (must be formatted as a valid Python Literal['evaluate_math', 'finish']), then `[[ ## next_tool_args ## ]]` (must be formatted as a valid Python dict[str, Any]), and then ending with the marker for `[[ ## completed ## ]]`.
Response:
[[ ## next_thought ## ]]
What is 4?
[[ ## next_tool_name ## ]]
evaluate_math
[[ ## next_tool_args ## ]]
{"expression": "2 + 3"}
[[ ## completed ## ]]
[2025-03-28T12:00:56.396033]
System message:
Your input fields are:
1. `question` (str)
2. `trajectory` (str)
Your output fields are:
1. `reasoning` (str)
2. `answer` (float)
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## question ## ]]
{question}
[[ ## trajectory ## ]]
{trajectory}
[[ ## reasoning ## ]]
{reasoning}
[[ ## answer ## ]]
{answer} # note: the value you produce must be a single float value
[[ ## completed ## ]]
In adhering to this structure, your objective is:
Given the fields `question`, produce the fields `answer`.
User message:
[[ ## question ## ]]
What is 2 + 2?
[[ ## trajectory ## ]]
[[ ## thought_0 ## ]]
What is 2 + 2?
[[ ## tool_name_0 ## ]]
evaluate_math
[[ ## tool_args_0 ## ]]
{"expression": "2 + 2"}
[[ ## observation_0 ## ]]
4
[[ ## thought_1 ## ]]
What is 4?
[[ ## tool_name_1 ## ]]
evaluate_math
[[ ## tool_args_1 ## ]]
{"expression": "2 + 3"}
[[ ## observation_1 ## ]]
5
[[ ## thought_2 ## ]]
What is 4?
[[ ## tool_name_2 ## ]]
evaluate_math
[[ ## tool_args_2 ## ]]
{"expression": "2 + 3"}
[[ ## observation_2 ## ]]
5
[[ ## thought_3 ## ]]
What is 4?
[[ ## tool_name_3 ## ]]
evaluate_math
[[ ## tool_args_3 ## ]]
{"expression": "2 + 3"}
[[ ## observation_3 ## ]]
5
[[ ## thought_4 ## ]]
What is 4?
[[ ## tool_name_4 ## ]]
evaluate_math
[[ ## tool_args_4 ## ]]
{"expression": "2 + 3"}
[[ ## observation_4 ## ]]
5
Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## answer ## ]]` (must be formatted as a valid Python float), and then ending with the marker for `[[ ## completed ## ]]`.
Response:
[[ ## reasoning ## ]]
This is an example of the transitive property of equality, where if we know that 2 + 3 = 5, we can conclude that 4 = 5.
[[ ## answer ## ]]
5.0
[[ ## completed ## ]]
This looks like a common problem with a base model. Maybe try again with an instruct model?
I tried the query in the Ollama REPL with llama3.2:3b and it worked. When I have time, I can try different models and tool calling with different prompting techniques (ReAct, CoT). If that doesn't work, I'll dive into the code.
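In the meantime, capping the loop should at least stop the re-interpretation; this just uncomments the max_iters line from my repro (I haven't verified it on other questions):

react_module = dspy.ReAct(
    "question -> answer: float",
    tools=[math_tool],
    max_iters=1,  # one tool call, then the extraction step produces the answer
)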
Cheers, Andrew