Qwen2.5-Math

Abnormal TIR experiment results

Open wangzhihao-coder opened this issue 1 year ago • 6 comments

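I followed the TIR prompt and ran experiments on the 1.5B and 7B Qwen2.5-Math models, but the resulting metrics are worse than CoT. I suspect my implementation is missing some steps. Could you describe the implementation in more detail? I implemented TIR based on the prompt below: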

# TIR
messages = [
    {"role": "system", "content": "Please integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}."},
    {"role": "user", "content": prompt}
]

wangzhihao-coder avatar Oct 09 '24 02:10 wangzhihao-coder

I ran into the same problem.

rangmiao avatar Oct 09 '24 08:10 rangmiao

Hi, same for me, for both 1.5B and 7B. My numbers so far:

| Model | MATH CoT | MATH CoT (maj@8) | MATH TIR |
| --- | --- | --- | --- |
| Qwen2.5-1.5B-Instruct | 0.7646 | N/A | 0.7534 |
| Qwen2.5-7B-Instruct | 0.8408 | 0.8804 | 0.8088 |

The CoT numbers are a bit higher than reported because I used MARIO_EVAL to compare results with ground truth. I'm using vLLM for inference, with top_p=1.0 and temperature=0.
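For readers unfamiliar with the maj@8 column: it is usually computed by sampling several completions per problem and taking the most common final answer. Below is a minimal sketch of that idea, not the evaluation code used above. The helper names `extract_boxed` and `majority_vote` are hypothetical, and answers are compared as raw strings, whereas a real evaluator such as MARIO_EVAL would check mathematical equivalence.

```python
from collections import Counter


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    # Walk forward, tracking brace depth so nested braces (e.g. \frac{1}{2}) survive.
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces


def majority_vote(completions):
    """Pick the most frequent boxed answer (plain string match; an assumption)."""
    answers = [a for a in map(extract_boxed, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


samples = ["... so \\boxed{5}", "thus \\boxed{5}", "hence \\boxed{4}", "no final answer"]
print(majority_vote(samples))
```

Completions without a boxed answer are simply ignored here; whether they should instead count as wrong votes is a design choice the thread does not specify.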


Failure cases

I checked a few failure cases and they looked strange to me. Take problem 50 in the MATH test benchmark, for example. The model called a function, sympy.complete_square, which clearly does not exist:

<|im_start|>system
Please integrate natural language reasoning with programs to solve the problem above, and put your final answer within \boxed{}.<|im_end|>
<|im_start|>user
Find the radius of the circle with equation $x^2 + 8x + y^2 - 6y = 0$.<|im_end|>
<|im_start|>assistant
To find the radius of the circle given by the equation \(x^2 + 8x + y^2 - 6y = 0\), we need to rewrite the equation in the standard form of a circle's equation \((x - h)^2 + (y - k)^2 = r^2\). This process is called completing the square.

Let's break it down step-by-step:

1. Start with the given equation:
   \[
   x^2 + 8x + y^2 - 6y = 0
   \]

2. Complete the square for the \(x\) terms. The coefficient of \(x\) is 8, so we take half of 8 (which is 4) and square it (which is 16). Add and subtract 16 inside the equation:
   \[
   x^2 + 8x + 16 - 16 + y^2 - 6y = 0
   \]
   This can be rewritten as:
   \[
   (x + 4)^2 - 16 + y^2 - 6y = 0
   \]

3. Next, complete the square for the \(y\) terms. The coefficient of \(y\) is -6, so we take half of -6 (which is -3) and square it (which is 9). Add and subtract 9 inside the equation:
   \[
   (x + 4)^2 - 16 + y^2 - 6y + 9 - 9 = 0
   \]
   This can be rewritten as:
   \[
   (x + 4)^2 + (y - 3)^2 - 25 = 0
   \]

4. Finally, move the constant term to the right side of the equation:
   \[
   (x + 4)^2 + (y - 3)^2 = 25
   \]

Now the equation is in the standard form \((x - h)^2 + (y - k)^2 = r^2\), where \((h, k)\) is the center of the circle and \(r\) is the radius. From the equation \((x + 4)^2 + (y - 3)^2 = 25\), we can see that the center of the circle is \((-4, 3)\) and the radius \(r\) is \(\sqrt{25} = 5\).

So, the radius of the circle is \(5\).

Let's confirm this with Python code.
```python
import sympy as sp

# Define the variables
x, y = sp.symbols('x y')

# Define the equation of the circle
equation = x**2 + 8*x + y**2 - 6*y

# Complete the square for x and y terms
x_term = sp.complete_square(x**2 + 8*x, full=True)[0]
y_term = sp.complete_square(y**2 - 6*y, full=True)[0]

# Rewrite the equation
completed_square_eq = x_term + y_term - sp.complete_square(x**2 + 8*x, full=True)[1] - sp.complete_square(y**2 - 6*y, full=True)[1]

# Simplify the equation
completed_square_eq = sp.simplify(completed_square_eq)

# Extract the radius
radius_squared = completed_square_eq.rhs
radius = sp.sqrt(radius_squared)

print(radius)
```

Does the Qwen team have an in-house version of SymPy containing functions that do not exist in the main repo?
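For reference, the check the model was attempting needs no special helper. A minimal sketch using only the standard library: for a circle written as x^2 + d*x + y^2 + e*y + f = 0, completing the square gives r^2 = d^2/4 + e^2/4 - f. The function name `circle_radius` is my own, not part of any library.

```python
import math
from fractions import Fraction


def circle_radius(d, e, f):
    """Radius of x^2 + d*x + y^2 + e*y + f = 0, obtained by completing the square."""
    # (x + d/2)^2 + (y + e/2)^2 = d^2/4 + e^2/4 - f
    r_squared = Fraction(d) ** 2 / 4 + Fraction(e) ** 2 / 4 - Fraction(f)
    if r_squared <= 0:
        raise ValueError("equation does not describe a circle with positive radius")
    return math.sqrt(r_squared)


# The problem above: x^2 + 8x + y^2 - 6y = 0
print(circle_radius(8, -6, 0))  # 5.0
```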

hav4ik avatar Oct 20 '24 18:10 hav4ik

The Python code could be hallucinated, which would explain the function that cannot be found.

Maybe check this: https://github.com/QwenLM/Qwen-Agent/blob/a9ef165971d37ac37ec1df9e565ca46a38b0afab/examples/tir_math.py

# We use the following two systems to distinguish between COT mode and TIR mode
TIR_SYSTEM = """Please integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}."""
COT_SYSTEM = """Please reason step by step, and put your final answer within \\boxed{}."""
            'The dependencies for Python Executor support are not installed. '
            'Please install the required dependencies by running: pip install "qwen-agent[python_executor]"') from e
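The Python executor dependency mentioned in that excerpt suggests that TIR is meant to run the generated code and feed the result back to the model, rather than only swapping the system prompt. Below is a rough sketch of such a loop, not the Qwen-Agent implementation. Everything in it is an assumption for illustration: the `generate` callable (a hypothetical wrapper around the model that returns the next chunk of text), the helper names, and the convention of returning results in an `output` fence.

```python
import re
import subprocess
import sys

FENCE = "`" * 3  # built programmatically so this example nests cleanly in markdown
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def extract_last_code_block(text):
    """Return the last fenced python block in the model output, or None."""
    blocks = CODE_BLOCK.findall(text)
    return blocks[-1] if blocks else None


def run_code(code, timeout=10):
    """Run code in a fresh interpreter; return stdout, or the final stderr line on failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "TimeoutError: execution exceeded the time limit"
    if proc.returncode != 0:
        lines = proc.stderr.strip().splitlines()
        return lines[-1] if lines else "Error"
    return proc.stdout.strip()


def solve_with_tools(generate, prompt, max_rounds=4):
    """Alternate between generation and execution until no new code appears.

    `generate(text) -> str` is a hypothetical callable; the output-fence
    format appended below is an assumed convention, not a confirmed one.
    """
    transcript = prompt
    for _ in range(max_rounds):
        completion = generate(transcript)
        transcript += completion
        code = extract_last_code_block(completion)
        if code is None:
            break  # the model answered without requesting execution
        result = run_code(code)
        transcript += "\n" + FENCE + "output\n" + result + "\n" + FENCE + "\n"
    return transcript
```

With a loop like this, a call to a nonexistent function comes back to the model as an error message in the observation, which gives it a chance to correct itself in the next round. Running untrusted model code in a plain subprocess is not a sandbox, so a real setup would need stronger isolation.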


hengck23 avatar Oct 23 '24 04:10 hengck23

The same problem here. For 7B-instruct, I got 77% on GSM8K with TIR and 95.6% with CoT.

XinXU-USTC avatar Nov 12 '24 08:11 XinXU-USTC

> The same problem here. For 7B-instruct, I got 77% on GSM8K with TIR and 95.6% with CoT.

Sorry to disturb you. Did you manage to reproduce the results for the Qwen2.5-Math base models reported in the paper? I only reached ~70% accuracy on GSM8K, which is far from the number in the paper.

1998v7 avatar Jun 19 '25 05:06 1998v7

> > The same problem here. For 7B-instruct, I got 77% on GSM8K with TIR and 95.6% with CoT.
>
> Sorry to disturb you. Did you manage to reproduce the results for the Qwen2.5-Math base models reported in the paper? I only reached ~70% accuracy on GSM8K, which is far from the number in the paper.

Sorry. I have not evaluated the base models.

XinXU-USTC avatar Jun 23 '25 03:06 XinXU-USTC