Support MINT benchmark (MATH, GSM8K subset)

Open ryanhoangt opened this issue 1 year ago • 1 comments

This PR provides a draft evaluation integration for the MINT benchmark, which tests an agent's ability to solve tasks through multi-turn interaction, covering code generation, decision-making, and reasoning. I'm currently working on the MATH and GSM8K subsets.

The original repo is here.

The current draft is preliminary, and the integration process is not done yet.
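
For readers unfamiliar with the multi-turn setup, the interaction budget stated in the sample prompt further down (5 chances in total, at most 2 proposed solutions) could be sketched roughly as follows. This is only an illustration, not the code in this PR: `run_episode`, `agent_step`, and `check` are hypothetical names, and the error message for exhausted proposals is invented for the sketch. Only the "maximum number of iterations" message appears in the sample result.

```python
from typing import Callable, Dict, Tuple


def run_episode(
    agent_step: Callable[[int], Tuple[str, str]],
    check: Callable[[str], bool],
    max_iterations: int = 5,
    max_propose_solution: int = 2,
) -> Dict[str, object]:
    """Run one task under a turn budget and a solution-proposal budget.

    agent_step(turn) is a hypothetical callback returning (kind, content),
    where kind is "execute" (run code) or "solution" (propose an answer).
    """
    proposals = 0
    for turn in range(max_iterations):
        kind, content = agent_step(turn)
        if kind != "solution":
            continue  # a code-execution turn only consumes one iteration
        proposals += 1
        if check(content):
            return {"test_result": True, "error": None, "turns": turn + 1}
        if proposals >= max_propose_solution:
            # Hypothetical message, not taken from the PR.
            return {
                "test_result": False,
                "error": "Agent reached maximum number of solution proposals",
                "turns": turn + 1,
            }
    return {
        "test_result": False,
        "error": "Agent reached maximum number of iterations",
        "turns": max_iterations,
    }


# Scripted agent: one execution turn, one wrong answer, then the right one.
script = [("execute", "print(1)"), ("solution", "9"), ("solution", "10")]
print(run_episode(lambda turn: script[turn], lambda answer: answer == "10"))
```

An agent that never proposes a solution falls through the loop and ends with the same error string as in the sample record.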

ryanhoangt • May 21 '24 18:05

The evaluation for the MATH subset can now be run using bash ./evaluation/mint/run_infer.sh.

A sample result looks like this:

{
  "id": 0,
  "instance": {
    "task_name": "reasoning",
    "task_id": 0,
    "prompt": "What is the area of the region in the $xy-$plane that satisfies \\[\\lfloor x \\rfloor \\lfloor y \\rfloor = 16\\, ?\\]",
    "reference": "10.0",
    "metadata": {}
  },
  "instruction": "You are a helpful assistant assigned with the task of problem-solving.\nTo solve the task, you can only interact with the interactive Python (Jupyter Notebook) environment using <execute_ipython> tag. Other tools cannot be used.\nAt each turn, you should first provide your step-by-step thinking for solving the task. Your thought process should be enclosed using \"<thought>\" tag, for example: <thought> I need to print \"Hello World!\" </thought>.\n\nAfter that, you have two options:\n1) Interact with a Python programming environment and receive the corresponding output.\n2) Directly provide a solution by sending your answer to user through message that adheres to the required format for the given task. Your solution should be enclosed using \"<solution>\" tag, for example: The answer is <solution> A </solution>.\nEither you choose to interact with the Python environment or provide a solution, you need to send a message to the user to evaluate your response and provide feedback.\n\nYou have 5 chances to interact with the environment or propose a solution. You can only propose a solution 2 times.\n\n---\n\nTask:\nEvaluate $\\int\\int_C (x+y)^2 dx,dy$, where C is the circle defined by $x^2 + y^2 = M$, and M is the birth year of Li Bai.\nOutput format requirement: Round to three decimal places, using half up rounding rule.\n\nAssistant:\n<thought> To solve this problem, let's start by determining Li Bai's birth year. </thought>\n<execute> print(wikipedia_search('Li Bai')) </execute>\n\nObservation:\nPage: Li Bai\nSummary: Li Bai (Chinese: \u674e\u767d; pinyin: L\u01d0 B\u00e1i, 701\u2013762), also pronounced as Li Bo, courtesy name Taibai (Chinese: \u592a\u767d), was a Chinese poet, acclaimed from his own time to the present as a brilliant and romantic figure who took traditional poetic forms to new heights.\n\nAssistant:\n<thought>\nFrom the wiki_search result, we can see that Li Bai was born in 701, thus M = 701.\nTo evaluate the double integral of $(x+y)^2$ over the region $C$, where $C$ is the circle defined by $x^2 + y^2 = 701$, we should use polar coordinates.\nIn polar coordinates, $x = r\\cos(\\theta)$ and $y = r\\sin(\\theta)$, and the differential area element $dx,dy$ becomes $r,dr,d\\theta$.\n$\\int\\int_C (x+y)^2 dx,dy = \\int_0^{2\\pi}\\int_0^{\\sqrt{701}} (r\\cos(\\theta) + r\\sin(\\theta))^2r,dr,d\\theta$\nLet's use the SymPy package to calculate this.\n</thought>\n<execute>\nfrom sympy import symbols, integrate, cos, sin, pi\n\nr, theta = symbols('r theta')\n\n# the integrand\nintegrand = (r*cos(theta) + r*sin(theta))**2 * r\n\n# perform the double integration\ndouble_integral = integrate(integrate(integrand, (r, 0, 701 ** 0.5)), (theta, 0, 2*pi))\nprint(double_integral)\n</execute>\n\nObservation:\nOut[2]: 245700.5*pi\n\nAssistant:\n<thought> Let's round this result to three decimal places using half up rounding rule. </thought>\n<execute>\nimport math\nprint(round(245700.5*math.pi, 3))\n</execute>\n\nObservation:\n771890.886\n\nAssistant:\nThe answer is <solution> 771890.886 </solution>.\n\n\n---\n\n# Problem statement:\nTask:\nWhat is the area of the region in the $xy-$plane that satisfies \\[\\lfloor x \\rfloor \\lfloor y \\rfloor = 16\\, ?\\]\nIMPORTANT: You should ONLY interact with the environment provided to you or provide the solution inside <solution> tag AND NEVER ASK FOR HUMAN HELP.\n",
  "metadata": {
    "agent_class": "CodeActAgent",
    "model_name": "gpt-4-1106-preview",
    "max_iterations": 5,
    "max_propose_solution": 2,
    "eval_output_dir": "evaluation/evaluation_outputs/outputs/mint/CodeActAgent/gpt-4-1106-preview_maxiter_5",
    "start_time": "2024-05-24 22:45:09",
    "git_commit": "6aaae4ce1797bee7f1e76aa399e390ffa1442050"
  },
  "history": [...],
  "error": "Agent reached maximum number of iterations",
  "test_result": false
}
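
The reference value of 10.0 in this record can be checked independently. The condition floor(x) * floor(y) = 16 holds exactly on the unit squares [a, a+1) x [b, b+1) where a and b are integers with a * b = 16, so the area equals the number of such integer pairs, negative ones included. A minimal sketch of that count (the function name `floor_product_area` is made up for this example and is not part of the PR):

```python
def floor_product_area(target: int) -> int:
    """Area of {(x, y) : floor(x) * floor(y) == target} for a nonzero integer target.

    Each integer pair (a, b) with a * b == target contributes one unit square
    [a, a+1) x [b, b+1), so the area is the number of such pairs.
    """
    if target == 0:
        raise ValueError("the region is unbounded when target is 0")
    pairs = 0
    for a in range(-abs(target), abs(target) + 1):
        # b = target // a is uniquely determined whenever a divides target.
        if a != 0 and target % a == 0:
            pairs += 1
    return pairs


print(floor_product_area(16))  # divisors ±1, ±2, ±4, ±8, ±16 -> prints 10
```

So the expected answer is correct, and the failure in this record comes from the agent running out of iterations rather than from the reference.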

Pending improvements:

  1. Integrate other subsets similarly.
  2. Prompt tuning to maximize performance.
  3. Robust error handling.
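
As a possible starting point for the error-handling item, records shaped like the sample result above could be tallied by their `test_result` and `error` fields. The sketch below assumes the output is stored as one JSON record per line, which is an assumption and not something confirmed in this thread; `summarize` is a hypothetical helper, and the sample records are made up for illustration.

```python
import json
from collections import Counter
from typing import Dict, Iterable


def summarize(lines: Iterable[str]) -> Dict[str, object]:
    """Count passes and group failures by error message (assumes JSON Lines input)."""
    total = passed = malformed = 0
    errors: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            malformed += 1  # keep going instead of aborting the whole summary
            continue
        total += 1
        if record.get("test_result") is True:
            passed += 1
        if record.get("error"):
            errors[record["error"]] += 1
    return {
        "total": total,
        "passed": passed,
        "malformed": malformed,
        "errors": dict(errors),
    }


# Made-up records using only the id, error, and test_result fields.
sample = [
    '{"id": 0, "error": "Agent reached maximum number of iterations", "test_result": false}',
    '{"id": 1, "error": null, "test_result": true}',
    "not valid json",
]
print(summarize(sample))
```

Counting malformed lines separately keeps a single bad record from hiding the rest of the run.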

@xingyaoww, could you help review it?

ryanhoangt • May 24 '24 16:05

I tested locally with the first two examples, and both pass now. Can you try again? @yufansong

ryanhoangt • May 25 '24 13:05

OK, this time it works locally for me. It outputs some solutions.

yufansong • May 25 '24 14:05