
Support Logic Reasoning Benchmark

Open Ren-Ma opened this issue 1 year ago • 5 comments

This PR provides a draft evaluation for two common logic reasoning benchmarks (ProntoQA, ProofWriter), which test deductive reasoning, i.e., given a set of facts and rules, judge the correctness of a query. Solving this task requires strong abilities in parsing natural language into prover-specific symbolic language and in calling an external prover to solve the problem.

To ease the evaluation, the symbolic-language form is provided together with the dataset, so the only task for the agent is to correctly call the prover (pyke, a Python package).
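
For concreteness, here is a made-up instance in the style of ProntoQA; the field names and symbolic syntax below are purely illustrative and are not the dataset's actual schema:

```python
# A made-up ProntoQA-style instance, for illustration only; the real dataset's
# fields and symbolic syntax may differ.
example = {
    "context": (
        "Every cat is a mammal. Every mammal is an animal. Rex is a cat."
    ),
    "query": "True or false: Rex is an animal.",
    # The symbolic form ships with the dataset, so the agent only has to run
    # the prover rather than translate the natural language itself.
    "logic_program": [
        "Cat(x) -> Mammal(x)",
        "Mammal(x) -> Animal(x)",
        "Cat(rex)",
        "Query: Animal(rex)",
    ],
    "answer": "A",  # e.g., A = True, B = False
}
```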

The current draft is preliminary; the integration is not complete yet. I am still working on the instruction part (how do I let the agent know that it should use a Python package? Do I just tell it directly, or should I write the Python code for it? Thanks if anyone can show me an example).

Ren-Ma avatar May 22 '24 14:05 Ren-Ma

I am still working on the instruction part (how do I let the agent know that it should use a Python package? Do I just tell it directly, or should I write the Python code for it? Thanks if anyone can show me an example).

Do you mean adding this to the command?

yufansong avatar May 22 '24 17:05 yufansong

Just pushed a quick-and-dirty version, still a bit buggy :( . The biggest obstacle is how to let the agent call a custom-defined Python function to help solve the task. I copy the Python file to the workspace_mount_path and tell the agent to use the code in this file. Does this logic make sense? Thanks!

Ren-Ma avatar May 23 '24 15:05 Ren-Ma

This is a good question, maybe @xingyaoww can give some feedback.

neubig avatar May 23 '24 16:05 neubig

@Ren-Ma Yes! I think that should work for now (assuming the number of processes = 1) - before the task starts, you clean up the workspace, put the relevant code into the workspace, then ask the agent to look at /workspace and begin working!
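
For illustration, a minimal sketch of that setup step, assuming a single eval process; the paths and the helper-file name are placeholders rather than the benchmark's actual layout:

```python
import os
import shutil

def prepare_workspace(workspace_mount_path: str, helper_file: str = "logic_inference.py"):
    # Start from a clean workspace so instances do not interfere with each other
    # (this assumes only one eval process uses the directory at a time).
    if os.path.exists(workspace_mount_path):
        shutil.rmtree(workspace_mount_path)
    os.makedirs(workspace_mount_path)

    # Copy the helper code the agent is told to use.
    shutil.copy(helper_file, os.path.join(workspace_mount_path, helper_file))

    # The task instruction then points the agent at /workspace, for example:
    # "The file logic_inference.py in /workspace defines a LogicInferenceEngine
    #  class; use it to execute the provided logic program."
```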

xingyaoww avatar May 23 '24 16:05 xingyaoww

Let us know when the script is runnable (e.g., it takes an instruction as input and outputs a result) -- we can help make this more streamlined!

xingyaoww avatar May 23 '24 16:05 xingyaoww

Let us know when the script is runnable (e.g., it takes an instruction as input and outputs a result) -- we can help make this more streamlined!

Thank god, it finally works!! I just tested it on the first example of the ProntoQA dataset; see the action trace in the README.

Ren-Ma avatar May 26 '24 14:05 Ren-Ma

I ran it locally with gpt-3.5-turbo, and it runs successfully. But maybe 3.5 is not powerful enough: it outputs the answer, but not in the format you specify, and then gets stuck. Then I tried gpt-4-1106-preview; it doesn't get stuck, but the output format also doesn't seem to be what you want.

I see you have a function get_test_result to get the answer, but I worry it cannot correctly deal with different outputs like

The answer to the logic query is 'B'. If you have any further questions or tasks, feel free to ask!

In this case, your get_test_result function cannot extract the correct result B.

But I find that GPT will run the program you provide and get the final choice:

# Execute the logic programs
answer, flag, error_message = engine.safe_execute_program(logic_programs)
answer
02:33:18 - OBSERVATION
**IPythonRunCellObservation**
'B'

I think maybe we can check this observation result to better parse the final choice? Please correct me if my understanding is wrong.
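
As a rough sketch of that idea (the function and parameter names here are illustrative, and this is not the actual get_test_result implementation; the A-D answer alphabet is also an assumption), one could check the IPython cell output first and only fall back to the free-form message:

```python
import re

CHOICE_PATTERN = re.compile(r"\b([A-D])\b")

def extract_choice(final_message: str, last_cell_output: str | None = None):
    # Prefer the raw IPython cell output (e.g. the observed 'B'), since it is
    # less likely to be wrapped in conversational text.
    if last_cell_output:
        match = CHOICE_PATTERN.search(last_cell_output)
        if match:
            return match.group(1)
    # Fall back to the free-form message, e.g.
    # "The answer to the logic query is 'B'. If you have any further questions ..."
    matches = CHOICE_PATTERN.findall(final_message)
    return matches[-1] if matches else None
```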

The following is the log from gpt-4-1106-preview:


==============
STEP 1

02:33:17 - opendevin:INFO: llm.py:225 - Cost: 0.06 USD | Accumulated Cost: 0.10 USD
02:33:17 - ACTION
**IPythonRunCellAction**
THOUGHT: Great, now that the package is installed, I will proceed with the logic reasoning task. I will add the path to the code and instantiate the `LogicInferenceEngine` class, then use the `safe_execute_program` method to prove the logic programs.
CODE:
import sys
sys.path.append('/Users/yufansong/code/OpenDevin/workspace/_eval_workspace/13501')

from logic_inference import LogicInferenceEngine

# Define the dataset name and logic programs
dataset_name = "ProntoQA"
logic_programs = """

xxxxxxx
.... some log
xxxxxxx

# Instantiate the LogicInferenceEngine
workspace_mount_path = '/Users/yufansong/code/OpenDevin/workspace/_eval_workspace/13501'
engine = LogicInferenceEngine(dataset_name, workspace_mount_path)

# Execute the logic programs
answer, flag, error_message = engine.safe_execute_program(logic_programs)
answer
02:33:18 - OBSERVATION
**IPythonRunCellObservation**
'B'


==============
STEP 2

02:33:22 - opendevin:INFO: llm.py:225 - Cost: 0.05 USD | Accumulated Cost: 0.15 USD
02:33:22 - ACTION
**MessageAction** (source=None)
CONTENT: The answer to the logic query is 'B'. If you have any further questions or tasks, feel free to ask!
02:33:22 - ACTION
**MessageAction** (source=EventSource.AGENT)
CONTENT: The answer to the logic query is 'B'. If you have any further questions or tasks, feel free to ask!
02:33:22 - opendevin:INFO: agent_controller.py:160 - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
02:33:22 - OBSERVATION
**MessageAction** (source=EventSource.USER)
CONTENT: Please continue working on the task on whatever approach you think is suitable.
If you think you have solved the task, please run the following command: <execute_bash> exit </execute_bash>.
IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.

02:33:22 - opendevin:INFO: agent_controller.py:160 - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING

Pushed a few small edits! Now the answer should be correctly parsed from the messages in state.history.

Ren-Ma avatar May 27 '24 09:05 Ren-Ma

I'm not sure if I'm understanding correctly, but this implementation seems to be a bit different from the original ProntoQA. Here we feed the program to the agent in advance and just make it execute the program and obtain the result. Is that correct? Because I'm afraid it may be too easy for many models 🤔

ryanhoangt avatar May 28 '24 08:05 ryanhoangt

I'm not sure if I'm understanding correctly, but this implementation seems to be a bit different from the original ProntoQA.

I have not read the original paper. Could you tell me the difference between the original ProntoQA and this implementation?

Here we feed the program to the agent in advance and just make it execute the program and obtain the result. Is that correct? Because I'm afraid it may be too easy for many models 🤔

Do you have any other ideas?

yufansong avatar May 28 '24 16:05 yufansong

I have not read the original paper. Could you tell me the difference between the original ProntoQA and this implementation?

From my understanding, the original implementation gives the model the rules and facts via natural-language context, and from the query the model has to derive a series of proof steps to arrive at the answer. Regarding this implementation, I think the author converts the context and query to code in advance for the agent to execute and obtain the final result.

Do you have any other ideas?

I'm not sure, but I'm thinking about not feeding the raw program to the agent and instead letting it formulate the program by itself.

ryanhoangt avatar May 29 '24 02:05 ryanhoangt

I have not read the original paper. Could you tell me the difference between the original ProntoQA and this implementation?

From my understanding, the original implementation gives the model the rules and facts via natural-language context, and from the query the model has to derive a series of proof steps to arrive at the answer. Regarding this implementation, I think the author converts the context and query to code in advance for the agent to execute and obtain the final result.

Do you have any other ideas?

I'm not sure, but I'm thinking about not feeding the raw program to the agent and instead letting it formulate the program by itself.

You are definitely right. The raw ProntoQA dataset does not provide any symbolic-language expressions or corresponding programs. The logic of a neuro-symbolic method is to 1) parse the logic reasoning problem from natural language into the corresponding programs, and 2) feed the programs into an inference engine (pyke in this case) to get the answer. Here I skipped the first step and just let the agent do the job in the second step, because I assumed that in OpenDevin's benchmark the core ability we want to test is how to interact with the local environment to run the inference engine (please correct me if I am wrong). The first step is actually a semantic parsing task, and based on my personal experience, the correctness of this semantic parsing is terrible even for SOTA models (GPT-4/Claude/Llama 3...). If we also included the first step, the final performance of OpenDevin on this benchmark would be heavily influenced by the semantic parsing correctness.
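
To make the division of labor concrete, here is a minimal sketch of that two-step pipeline with step 1 skipped; the LogicInferenceEngine calls follow the log earlier in this thread, while translate_to_program and the instance fields are hypothetical:

```python
from logic_inference import LogicInferenceEngine

def solve(instance: dict, dataset_name: str, workspace_mount_path: str):
    # Step 1 (semantic parsing) is skipped here. It would look something like:
    #   logic_programs = translate_to_program(instance["context"], instance["query"])
    logic_programs = instance["logic_program"]  # symbolic form shipped with the dataset

    # Step 2: hand the program to the inference engine (pyke under the hood).
    engine = LogicInferenceEngine(dataset_name, workspace_mount_path)
    answer, flag, error_message = engine.safe_execute_program(logic_programs)
    return answer
```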

Ren-Ma avatar May 29 '24 03:05 Ren-Ma

You are definitely right. The raw ProntoQA dataset does not provide any symbolic-language expressions or corresponding programs. The logic of a neuro-symbolic method is to 1) parse the logic reasoning problem from natural language into the corresponding programs, and 2) feed the programs into an inference engine (pyke in this case) to get the answer. Here I skipped the first step and just let the agent do the job in the second step, because I assumed that in OpenDevin's benchmark the core ability we want to test is how to interact with the local environment to run the inference engine (please correct me if I am wrong). The first step is actually a semantic parsing task, and based on my personal experience, the correctness of this semantic parsing is terrible even for SOTA models (GPT-4/Claude/Llama 3...). If we also included the first step, the final performance of OpenDevin on this benchmark would be heavily influenced by the semantic parsing correctness.

Yeah, it seems reasonable to me, thanks for the explanation.

ryanhoangt avatar May 29 '24 04:05 ryanhoangt

@Ren-Ma btw, can you let run_infer.py or run_infer.sh output some final result like the accuracy rate? It will be convenient for us when running your benchmark.

yufansong avatar May 29 '24 05:05 yufansong

@Ren-Ma btw, can you let run_infer.py or run_infer.sh output some final result like the accuracy rate? It will be convenient for us when running your benchmark.

Done! Now we can quickly get the accuracy from the metadata.json. See README.md.
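
For reference, a minimal sketch of how such an accuracy number could be computed and written out; the per-result "correct" flag and the JSON field names are placeholders, not necessarily the benchmark's actual metadata schema:

```python
import json

def write_accuracy(results: list[dict], metadata_path: str = "metadata.json") -> float:
    # `results` is assumed to hold one dict per instance with a boolean "correct" flag.
    correct = sum(1 for r in results if r.get("correct"))
    accuracy = correct / len(results) if results else 0.0
    with open(metadata_path, "w") as f:
        json.dump({"num_instances": len(results), "accuracy": accuracy}, f, indent=2)
    return accuracy
```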

Ren-Ma avatar May 29 '24 09:05 Ren-Ma

LGTM. Hope someone else can also take a look before we merge it.

yufansong avatar May 29 '24 15:05 yufansong