Support Logic Reasoning Benchmark
This PR provides a draft evaluation for two common logic reasoning benchmarks (ProntoQA, ProofWriter), which test deductive reasoning, i.e., judging the correctness of a query given a set of facts and rules. Solving this task requires strong abilities in parsing natural language into a prover-specific symbolic language and in calling an external prover to solve the problem.
To ease the evaluation, the symbolic language is provided together with the dataset, so the only task for the agent is to correctly call the prover (pyke, a Python package).
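To make the intended flow concrete, here is a minimal sketch of what the agent ends up doing per example (the `LogicInferenceEngine` class and its `safe_execute_program` method come from the helper file shipped with the benchmark; the paths and the program string below are placeholders, not real data):

```python
# Minimal sketch of the per-example flow; paths and program are placeholders.
import sys

workspace = "/workspace"  # wherever the helper file logic_inference.py is copied
sys.path.append(workspace)

from logic_inference import LogicInferenceEngine

dataset_name = "ProntoQA"
logic_programs = "..."  # pre-parsed symbolic program provided with the example

engine = LogicInferenceEngine(dataset_name, workspace)
answer, flag, error_message = engine.safe_execute_program(logic_programs)
print(answer)  # e.g. 'A' or 'B', compared against the gold label
```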
The current draft is preliminary. The integration process has not been completed yet. I am still working on the instruction part (how do I let the agent know that it should use a Python package? Do I just tell it directly, or should I write the Python code for it? Thanks if anyone can show me an example).
> I am still working on the instruction part (how do I let the agent know that it should use a Python package? Do I just tell it directly, or should I write the Python code for it? Thanks if anyone can show me an example).
Do you mean add this into cmd?
Just pushed a quick-and-dirty version, still a bit buggy :( . The biggest obstacle is how to let the agent call a custom-defined Python function to help solve the task. I copy the Python file to the `workspace_mount_path` and tell the agent to use the code in this file. Does this logic make sense? Thanks!
This is a good question, maybe @xingyaoww can give some feedback.
@Ren-Ma Yes! I think temporarily that should work (if we are assuming the number of processes = 1) - before the task starts, you clean up the workspace, put the relevant code into the workspace, then ask the agent to look at /workspace and begin working!
Let us know when the script is runnable (e.g., you input an instruction and it outputs a result) -- we can help make this more streamlined!
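For concreteness, the per-instance setup could look roughly like this (a sketch only; the function name, paths, and instruction text are placeholders, not the final `run_infer.py`):

```python
import os
import shutil

def prepare_instance_workspace(workspace_mount_path: str, helper_file: str) -> str:
    """Sketch of the 'clean up, copy code, instruct agent' flow described above.
    Only safe when running with a single process."""
    # Reset whatever the previous instance left in the workspace.
    if os.path.exists(workspace_mount_path):
        shutil.rmtree(workspace_mount_path)
    os.makedirs(workspace_mount_path)

    # Drop the pyke helper (logic_inference.py) into the workspace.
    shutil.copy(helper_file, workspace_mount_path)

    # Instruction handed to the agent at the start of the task.
    return (
        "Look at the files under /workspace. Use the LogicInferenceEngine class "
        "in logic_inference.py to execute the provided logic programs and report "
        "the final answer."
    )
```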
> Let us know when the script is runnable (e.g., you input an instruction and it outputs a result) -- we can help make this more streamlined!
Thank god, it finally works!! I just tested on the first example of the ProntoQA dataset; see the action trace in the README.
I ran it locally via `gpt-3.5-turbo`. It can run successfully, but maybe 3.5 is not powerful enough: it outputs the answer, but not in the format you specify, and then gets stuck. Then I tried `gpt-4-1106-preview`; it does not get stuck, but the output format also doesn't seem to be what you want. I see you have a `get_test_result` function to get the answer, but I worry it cannot correctly deal with different outputs like `The answer to the logic query is 'B'. If you have any further questions or tasks, feel free to ask!` In this case, your `get_test_result` function cannot extract the correct result `B`. But I find that GPT will run the program you provide and get the final choice:

```
# Execute the logic programs
answer, flag, error_message = engine.safe_execute_program(logic_programs)
answer

02:33:18 - OBSERVATION
**IPythonRunCellObservation**
'B'
```

I think maybe we can check this observation result to better parse the final choice? If my understanding is wrong, please correct me.
The following is the log from `gpt-4-1106-preview`:

```
==============
STEP 1

02:33:17 - opendevin:INFO: llm.py:225 - Cost: 0.06 USD | Accumulated Cost: 0.10 USD
02:33:17 - ACTION
**IPythonRunCellAction**
THOUGHT: Great, now that the package is installed, I will proceed with the logic reasoning task. I will add the path to the code and instantiate the `LogicInferenceEngine` class, then use the `safe_execute_program` method to prove the logic programs.
CODE:
import sys
sys.path.append('/Users/yufansong/code/OpenDevin/workspace/_eval_workspace/13501')
from logic_inference import LogicInferenceEngine

# Define the dataset name and logic programs
dataset_name = "ProntoQA"
logic_programs = """
xxxxxxx
.... some log
xxxxxxx

# Instantiate the LogicInferenceEngine
workspace_mount_path = '/Users/yufansong/code/OpenDevin/workspace/_eval_workspace/13501'
engine = LogicInferenceEngine(dataset_name, workspace_mount_path)

# Execute the logic programs
answer, flag, error_message = engine.safe_execute_program(logic_programs)
answer

02:33:18 - OBSERVATION
**IPythonRunCellObservation**
'B'

==============
STEP 2

02:33:22 - opendevin:INFO: llm.py:225 - Cost: 0.05 USD | Accumulated Cost: 0.15 USD
02:33:22 - ACTION
**MessageAction** (source=None)
CONTENT: The answer to the logic query is 'B'. If you have any further questions or tasks, feel free to ask!
02:33:22 - ACTION
**MessageAction** (source=EventSource.AGENT)
CONTENT: The answer to the logic query is 'B'. If you have any further questions or tasks, feel free to ask!
02:33:22 - opendevin:INFO: agent_controller.py:160 - Setting agent(CodeActAgent) state from AgentState.RUNNING to AgentState.AWAITING_USER_INPUT
02:33:22 - OBSERVATION
**MessageAction** (source=EventSource.USER)
CONTENT: Please continue working on the task on whatever approach you think is suitable. If you think you have solved the task, please run the following command: <execute_bash> exit </execute_bash>.
IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP OR USE THE INTERNET TO SOLVE THIS TASK.
02:33:22 - opendevin:INFO: agent_controller.py:160 - Setting agent(CodeActAgent) state from AgentState.AWAITING_USER_INPUT to AgentState.RUNNING
```
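Roughly what I mean, as a sketch (the attribute name on the history events is a guess here, not the exact OpenDevin type, and the regexes are only illustrative):

```python
import re

def extract_final_answer(history) -> str:
    """Prefer the raw IPython cell output (e.g. 'B'); fall back to scanning
    free-form messages such as "The answer to the logic query is 'B'."
    `event.content` is an assumed attribute, not the exact OpenDevin API."""
    answer = ""
    for event in history:
        text = (getattr(event, "content", "") or "").strip()
        # Case 1: the observation is just the quoted choice, e.g. 'B'
        m = re.fullmatch(r"'?([A-E])'?", text)
        if m:
            answer = m.group(1)
            continue
        # Case 2: a sentence mentioning the answer; take the letter after "answer"
        m = re.search(r"answer\b.*?\b([A-E])\b", text)
        if m:
            answer = m.group(1)
    return answer
```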
Pushed a few edits! Now the answer should be correctly parsed from the message in `state.history`.
I'm not sure if I'm understanding correctly, but this implementation seems to be a bit different from the original ProntoQA. Here we feed the program in advance to the agent and just make them execute it and obtain the result. Is that correct? Cuz I'm afraid it may be too easy for many models 🤔
> I'm not sure if I'm understanding correctly, but this implementation seems to be a bit different from the original `ProntoQA`.
I have not read the original paper. Could you tell me the difference between the original `ProntoQA` and this implementation?
> Here we feed the program in advance to the agent and just make them execute it and obtain the result. Is that correct? Cuz I'm afraid it may be too easy for many models 🤔
Do you have any other idea?
> I have not read the original paper. Could you tell me the difference between the original `ProntoQA` and this implementation?
From my understanding, the original implementation gives the model the rules and facts via natural-language context, and from the query the model has to derive the series of proof steps to come up with the answer. Regarding this implementation, I think the author is trying to convert the context and query to code for the agent to execute and obtain the final result.
> Do you have any other idea?
I'm not sure, but I'm thinking about not feeding the raw program to the agent and instead letting it formulate the program by itself.
You are definitely right. The raw ProntoQA dataset does not provide any symbolic-language expressions or corresponding programs. The logic of a neuro-symbolic method is to 1) parse the logic reasoning problem from natural language into corresponding programs, and 2) feed the programs into an inference engine (pyke in this case) to get the answer. Here I skipped the first step and just let the agent do the second step, because I assumed that the core ability we want to test in OpenDevin's benchmark is how to interact with the local environment to call the inference engine (please correct me if I am wrong). The first step is actually a semantic parsing task. Based on my personal experience, the correctness of SOTA models (GPT-4/Claude/Llama 3...) on this semantic parsing task is terrible. If we also included the first step, the final performance of OpenDevin on this benchmark would be heavily influenced by the semantic parsing correctness.
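To make the split concrete, here is a sketch of the two steps (`parse_to_logic_program` is just a placeholder name for the skipped step, not a function in this PR; only `solve` reflects what the agent is asked to do):

```python
from logic_inference import LogicInferenceEngine

# Step 1 (skipped in this benchmark): semantic parsing from natural language
# to a symbolic program. The pre-parsed program ships with each example instead.
def parse_to_logic_program(context: str, query: str) -> str:  # placeholder only
    raise NotImplementedError("not part of this benchmark")

# Step 2 (what the agent is evaluated on): execute the provided program with
# the pyke-backed engine and read off the answer.
def solve(logic_programs: str, workspace: str = "/workspace") -> str:
    engine = LogicInferenceEngine("ProntoQA", workspace)
    answer, flag, error_message = engine.safe_execute_program(logic_programs)
    return answer
```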
Yeah, it seems reasonable to me, thanks for the explanation.
@Ren-Ma btw, can you make `run_infer.py` or `run_infer.sh` output some final result like the accuracy rate? It will be convenient for us when running your benchmark.
Done! Now we can quickly get the accuracy from `metadata.json`. See `README.md`.
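For reference, the aggregation behind that number looks roughly like this (a sketch only; the `results` shape and the `accuracy` key are assumptions, not necessarily the exact schema the script writes):

```python
import json

def write_accuracy(results, metadata_file: str = "metadata.json") -> float:
    """Sketch: aggregate per-instance correctness into an accuracy number and
    store it in the metadata file. `results` is assumed to be a list of dicts
    with a boolean 'correct' field; the key names are illustrative."""
    total = len(results)
    correct = sum(int(bool(r.get("correct"))) for r in results)
    accuracy = correct / total if total else 0.0

    # Read the existing metadata, add the accuracy, and write it back.
    with open(metadata_file) as f:
        metadata = json.load(f)
    metadata["accuracy"] = accuracy
    with open(metadata_file, "w") as f:
        json.dump(metadata, f, indent=2)
    return accuracy
```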
LGTM. Hope someone else can also take a look before we merge it.