Logical and Abstract Reasoning
Repository for the evaluation of Large Language Models on logical and abstract reasoning tasks
Installation
To get the code, clone the repository:
git clone https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.git
To install the dependencies in a virtual environment, use the following:
cd Logical-and-abstract-reasoning
python -m venv env/
source env/bin/activate
pip install -r requirements.txt
You may also need to install transformers directly from its GitHub repository:
pip install git+https://github.com/huggingface/transformers
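For convenience, the steps above can be run as one sequence (the install of transformers from source is only needed if the pinned release is not sufficient):

```bash
# Clone the repository and set up an isolated environment
git clone https://github.com/Strong-AI-Lab/Logical-and-abstract-reasoning.git
cd Logical-and-abstract-reasoning
python -m venv env/
source env/bin/activate

# Install the pinned dependencies
pip install -r requirements.txt

# Optional: install transformers from source
pip install git+https://github.com/huggingface/transformers
```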
Use
Evaluation
To evaluate a model in the repository, use the following command:
python run_evaluation.py config/model/<model_config.yaml> config/data/<data_config.yaml> --<kwarg_name> <kwarg>
Choose the model to evaluate with the <model_config.yaml> file and the dataset to evaluate it on with the <data_config.yaml> file. Any additional arguments can be passed as keyword arguments (e.g. a private API key for the GPT models).
By default, all results are saved to a CSV file in the logs/ folder. You can re-compute the metrics of an evaluation run from this file with the following command:
python src/evaluate/evaluator.py logs/<results_file.csv>
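As a concrete sketch, an evaluation run followed by a metric re-computation could look like this. The config file names and the --api_key argument below are hypothetical placeholders; use the files actually present in config/model/ and config/data/ and the keyword arguments your model requires:

```bash
# Hypothetical example: replace the placeholder config names with real files
# from config/model/ and config/data/.
python run_evaluation.py config/model/gpt-4.yaml config/data/reclor.yaml --api_key <YOUR_OPENAI_KEY>

# Re-compute the metrics from the CSV produced by the run above
# (the exact file name depends on the run).
python src/evaluate/evaluator.py logs/<results_file.csv>
```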
Fine-tuning
To fine-tune a model on a given dataset, run the following:
python run_finetuning.py config/model/<model_config.yaml> config/data/<data_config.yaml> config/trainer/<trainer_config.yaml>
The configuration files work the same way as for evaluation. The <model_config.yaml> file contains additional configuration for training. The logs are saved in fine-tuning-output/ and the model weights are saved in fine-tuning-saves/.
Currently, only HuggingFace models can be fine-tuned.
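As a sketch, a fine-tuning run on a HuggingFace model could be launched as follows; the config file names are again hypothetical placeholders:

```bash
# Hypothetical example: replace the placeholder config names with files
# from config/model/, config/data/ and config/trainer/.
python run_finetuning.py config/model/llama.yaml config/data/logiqa.yaml config/trainer/default.yaml

# Training logs are written to fine-tuning-output/,
# model weights to fine-tuning-saves/.
```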
LLaMA-based model instruction fine-tuning
We use the Stanford Alpaca training script for instruction fine-tuning of LLaMA-based models. If you want to instruction-fine-tune a LLaMA-based model, you can do so by following this link.
Models
| Inference Type | Model | Size | Task | Link | Remark |
|---|---|---|---|---|---|
| Logical Reasoning on Reading Comprehension | MERIt | - | Reading Comprehension | paper project | #3 on the ReClor leaderboard |
| | LReasoner | - | Reading Comprehension | paper project | #6 on the ReClor leaderboard |
| | AMR-LE | - | Reading Comprehension | project | #2 and #5 on the ReClor leaderboard |
| | LLaMA | - | Reading Comprehension | paper code | Open-source large language model |
| | LLaMA2 | - | Reading Comprehension | paper code | Open-source large language model |
| | TinyLLaMA | - | Reading Comprehension | paper code | Open-source small language model |
| | Alpaca | - | Reading Comprehension | code | Fine-tuned LLaMA |
| | Vicuna | - | Reading Comprehension | project code | Fine-tuned LLaMA |
| | ChatGPT | - | Reading Comprehension | paper project | Use the API to do prompt tuning |
| | GPT-4 | - | Reading Comprehension | paper project | Use the API to do prompt tuning |
| | Zephyr-7b-beta | - | Reading Comprehension | code | Fine-tuned Mistral-7B |
Datasets & Benchmarks
| Inference Type | Dataset | Size | Task | Link | Remark |
|---|---|---|---|---|---|
| Logical Reasoning on Reading Comprehension | ReClor | - | Reading Comprehension | paper project | Logical reasoning reading comprehension |
| | LogiQA | - | Reading Comprehension | paper project | Logical reasoning reading comprehension |
| | LogiQA V2 | - | Reading Comprehension | project | Logical reasoning reading comprehension |
| | LogiQA Logical Reasoning Plus | - | Reading Comprehension | project | Logical reasoning reading comprehension for out-of-distribution evaluation |
| Abstract Reasoning | ARC | - | Abstract Reasoning | paper code | Text version of a Visual Abstract Reasoning task |
| | ACRE | - | Abstract Reasoning | paper code | Text version of a Visual Abstract Reasoning task |
| | PVR | - | Abstract Reasoning | paper | Abstract Reasoning task |
| | RAVEN | - | Abstract Reasoning | paper project | Text version of a Visual Abstract Reasoning task |
| | Diagrammatic Logic | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Logic | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Logic Statements | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | Pattern Identification | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | String Patterns | - | Abstract Reasoning | code | Extracted from OpenAI Evals |
| | List Functions | - | Abstract Reasoning | code | Extracted from Google BIG-bench |
Acknowledgement
Our proposed new dataset, logiqa-logical-reasoning-plus, has been merged into OpenAI/Evals.