cobbler
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators".
- Your directory structure should now look like this, with both repositories at the same level:

  Working directory
  └───competitive-llms
  └───talkative-llms
- cd into `competitive-llms` and install the requirements:

  pip install -r requirements.txt
- In each file, there are various `sys.path.append` calls in which you should specify your home directory, i.e., the path under which `competitive-llms` is located (see the sketch below). Once that is set, everything should be runnable.
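For illustration, such an edit might look like the following minimal sketch; the placeholder path `/home/<username>` is an assumption, and the exact files containing these calls are determined by the repository:

```python
import sys

# Placeholder: point this at the directory where competitive-llms
# is checked out on your machine (e.g., under your home directory).
sys.path.append("/home/<username>/competitive-llms")
```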
CoBBLEr: Cognitive Bias Benchmark for LLMs as Evaluators
To replicate the results, you can use the aggregated responses provided in the `n15_responses` folder.
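As a starting point, a minimal sketch for loading the aggregated responses might look like this; it assumes the files in `n15_responses` are JSON, so check the folder for the actual file format and schema:

```python
import json
from pathlib import Path

# Assumption: the aggregated responses are stored as JSON files
# directly inside the n15_responses folder.
responses = {}
for path in sorted(Path("n15_responses").glob("*.json")):
    with path.open() as f:
        responses[path.stem] = json.load(f)

print(f"Loaded {len(responses)} aggregated response files")
```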
Adding your own model
To evaluate your own language model, add a config for your model in the `configs` folder under the `competitive-llms` directory.
To benchmark your model on each bias module:
- Add your model to `evaluations/model_configs.py`, specifying the path to your model's config file.
- Add your model to the list of evaluators in `evaluate.py` (a hypothetical sketch of both edits follows this list).
- To run each bias benchmark, invoke the script from the `competitive-llms` directory, e.g.:

  python3 evaluations/evaluate.py 1 order

  which runs the first batch of the defined model list on the order benchmark.
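The following is a hypothetical sketch of the two registration steps above; the actual structure of `evaluations/model_configs.py` and of the evaluators array in `evaluate.py` may differ, so adapt the dictionary/list names and the config path (`configs/my_model.yaml` is a placeholder) to what the repository defines:

```python
# evaluations/model_configs.py (sketch): map your model name to its config file.
MODEL_CONFIGS = {
    # ... existing models ...
    "my-model": "configs/my_model.yaml",  # placeholder path to your model's config
}

# evaluate.py (sketch): include your model in the evaluators array so it is benchmarked.
EVALUATORS = [
    # ... existing models ...
    "my-model",
]
```

Once registered, the model is picked up when the benchmark script is invoked as shown above, e.g. `python3 evaluations/evaluate.py 1 order`.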