Add BigBench Tasks for evaluation
Hi, it would be cool to evaluate all OpenAI models on the Beyond the Imitation Game Benchmark (BIG-bench), a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. The more than 200 tasks included in BIG-bench are summarized by keyword here, and by task name here. A paper introducing the benchmark, including evaluation results on large language models, is currently under review and is available as a preprint.
I believe a significant part of BIG-bench requires logprobs, which our API doesn't currently support. However, feel free to open a PR to add BIG-bench evals!
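For context on why logprobs matter here: many BIG-bench tasks are multiple-choice, and the standard scoring compares the log-probability the model assigns to each candidate answer as a continuation of the prompt, picking the highest. A minimal sketch of that scoring step, using made-up stand-in logprob values rather than real API output (`pick_answer` and `score_example` are hypothetical helper names, not part of the evals framework):

```python
# Sketch of multiple-choice scoring via per-option logprobs.
# `option_logprobs` maps each candidate answer to the (stand-in)
# log-probability the model assigned it as a continuation.

def pick_answer(option_logprobs):
    """Return the option with the highest model log-probability."""
    return max(option_logprobs, key=option_logprobs.get)

def score_example(option_logprobs, target):
    """1.0 if the argmax option matches the gold answer, else 0.0."""
    return 1.0 if pick_answer(option_logprobs) == target else 0.0

example = {"Yes": -0.4, "No": -1.6, "Maybe": -2.3}
print(score_example(example, "Yes"))  # → 1.0
```

Without per-option logprobs from the API, this comparison can't be done directly, which is why that subset of tasks is hard to port as-is.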