lighteval
[EVAL] Big-Bench Extra Hard (BBEH)
Evaluation short description
Google has released BBEH to compensate for the saturation of BBH on the latest generation of LLMs. Overall, it looks like a good benchmark for probing reasoning capabilities.
Evaluation metadata
Provide all available
- Paper url: https://arxiv.org/pdf/2502.19187
- Github url: https://github.com/google-deepmind/bbeh
- Dataset url:
I'd like to implement this benchmark, if it's still up for grabs! I also found this unofficial Hub upload of the dataset: https://huggingface.co/datasets/BBEH/bbeh . Since there's no official upload, can we use this one, or would it be better to create our own upload, similar to the one for the original BBH: https://huggingface.co/datasets/lighteval/bbh ?
@NathanHB is this still relevant? I would be happy to work on this and add it to lighteval.