lighteval [EVAL] Big-Bench Extra Hard (BBEH)

Evaluation short description

Google has released BBEH to compensate for the saturation of BBH in the latest generation of LLMs. Overall, it looks like a good benchmark for probing reasoning capabilities.

Evaluation metadata

Provide all available

Paper url: https://arxiv.org/pdf/2502.19187
Github url: https://github.com/google-deepmind/bbeh
Dataset url:

Mar 03 '25 15:03 lewtun

I'd like to implement this benchmark, if it's still up! Also, I found this unofficial hub upload of the dataset: https://huggingface.co/datasets/BBEH/bbeh . Since there's no official upload, can we use this one, or would it be better to create our own upload, similar to the one for the original BBH: https://huggingface.co/datasets/lighteval/bbh ?

Jul 04 '25 17:07 itsmejul

@NathanHB is this still relevant? I would be happy to work on this and add it to lighteval.

Nov 25 '25 09:11 jgyasu