
Add tasks for performance on long context lengths

Open · nairbv opened this issue 10 months ago

There are a couple of papers with benchmarks for very long context lengths that don't seem to be available in lm-evaluation-harness. It would be great to have one of these, or something similar, for measuring a model's ability to extract information from long context windows, which is important for RAG (a rough usage sketch follows the list below).

  • LongBench: https://arxiv.org/abs/2308.14508
  • ∞ Bench: https://arxiv.org/abs/2402.13718
  • Are there any others that might be better?
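If such tasks were added, I'd expect them to be runnable through the harness's standard Python entry point. Below is a minimal sketch using `lm_eval.simple_evaluate` (an existing v0.4 API); the `longbench_narrativeqa` task name is hypothetical and just stands in for whatever long-context tasks get added:

```python
import lm_eval

# Hypothetical: evaluate a long-context task the same way any registered
# task is run today. "longbench_narrativeqa" does not exist in the harness
# yet; it stands in for the tasks this issue is requesting.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,max_length=32768",
    tasks=["longbench_narrativeqa"],  # assumed task name
    batch_size=1,
    limit=10,  # small sample for a smoke test
)
print(results["results"])
```

Registering the tasks themselves would presumably follow the harness's existing YAML task-config convention, same as the tasks already in the repo.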

nairbv · Apr 25, 2024

Needle-in-a-haystack might also be a nice-to-have, though I think more difficult / more "natural" long-context evals should be prioritized. A rough sketch of such a probe is below.
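For reference, a needle-in-a-haystack probe is simple enough to sketch directly. The snippet below assumes a `generate(prompt) -> str` callable for the model under test (a placeholder, not a harness API), and the filler text, needle, and token-budget heuristic are all illustrative:

```python
# Minimal needle-in-a-haystack probe: bury a fact at varying depths in
# filler text and check whether the model can retrieve it.
FILLER = "The grass is green. The sky is blue. The sun is yellow. "
NEEDLE = "The magic number is 42."
QUESTION = "\nWhat is the magic number? Answer with the number only:"

def build_prompt(context_tokens: int, depth: float) -> str:
    """Embed the needle at a relative depth inside filler text."""
    # Rough budget: ~4 characters per token for English filler.
    haystack = FILLER * (context_tokens * 4 // len(FILLER) + 1)
    pos = int(len(haystack) * depth)
    return haystack[:pos] + NEEDLE + haystack[pos:] + QUESTION

def run_probe(generate, context_tokens=8000,
              depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> float:
    """Return retrieval accuracy across several insertion depths."""
    hits = 0
    for depth in depths:
        answer = generate(build_prompt(context_tokens, depth))
        hits += "42" in answer
    return hits / len(depths)
```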

haileyschoelkopf · Apr 26, 2024

Any plans to release this feature?

Riskin1999 · Aug 16, 2024