Add evaluation script for memory-augmented models (A-Mem, Mem0, etc.) on LongBench & LongBench v2
Hello there! Thank you very much for your great work. Below is a potential improvement that would make the benchmark more universal!
Background
LongBench and LongBench v2 are now standard long-context benchmarks, but the official repo only measures models that read the entire context at once (see the LongBench GitHub repo and longbench2.github.io). Memory-centric methods such as A-Mem (arXiv) and Mem0 (mem0.ai) process documents incrementally with external memory, so they cannot be fairly compared using the current scripts.
Feature request
Add a built-in evaluation pipeline (e.g. `memory_eval.py`) that:
- Streams each task context in fixed-size chunks to a user-supplied `MemoryWrapper`.
- Lets the wrapper retrieve/update memories and call the model to answer the query.
- Emits results in the same JSON format accepted by the LongBench leaderboard.
Minimal interface example:
```python
class MemoryWrapper:
    def reset(self): ...                      # clear memory before a new sample
    def feed(self, chunk: str): ...           # ingest the next context chunk
    def answer(self, query: str) -> str: ...  # retrieve memories and answer the query
```
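A rough sketch of what such a pipeline could look like, covering the three steps above (this assumes the Hugging Face `THUDM/LongBench` dataset fields and the JSONL prediction layout scored by the repo's existing evaluation script; `CHUNK_SIZE` and the character-based splitting are placeholder choices, not anything from the official repo):

```python
import json

from datasets import load_dataset

CHUNK_SIZE = 2048  # characters per chunk; a token-based splitter would also work


def run_memory_eval(wrapper: MemoryWrapper, task: str, out_path: str) -> None:
    # Illustrative only: field names follow the Hugging Face THUDM/LongBench
    # dataset and the pred-file layout consumed by the existing eval script.
    data = load_dataset("THUDM/LongBench", task, split="test")
    with open(out_path, "w", encoding="utf-8") as f:
        for sample in data:
            wrapper.reset()  # fresh memory for every sample
            context = sample["context"]
            # Stream the long context incrementally instead of one giant prompt.
            for start in range(0, len(context), CHUNK_SIZE):
                wrapper.feed(context[start:start + CHUNK_SIZE])
            pred = wrapper.answer(sample["input"])
            # One JSON object per line so the standard scoring tools can read it.
            f.write(json.dumps({
                "pred": pred,
                "answers": sample["answers"],
                "all_classes": sample["all_classes"],
                "length": sample["length"],
            }, ensure_ascii=False) + "\n")
```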
This would enable direct benchmarking of A-Mem, Mem0 and similar frameworks alongside vanilla long-context LLMs, without extra glue code.
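For illustration, a toy wrapper might look like the sketch below; the `llm_fn` callable and the keyword-overlap retrieval are placeholders for a real framework's memory store and search:

```python
class NaiveRetrievalWrapper(MemoryWrapper):
    """Purely illustrative adapter; an A-Mem or Mem0 integration would call
    the framework's own add/search APIs instead of this naive retrieval."""

    def __init__(self, llm_fn, top_k: int = 4):
        self.llm_fn = llm_fn       # user-supplied callable: prompt -> completion string
        self.top_k = top_k
        self.chunks: list[str] = []

    def reset(self):
        self.chunks.clear()

    def feed(self, chunk: str):
        self.chunks.append(chunk)  # a real framework would update its memory here

    def answer(self, query: str) -> str:
        # Rank stored chunks by word overlap with the query as a stand-in
        # for proper memory retrieval, then let the model answer.
        ranked = sorted(
            self.chunks,
            key=lambda c: len(set(query.lower().split()) & set(c.lower().split())),
            reverse=True,
        )
        context = "\n".join(ranked[: self.top_k])
        return self.llm_fn(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```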
Great feature! You’re welcome to submit a PR for it.