Add evaluation script for memory-augmented models (A-Mem, Mem0, etc.) on LongBench & LongBench v2
Hello there! Thank you very much for your great work. Below is a potential improvement that would make the benchmark more universal!
Background
LongBench and LongBench v2 are now standard long-context benchmarks, but the official repo only measures models that read the entire context at once (see the LongBench GitHub repo and longbench2.github.io). Memory-centric methods such as A-Mem (arXiv) and Mem0 (mem0.ai) process documents incrementally with external memory, so they cannot be fairly compared using the current scripts.
Feature request
Add a built-in evaluation pipeline (e.g. `memory_eval.py`) that:
- Streams each task context in fixed-size chunks to a user-supplied `MemoryWrapper`.
- Lets the wrapper retrieve/update memories and call the model to answer the query.
- Emits results in the same JSON format accepted by the LongBench leaderboard.
Minimal interface example:
```python
class MemoryWrapper:
    def reset(self): ...                      # clear memory before a new sample
    def feed(self, chunk: str): ...           # ingest the next context chunk
    def answer(self, query: str) -> str: ...  # retrieve memories and answer the query
```
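A rough sketch of what such a pipeline could look like, covering the three steps above (this assumes the Hugging Face `THUDM/LongBench` dataset fields and the JSONL prediction layout scored by the repo's existing evaluation script; `CHUNK_SIZE` and the character-based splitting are placeholder choices, not anything from the official repo):

```python
import json

from datasets import load_dataset

CHUNK_SIZE = 2048  # characters per chunk; a token-based splitter would also work


def run_memory_eval(wrapper: MemoryWrapper, task: str, out_path: str) -> None:
    # Illustrative only: field names follow the Hugging Face THUDM/LongBench
    # dataset and the pred-file layout consumed by the existing eval script.
    data = load_dataset("THUDM/LongBench", task, split="test")
    with open(out_path, "w", encoding="utf-8") as f:
        for sample in data:
            wrapper.reset()  # fresh memory for every sample
            context = sample["context"]
            # Stream the long context incrementally instead of one giant prompt.
            for start in range(0, len(context), CHUNK_SIZE):
                wrapper.feed(context[start:start + CHUNK_SIZE])
            pred = wrapper.answer(sample["input"])
            # One JSON object per line so the standard scoring tools can read it.
            f.write(json.dumps({
                "pred": pred,
                "answers": sample["answers"],
                "all_classes": sample["all_classes"],
                "length": sample["length"],
            }, ensure_ascii=False) + "\n")
```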
This would enable direct benchmarking of A-Mem, Mem0 and similar frameworks alongside vanilla long-context LLMs, without extra glue code.
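For illustration, a toy wrapper might look like the sketch below; the `llm_fn` callable and the keyword-overlap retrieval are placeholders for a real framework's memory store and search:

```python
class NaiveRetrievalWrapper(MemoryWrapper):
    """Purely illustrative adapter; an A-Mem or Mem0 integration would call
    the framework's own add/search APIs instead of this naive retrieval."""

    def __init__(self, llm_fn, top_k: int = 4):
        self.llm_fn = llm_fn       # user-supplied callable: prompt -> completion string
        self.top_k = top_k
        self.chunks: list[str] = []

    def reset(self):
        self.chunks.clear()

    def feed(self, chunk: str):
        self.chunks.append(chunk)  # a real framework would update its memory here

    def answer(self, query: str) -> str:
        # Rank stored chunks by word overlap with the query as a stand-in
        # for proper memory retrieval, then let the model answer.
        ranked = sorted(
            self.chunks,
            key=lambda c: len(set(query.lower().split()) & set(c.lower().split())),
            reverse=True,
        )
        context = "\n".join(ranked[: self.top_k])
        return self.llm_fn(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```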
Great feature! You’re welcome to submit a PR for it.