goodai-ltm-benchmark
A library for benchmarking the long-term memory and continual learning capabilities of LLM-based agents. It includes all the tests and code you need to evaluate your own agents. See more in the blogpost:
- Show run duration in detailed report
- Compute LTM score from aced tests

**DISCLAIMER**: I don't find this way of computing the LTM score satisfying. I am not even...
Adding the flag `-i` / `--isolated` runs the benchmark's tests independently and without distractors, with two exceptions:

1. Short filler messages are added when a callback is registered, until the...
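To make the effect of the flag concrete, here is a minimal sketch of how a boolean `-i` / `--isolated` switch could be declared and used to decide whether tests run together or one at a time. Only the flag names come from the text above. `build_parser`, `plan_run`, and the test names are hypothetical and do not reflect the benchmark's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the flag names -i / --isolated come from the notes.
    parser = argparse.ArgumentParser(description="Hypothetical benchmark runner")
    parser.add_argument(
        "-i",
        "--isolated",
        action="store_true",
        help="Run each test independently and without distractors",
    )
    return parser


def plan_run(test_names: list[str], isolated: bool) -> list[list[str]]:
    # Assumption: a "session" is a group of tests sharing one conversation.
    # Isolated mode gives every test its own session, so no other test
    # can act as a distractor for it.
    if isolated:
        return [[name] for name in test_names]
    # Default mode: all tests are interleaved in a single session.
    return [list(test_names)]


if __name__ == "__main__":
    args = build_parser().parse_args()
    # Placeholder test names, used for illustration only.
    for session in plan_run(["names_list", "restaurant"], args.isolated):
        print(session)
```

With `-i`, the sketch prints one single-test session per line. Without it, it prints one session containing both tests.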
32k results for Claude have some wrong or interesting evaluations:

- It fails Names List because it also remembers names from the previous repetition.
- In the restaurant task, because...
Data for:

- LTMAgent 1 with gpt-4o-mini (everything)
- gpt-4o 32k # 2
- gpt-4-turbo 32k # 2
- gpt-4o 120k # 2