goodai-ltm-benchmark issues

Results 12 goodai-ltm-benchmark issues

New ltm score

- Show run duration in detailed report
- Compute LTM score from aced tests

**DISCLAIMER**: I don't find this way of computing the LTM score satisfying. I am not even...

dcasbol

Re-compute speed metrics from master log

dcasbol

Results and reports of tests in isolation.

dcasbol

Adding the flag `-i` / `--isolated` runs the benchmark's tests independently and without distractors, with two exceptions:

1. Short filler messages are added when a callback is registered, until the...

dcasbol

1k and 32k results

32k results for Claude have some wrong or interesting evaluations:

- It fails Names List because it also remembers names from the previous repetition.
- In the restaurant task, because...

dcasbol

memgpt interface

JosephDavidsonKSWH

Interface for agents through a FIFO pipe.

dcasbol

Leverage sampling to build a boxplot. Include better estimations of standard deviations.

dcasbol

data for reruns and ltm1 with gpt-4o-mini (missing 200k #2)

Data for:

- LTMAgent 1 with gpt-4o-mini (everything)
- gpt-4o 32k #2
- gpt-4-turbo 32k #2
- gpt-4o 120k #2

JosephDavidsonKSWH


More hf hacks

Separated tasks
