
Unable to reproduce locomo eval scores locally

Open NITHISHM2410 opened this issue 7 months ago • 12 comments

Hi, I have been trying to reproduce the LoCoMo metric scores. I see that for evaluation you used the mem0 platform instead of mem0.memory.main.Memory. To evaluate without the mem0 platform, I replaced all the mem0.client.main.MemoryClient methods like add and search in evaluation/src/memzero/add.py and evaluation/src/memzero/search.py with mem0.memory.main.Memory's add and search methods. I also replaced the MemoryClient module with the Memory module.
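For illustration, here is roughly what the swap looks like (a simplified sketch, not my exact evaluation code; the conversation content and user_id are placeholders):

```python
from mem0 import Memory  # open-source Memory instead of mem0.client.main.MemoryClient

# The evaluation scripts originally create a hosted client: client = MemoryClient()
# Locally I construct the open-source Memory instead (default config shown here).
memory = Memory()

# Adding a conversation chunk (mirrors the add calls in evaluation/src/memzero/add.py)
messages = [
    {"role": "user", "content": "Melanie: I went hiking with my dog last weekend."},
    {"role": "assistant", "content": "Caroline: That sounds fun! Where did you go?"},
]
memory.add(messages, user_id="melanie")

# Searching memories for a question (mirrors evaluation/src/memzero/search.py)
results = memory.search("What did Melanie do last weekend?", user_id="melanie")

# Depending on the mem0 version, search returns either a list or a dict with a "results" key.
hits = results["results"] if isinstance(results, dict) else results
for hit in hits:
    print(hit["memory"])
```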

By doing this, I'm able to add and search memories from locomo dataset and I'm also able to get responses for the questions in locomo dataset as well.

My issue is that the scores I get in my evaluation are significantly lower than the ones reported in the paper. Can you help me find what in my code results in such low performance?

You can find my modified code here

Commands:
1. To add memories and answer questions: `python -m evaluation.run_experiments --technique_type mem0`
2. To evaluate: `python -m evaluation.evals --input_file results.json`
3. To get scores: `python -m evaluation.generate_scores`

Thanks a lot!

NITHISHM2410 avatar May 26 '25 09:05 NITHISHM2410

Hi @NITHISHM2410, on the platform we have made some improvements to addition and search, which is why the scores are lower when you use the open-source mem0.

To be specific, on the platform, we use:

  1. Contextual ADD: https://docs.mem0.ai/platform/features/contextual-add
  2. Custom Instructions: https://docs.mem0.ai/platform/features/custom-instructions
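For reference, a rough sketch of using these two features with the hosted client (based on the linked docs; the exact parameter names, e.g. `version="v2"` for contextual add and `update_project(custom_instructions=...)`, should be verified against the current API):

```python
import os
from mem0 import MemoryClient  # hosted-platform client used by the evaluation scripts

client = MemoryClient(api_key=os.environ["MEM0_API_KEY"])

# Custom instructions: project-level guidance for what the platform should extract as memories.
client.update_project(
    custom_instructions="Extract personal facts, preferences, events, and dates about each speaker."
)

# Contextual add: the platform tracks earlier messages for the same user,
# so only the new turns need to be sent with each call.
messages = [
    {"role": "user", "content": "I adopted a puppy named Biscuit last month."},
]
client.add(messages, user_id="melanie", version="v2")
```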

prateekchhikara avatar Jun 09 '25 18:06 prateekchhikara

@prateekchhikara Thank you for your explanation, but even following the paper and the benchmark setup, we are unable to reproduce the accuracy you mentioned. Do you plan to release verifiable, accurate information so that the community can reproduce the significant improvements you claim?

dohooo avatar Jun 15 '25 08:06 dohooo

> (quoting @dohooo's comment above)

+1

keranlee avatar Jun 23 '25 03:06 keranlee

By the way, about the experiment setup, the paper says: "In our experimental evaluation, we configured the system with ‘m’ = 10 previous messages for contextual reference and ‘s’ = 10 similar memories for comparative analysis. All language model operations utilized GPT-4o-mini as the inference engine. The vector database employs dense embeddings to facilitate efficient similarity search during the update phase."

In the evaluation scripts, however, m (batch_size) seems to be set to 2 and s (top_k) to 30.

How can we reproduce the metrics from the paper?
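If the goal is to match the paper's stated configuration, a hypothetical adjustment would look like the snippet below (the names `batch_size` and `top_k` come from this thread, not from checking the current scripts, and the search parameter may be called `limit` in the open-source API):

```python
# Hypothetical knobs in evaluation/src/memzero/add.py and search.py,
# aligned with the setup reported in the paper (m = 10, s = 10).
BATCH_SIZE = 10  # 'm': number of previous messages grouped together when adding
TOP_K = 10       # 's': number of similar memories retrieved during search/update

# e.g. when searching (parameter name is an assumption; verify against the client used):
# results = client.search(query, user_id=user_id, top_k=TOP_K)
```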

keranlee avatar Jun 23 '25 07:06 keranlee

> (quoting @NITHISHM2410's original post above)

I am also trying to reproduce the open-source mem0 results on LoCoMo. Thank you for sharing your code. Could you share your evaluation results? I want to reproduce this with a smaller model such as Llama 3. Can I use vLLM to provide an OpenAI-compatible service endpoint for testing?
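For context, the setup I have in mind is roughly the following: serve the model behind vLLM's OpenAI-compatible API and point the open-source mem0 config at it. This is only a sketch; the `openai_base_url` key and the `huggingface` embedder provider are assumptions to check against the mem0 OSS docs:

```python
# First, start vLLM's OpenAI-compatible server, e.g.:
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8000
import os
from mem0 import Memory

os.environ["OPENAI_API_KEY"] = "EMPTY"  # vLLM does not validate the key by default

config = {
    "llm": {
        "provider": "openai",
        "config": {
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "openai_base_url": "http://localhost:8000/v1",  # vLLM endpoint
        },
    },
    "embedder": {
        # A local embedder so nothing else depends on OpenAI; provider/model are assumptions.
        "provider": "huggingface",
        "config": {"model": "sentence-transformers/all-MiniLM-L6-v2"},
    },
}

memory = Memory.from_config(config)
memory.add([{"role": "user", "content": "Just a test memory."}], user_id="test")
print(memory.search("test", user_id="test"))
```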

Siegfried-qgf avatar Jul 16 '25 09:07 Siegfried-qgf

Hello, I have run tests based on your code using a 1024-dimensional embedding model and Qwen3-235B-A22B as the LLM. I made some modifications to the memory update logic. The LoCoMo results are as follows. The scores seem relatively low, so I would like to ask what results you obtained in your own tests:

Mean Scores Per Category:
          bleu_score  f1_score  llm_score  count
category
1             0.2300    0.2058     0.3298    282
2             0.2956    0.2323     0.2243    321
3             0.2054    0.1483     0.4062     96
4             0.3094    0.2203     0.3746    841

Overall Mean Scores:
bleu_score    0.2855
f1_score      0.2156
llm_score     0.3370
dtype: float64

shenshiqiSSQ avatar Aug 12 '25 01:08 shenshiqiSSQ

Hi! The issue still seems to be relevant. I’m a researcher trying to reproduce the mem0 results locally (on the LoCoMo benchmark), but the local scores are significantly lower than the ones reported in the paper.

Could you please share the exact setup or configuration needed to fully reproduce the reported numbers locally?

Thanks!

@prateekchhikara

xtinkt avatar Oct 04 '25 21:10 xtinkt

Hi, I recently ran your evaluation (https://github.com/mem0ai/mem0/tree/main/evaluation) using the Mem0 platform (by passing MEM0_API_KEY) with GPT-4o-mini, and got the following results:

Mean Scores Per Category:
          bleu_score  f1_score  llm_score  count
category
1             0.1689    0.2568     0.4504    282
2             0.4416    0.5361     0.4361    321
3             0.1831    0.2271     0.4479     96
4             0.2924    0.3541     0.5208    841

Overall Mean Scores:
bleu_score    0.2941
f1_score      0.3663
llm_score     0.4857

However, the performance seems lower than what was reported in the paper.

Could you please share how to correctly reproduce the results from the paper? Thanks a lot for your help!

jisuozhao avatar Oct 23 '25 06:10 jisuozhao

This issue has been open since May asking for support on reproducing the claimed LoCoMo benchmark results. Since these results are a big part of the method's "claim to fame", with the paper abstract stating "Empirical results show that our methods consistently outperform all existing memory systems [...]", I would greatly appreciate some support here too.

@prateekchhikara - you pointed to a mismatch between the implementation of the cloud solution and the OSS one. Maybe this could be made clearer, because to @NITHISHM2410 it obviously wasn't clear that different performance between the mem0-oss and mem0-cloud systems is to be expected. I don't fault him, because the naming is the same.

Furthermore, @jisuozhao demonstrated that even with the Mem0 platform the results seem off compared to the paper.

There obviously is big interest in this project and the mem0 system in general, I am sure in large part because of the claimed performance. Seeing issues about reproducibility barely addressed leaves a sour taste in my mouth - this should be something to take pride in, not to ignore.

cc: https://github.com/mem0ai/mem0/issues/3667

LevinFaber avatar Nov 08 '25 19:11 LevinFaber

I get lower performance (platform, free tier, not run locally)

Mean Scores Per Category:
          bleu_score  f1_score  llm_score  count
category
1             0.1448    0.2284     0.3582    282
2             0.3471    0.4163     0.4766    321
3             0.1365    0.1764     0.3229     96
4             0.2252    0.2774     0.4233    841

Overall Mean Scores:
bleu_score    0.2303
f1_score      0.2911
llm_score     0.4162
dtype: float64

Category 1: Single Hop, Category 2: Temporal, Category 3: Multi Hop, Category 4: Open Domain

Li-Qingyun avatar Nov 10 '25 06:11 Li-Qingyun

> (quoting @Li-Qingyun's results above)

Hi, you might have mistaken the category mapping of the dataset. The correct classification can be determined from the number of questions per category in the LoCoMo paper. Additionally, my local reproduction results also show a significant discrepancy with the paper in the open_domain category.
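A quick way to double-check the mapping is to count questions per category directly from the dataset. A hypothetical snippet, assuming the locomo10.json layout has a per-sample "qa" list with a "category" field (adjust the path and field names to your copy):

```python
import json
from collections import Counter

# Hypothetical path and field names -- adjust to your local copy of the LoCoMo data.
with open("dataset/locomo10.json") as f:
    samples = json.load(f)

counts = Counter(
    qa["category"]
    for sample in samples
    for qa in sample.get("qa", [])
)
print(counts)  # compare against the per-category question counts in the LoCoMo paper
```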

[image: local reproduction results]

jianminYa avatar Nov 14 '25 16:11 jianminYa

> (quoting the exchange with @jianminYa above)

The category mapping was copied from https://github.com/mem0ai/mem0/issues/2609#issuecomment-2846494304, but you seem to be right, thanks.

Li-Qingyun avatar Nov 14 '25 17:11 Li-Qingyun