adding memory is taking 20 secs
🐛 Describe the bug
Hi all, not sure if this is a bug. Noticed that adding memory according to the self-hosted tutorial in the README is taking ~20 secs. Is it supposed to take this long?
If so, is the expected approach to run the add-memory step asynchronously?
from mem0 import Memory
from ollama import chat
from ollama import ChatResponse
from time import perf_counter

config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "test",
            "host": "localhost",
            "port": 6333,
            "embedding_model_dims": 768,  # Change this according to your local model's dimensions
        },
    },
    "llm": {
        "provider": "ollama",
        "config": {
            "model": "llama3.1:latest",
            "temperature": 0,
            "max_tokens": 2000,
            "ollama_base_url": "http://localhost:11434",  # Ensure this URL is correct
        },
    },
    "embedder": {
        "provider": "ollama",
        "config": {
            "model": "nomic-embed-text:latest",
            # Alternatively, you can use "snowflake-arctic-embed:latest"
            "ollama_base_url": "http://localhost:11434",
        },
    },
}
memory = Memory.from_config(config)
def chat_with_memories(message: str, user_id: str = "default_user") -> str:
    # Retrieve relevant memories
    relevant_memories = memory.search(query=message, user_id=user_id, limit=10)
    print("relevant_memories", relevant_memories)
    memories_str = "\n".join(f"- {entry['memory']}" for entry in relevant_memories["results"])

    # Generate Assistant response
    system_prompt = f"You are a helpful AI. Answer the question based on query and memories.\nUser Memories:\n{memories_str}"
    messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": message}]
    start = perf_counter()
    response: ChatResponse = chat(model='llama3.2', messages=messages)
    elapsed = perf_counter() - start
    print("time elapsed chat llm", elapsed)
    assistant_response = response['message']['content']

    # Create new memories from the conversation
    messages.append({"role": "assistant", "content": assistant_response})
    start = perf_counter()
    memory.add(messages, user_id=user_id)
    elapsed = perf_counter() - start
    print("time elapsed memory add", elapsed)
    return assistant_response
def main():
    print("Chat with AI (type 'exit' to quit)")
    while True:
        user_input = input("You: ").strip()
        if user_input.lower() == 'exit':
            print("Goodbye!")
            break
        print(f"AI: {chat_with_memories(user_input)}")


if __name__ == "__main__":
    main()
When adding memory, mem0 can be slow when calling the LLM. You might want to try a faster model like gpt-4o-mini to see if it improves speed. On my self-hosted setup, adding memory takes around 2 secs. Additionally, you should use async with the memory add operation to prevent blocking execution.
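For reference, if you want to try gpt-4o-mini, the only part of the config that should need to change is the llm block, roughly like this (an untested sketch on my end; it assumes an OpenAI key in your environment and that the rest of the config stays the same):

config["llm"] = {
    "provider": "openai",
    "config": {
        "model": "gpt-4o-mini",  # requires OPENAI_API_KEY to be set in the environment
        "temperature": 0,
        "max_tokens": 2000,
    },
}
memory = Memory.from_config(config)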
Hey @lucas-castalk, are you referring to changing the embedding model or the LLM? The llama 3.2 LLM chat response took 3 secs, but adding memory with nomic-embed-text took 20 secs.
I think you can try testing the response time of your local LLM and embedding model first, to verify that they run as expected.
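Something like this gives a baseline for a single call (a rough sketch reusing the ollama client from your script; adjust the model names to whatever you actually run, and keep in mind memory.add may call the LLM more than once per add):

from time import perf_counter
from ollama import chat, embeddings

# Time one chat completion against the local LLM
start = perf_counter()
chat(model="llama3.1:latest", messages=[{"role": "user", "content": "Say hi."}])
print("single chat call:", perf_counter() - start, "secs")

# Time one embedding call against the local embedding model
start = perf_counter()
embeddings(model="nomic-embed-text:latest", prompt="Say hi.")
print("single embedding call:", perf_counter() - start, "secs")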
I use deepseek-V3-671B as the local LLM (running on 16×A100 with SGLang) and bge-m3 as the local embedding model (running on an A100), but adding an email still costs more than 20s, and search() takes even longer.
Hi :)
I ran into the same problem. The memory.add operation involves calling an LLM-as-a-Judge to decide whether or not your new interaction should be saved as a new memory. This is explained in more detail in the official Mem0 documentation for this method.
Advantage: you get a smart memory-saving approach, which adds the most relevant information to your memory.
Disadvantage: on every add operation, you will always have to wait for an LLM response, which takes a while, and there's not much we can do about it.
Solution
Just call the add operation asynchronously, as @lucas-castalk responded. I did it like this:
Before:
mem0 = Memory.from_config(config=your_config)
mem0.add(f"User: {messages[-1].content}\nAssistant: {response.content}", user_id=user_id)
After (requires creating the Mem0 client asynchronously):
mem0 = await AsyncMemory.from_config(config=your_config)
asyncio.create_task(mem0.add(f"User: {messages[-1].content}\nAssistant: {response.content}", user_id=user_id))
Hope it helped!
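For completeness, here is roughly how it fits together end to end, following my snippet above (just a sketch, not a drop-in for the script in the issue; hold on to the task, or await it before shutdown, so the background add isn't dropped):

import asyncio
from mem0 import AsyncMemory

async def main():
    # `config` stands for the same config dict defined in the script at the top of this issue
    mem0 = await AsyncMemory.from_config(config=config)

    user_message = "I prefer window seats on long flights."
    assistant_reply = "Noted, I'll keep that in mind."

    # Fire-and-forget: the add (and its LLM-as-a-Judge calls) runs in the background
    add_task = asyncio.create_task(
        mem0.add(f"User: {user_message}\nAssistant: {assistant_reply}", user_id="default_user")
    )

    # ... return the reply to the user immediately here ...

    # Before shutting down, make sure the pending add actually finished
    await add_task

asyncio.run(main())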
Thanks for the tip @guilherme-deschamps, I was gonna do the same.
Just one more question: any tips on making a controller for the agent to decide when to tap into LTM? All my STM is handled via a Redis cache, and I use mem0 + pgvector for LTM.
I was just worried that if I put in some sort of judge (basically a similarity score, or a chain to decide) it might slow the response time down, so I was wondering if there is a better similarity judge to determine whether the retrieved context is enough or whether additional LTM will be required.
I mean, it's similar to how we humans take longer to respond to questions we haven't come across in a while, but customers aren't very patient haha
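The kind of cheap "judge" I'm imagining is just a score threshold on the vector search instead of an extra LLM call, roughly like this (a sketch with made-up names; it assumes the search results expose a score field where higher means more relevant, so flip the comparison if your store returns distances):

LTM_SCORE_THRESHOLD = 0.5  # would need tuning for the embedder / pgvector setup

def ltm_context_if_relevant(memory, query: str, user_id: str) -> str:
    """Only pull LTM into the prompt when the vector search looks confident enough."""
    results = memory.search(query=query, user_id=user_id, limit=5)
    # Assumes each result carries a similarity score; adjust if your version differs
    relevant = [r for r in results["results"] if r.get("score", 0.0) >= LTM_SCORE_THRESHOLD]
    if not relevant:
        return ""  # STM (Redis) alone is probably enough for this turn
    return "\n".join(f"- {r['memory']}" for r in relevant)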
Hey @akshatsabavat, sorry for the late reply!
To be honest, I don't have a super academic response about it... I generally just add LTM as context anyway, since having a judge in the middle would affect the response time too much. The options (as I see them) are:
- Call an LLM judge to decide if more context is needed, and IF NEEDED (which might happen often), add LTM as context.
- Just use LTM anyway, which is faster than having a judge, and having (contextually relevant) context shouldn't be bad. Calling RAG is usually faster than calling an LLM (judge), so I just go for it 😅
Also, considering your comparison to human memory, your cache can help you deal with your most popular requests, so you might not even reach the LTM there.
Again, not a very academic response; if you end up going for something else, please tell me about it.
Hey @guilherme-deschamps lmaooo, I'm so sorry for such a late response, I was busy over the summer.
But here is some of the stuff I implemented:
- Short-term memory is always called on request, since it's needed to provide conversation flow from the most recent chat history when talking to a user. This context object is also passed along to the other agents in the system, to make sure everything stays consistent.
- I made an LTM tool that the agent can call (rough sketch below), with specific instructions to use it when the user refers back to old conversations or to things like their booking or flight preferences, so those can then be used across the other tools the agent needs.
Removing the call I was making to STM + LTM together means my requests are now cheaper and faster on average. Later down the line, if I keep working on this, I'm thinking of slowly chipping away at the chat history, storing it as summaries in the LTM for the agent to refer back to, then storing them in a DB and loading them on scroll when the user scrolls up.
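The LTM tool itself is nothing fancy, roughly this shape (placeholder names; my real one is registered as a tool in the agent framework):

def recall_long_term_memory(query: str, user_id: str) -> str:
    """Agent-callable tool: look up older conversations / preferences in LTM (mem0 + pgvector)."""
    # `ltm` is the mem0 Memory instance backed by pgvector
    results = ltm.search(query=query, user_id=user_id, limit=5)
    memories = [r["memory"] for r in results["results"]]
    if not memories:
        return "No relevant long-term memories found."
    return "Relevant long-term memories:\n" + "\n".join(f"- {m}" for m in memories)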
@wishfay You're attempting to store a long email in memory, which may take longer than usual and cause latency. To reduce latency, consider using async, and for improved overall performance, use the mem0 platform.
@shon3005 You can try using the async version or try changing the embedding model as suggested by @lucas-castalk. This will significantly improve the results. You can also use the Mem0 platform to overcome this.
@guilherme-deschamps Good catch and suggestion. RAG and memory are different aspects of the application. @akshatsabavat Currently we do not offer STM with the OSS, but this will be rolled out in the future. You can implement a hybrid solution as suggested.
Nice idea @akshatsabavat, thanks for sharing, I might use your approach for the LTM in the future too! hehe :)
Lately I've been trying to reduce agent memory usage as much as possible. In the projects I've been working on, no-memory, highly specialized agents have been working well for automating small parts of workflows! The suggestion came from this article, and it made a lot of sense to me.
Btw, if you (or anyone) would like to talk anytime, here is my LinkedIn; we could connect there!
@parshvadaftari just came here to thank you! I just received an email notifying me that in the next versions of mem0, the memory-addition operation will be asynchronous by default! The reference to the change is here, and it's a great change. Thank you :)
Hey @guilherme-deschamps, thanks for the compliment, and yes, I shot you a connection request on LinkedIn. My GitHub is filled with PRs from my LAB lmao, so it's hard for me to follow and check.