(De)Serializing memories in TimeWeightedVectorStore
Feature request
I propose refactoring the TimeWeightedVectorStore to use epoch milliseconds as the time format instead of directly passing in a Python `datetime` object as the `created_at` and `last_accessed_at` fields.
The current implementation does allow users to specify a `current_time` when adding documents, but on retrieval it always uses `datetime.now()` to rescore docs.
Adding:

```python
def add_documents(self, documents: List[Document], **kwargs: Any) -> List[str]:
    """Add documents to vectorstore."""
    current_time = kwargs.get("current_time", datetime.now())
    # Avoid mutating input documents
    dup_docs = [deepcopy(d) for d in documents]
    for i, doc in enumerate(dup_docs):
        if "last_accessed_at" not in doc.metadata:
            doc.metadata["last_accessed_at"] = current_time
        if "created_at" not in doc.metadata:
            doc.metadata["created_at"] = current_time
        doc.metadata["buffer_idx"] = len(self.memory_stream) + i
    self.memory_stream.extend(dup_docs)
    return self.vectorstore.add_documents(dup_docs, **kwargs)
```
Retrieval:

```python
def get_relevant_documents(self, query: str) -> List[Document]:
    """Return documents that are relevant to the query."""
    current_time = datetime.now()
    docs_and_scores = {
        doc.metadata["buffer_idx"]: (doc, self.default_salience)
        for doc in self.memory_stream[-self.k :]
    }
    # If a doc is considered salient, update the salience score
    docs_and_scores.update(self.get_salient_docs(query))
    rescored_docs = [
        (doc, self._get_combined_score(doc, relevance, current_time))
        for doc, relevance in docs_and_scores.values()
    ]
    rescored_docs.sort(key=lambda x: x[1], reverse=True)
    result = []
    # Ensure frequently accessed memories aren't forgotten
    current_time = datetime.now()
    for doc, _ in rescored_docs[: self.k]:
        # TODO: Update vector store doc once `update` method is exposed.
        buffered_doc = self.memory_stream[doc.metadata["buffer_idx"]]
        buffered_doc.metadata["last_accessed_at"] = current_time
        result.append(buffered_doc)
    return result
```
What's the problem with this? The Python `datetime` objects it stores in metadata are not JSON (de)serializable.
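A quick illustration of the failure (a minimal, hypothetical snippet, not code from the library):

```python
import json
from datetime import datetime

# Metadata shaped like what add_documents stores on each Document
metadata = {
    "created_at": datetime.now(),
    "last_accessed_at": datetime.now(),
    "buffer_idx": 0,
}

# Raises TypeError: Object of type datetime is not JSON serializable
json.dumps(metadata)
```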
Motivation
While this class works fine in the example using a local FAISS vectorstore, it requires you to define custom JSON encoders and decoders if you want to persist it to disk or instantiate it from disk. It also doesn't work with the Redis vectorstore, because that serializes docs to JSON before storing them.
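For context, this is roughly the kind of workaround currently needed when persisting locally. The names here are illustrative, not part of langchain:

```python
import json
from datetime import datetime

class DateTimeEncoder(json.JSONEncoder):
    """Hypothetical workaround: encode datetime metadata as ISO 8601 strings."""
    def default(self, o):
        if isinstance(o, datetime):
            return o.isoformat()
        return super().default(o)

def datetime_decoder(obj: dict) -> dict:
    """Hypothetical object_hook: turn the ISO strings back into datetimes."""
    for key in ("created_at", "last_accessed_at"):
        if key in obj:
            obj[key] = datetime.fromisoformat(obj[key])
    return obj
```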
Your contribution
There are a couple of options to fix this, the least invasive of which is:
- use epoch milliseconds as the fixed format for storing time values, because it serializes/deserializes cleanly
- convert the times back into Python `datetime` objects in the `_get_hours_passed` function (see the sketch below)

This function is the only place where the timestamps need to be Python `datetime` objects. The alternative fixes involve expanding the API surface area of every vectorstore so that JSON encoders and decoders can be passed through to solve this problem. No one likes prop drilling.
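A minimal sketch of what that could look like, assuming `created_at`, `last_accessed_at`, and the reference time are all stored as epoch milliseconds (the existing function takes `datetime` objects):

```python
from datetime import datetime

def _get_hours_passed(time: float, ref_time: float) -> float:
    """Get hours passed between two timestamps stored as epoch milliseconds.

    Sketch of the proposed change: the epoch-ms values are converted back into
    datetime objects only here, so the rest of the retriever stays unchanged.
    """
    time_dt = datetime.fromtimestamp(time / 1000)
    ref_dt = datetime.fromtimestamp(ref_time / 1000)
    return (time_dt - ref_dt).total_seconds() / 3600
```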