
(De)Serializing memories in TimeWeightedVectorStoreRetriever

EandrewJones opened this issue May 09 '23 · 0 comments

Feature request

I propose refactoring TimeWeightedVectorStoreRetriever to store the created_at and last_accessed_at fields as epoch milliseconds instead of Python datetime objects.

The current implementation lets users pass a current_time when adding documents, but on retrieval it always uses datetime.now() to rescore documents.

The current add_documents implementation:

    def add_documents(self, documents: List[Document], **kwargs: Any) -> List[str]:
        """Add documents to vectorstore."""
        current_time = kwargs.get("current_time", datetime.now())
        # Avoid mutating input documents
        dup_docs = [deepcopy(d) for d in documents]
        for i, doc in enumerate(dup_docs):
            if "last_accessed_at" not in doc.metadata:
                doc.metadata["last_accessed_at"] = current_time
            if "created_at" not in doc.metadata:
                doc.metadata["created_at"] = current_time
            doc.metadata["buffer_idx"] = len(self.memory_stream) + i
        self.memory_stream.extend(dup_docs)
        return self.vectorstore.add_documents(dup_docs, **kwargs)

The current get_relevant_documents implementation:

    def get_relevant_documents(self, query: str) -> List[Document]:
        """Return documents that are relevant to the query."""
        current_time = datetime.now()
        docs_and_scores = {
            doc.metadata["buffer_idx"]: (doc, self.default_salience)
            for doc in self.memory_stream[-self.k :]
        }
        # If a doc is considered salient, update the salience score
        docs_and_scores.update(self.get_salient_docs(query))
        rescored_docs = [
            (doc, self._get_combined_score(doc, relevance, current_time))
            for doc, relevance in docs_and_scores.values()
        ]
        rescored_docs.sort(key=lambda x: x[1], reverse=True)
        result = []
        # Ensure frequently accessed memories aren't forgotten
        current_time = datetime.now()
        for doc, _ in rescored_docs[: self.k]:
            # TODO: Update vector store doc once `update` method is exposed.
            buffered_doc = self.memory_stream[doc.metadata["buffer_idx"]]
            buffered_doc.metadata["last_accessed_at"] = current_time
            result.append(buffered_doc)
        return result

What's the problem with this? datetime.now() returns a datetime object, and datetime objects are not JSON (de-)serializable.
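A minimal sketch of the failure, alongside the epoch-milliseconds alternative proposed here:

```python
import json
from datetime import datetime, timezone

# json.dumps cannot handle datetime objects out of the box.
try:
    json.dumps({"created_at": datetime.now(timezone.utc)})
except TypeError as err:
    print(err)  # datetime is not JSON serializable

# Epoch milliseconds are plain integers, so they round-trip through JSON.
epoch_ms = int(datetime.now(timezone.utc).timestamp() * 1000)
restored = json.loads(json.dumps({"created_at": epoch_ms}))
assert restored["created_at"] == epoch_ms
```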

Motivation

While this class works fine in the example using a local FAISS vectorstore, persisting to or instantiating from local storage requires you to define custom JSONEncoder and JSONDecoder classes. It also doesn't work with the Redis vectorstore, which serializes documents to JSON before storing them.
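For context, the workaround today looks roughly like this — a hypothetical encoder/decoder pair (the class and function names are illustrative, not part of LangChain):

```python
import json
from datetime import datetime


class DatetimeEncoder(json.JSONEncoder):
    """Hypothetical workaround: tag datetimes so they survive JSON."""

    def default(self, o):
        if isinstance(o, datetime):
            return {"__datetime__": o.isoformat()}
        return super().default(o)


def datetime_decoder(d):
    # Reverse the tagging applied by DatetimeEncoder.
    if "__datetime__" in d:
        return datetime.fromisoformat(d["__datetime__"])
    return d


meta = {"created_at": datetime(2023, 5, 9, 2, 5)}
blob = json.dumps(meta, cls=DatetimeEncoder)
restored = json.loads(blob, object_hook=datetime_decoder)
assert restored == meta
```

Every call site that touches persistence has to be told about these classes, which is exactly the prop-drilling problem described below.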

Your contribution

There are a couple of options to fix this, the least invasive of which is:

  • use epoch milliseconds as the fixed format for stored time values, since plain integers serialize/deserialize cleanly
  • convert the timestamps back into Python datetime objects inside the _get_hours_passed function

_get_hours_passed is the only place where the timestamps need to be Python datetime objects. The alternative fixes would expand the API surface of every affected vectorstore by threading JSON encoders and decoders through to solve this problem. No one likes prop drilling.

EandrewJones · May 09 '23 02:05