Qdrant: Ability to add vectors to already existing collection & set vector size

Open hopkins385 opened this issue 1 year ago • 1 comments

Problem: Whenever we add vectors to an existing Qdrant collection through, for example, VectorstoreIndexCreator(), the collection is deleted and re-created from scratch.

This PR solves this issue and adds further automation options without breaking the existing behavior.

In addition, this PR adds the ability to configure a fixed vector size. This removes the need to compute a single embedding just to determine the vector size, which reduces API usage and therefore costs for end users.

So this PR introduces the following two flags:

recreate_collection: Optional[bool] (default: True)

vector_size: Optional[int] (default: None)
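The intent behind vector_size can be illustrated with a minimal sketch. This is not the code from this PR or from LangChain; `resolve_vector_size` and `CountingEmbedder` are hypothetical names used only to show that an explicitly configured size avoids the extra embedding call.

```python
from typing import Callable, List, Optional


class CountingEmbedder:
    """Hypothetical stand-in for an embedding API that counts its calls."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self.calls = 0

    def __call__(self, text: str) -> List[float]:
        self.calls += 1
        return [0.0] * self.dim


def resolve_vector_size(
    embed: Callable[[str], List[float]],
    vector_size: Optional[int] = None,
) -> int:
    """Return the configured size, or probe the embedder once if none is given."""
    if vector_size is not None:
        # Size known up front: no embedding call is needed.
        return vector_size
    # Fallback: embed a throwaway text and measure the result.
    return len(embed("dummy_text"))


embedder = CountingEmbedder(dim=8)
probed = resolve_vector_size(embedder)                      # costs one call
fixed = resolve_vector_size(embedder, vector_size=1536)     # costs no call
```

With the default of None the behavior stays as before, so the flag is opt-in.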

How this solution works:

If recreate_collection=False, the code first checks whether the collection exists. If it does not exist, it continues as before this PR and (re)creates the collection. If the collection exists, recreate_collection=False, and init_from=None, then recreate_collection() is not executed and the existing collection is kept.
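The decision described above can be sketched roughly as follows. This is an illustration only, not the PR's actual diff: `FakeCollectionStore` is a hypothetical in-memory stand-in rather than the real Qdrant client, and `ensure_collection` is an assumed helper name.

```python
from typing import Optional, Set


class FakeCollectionStore:
    """Hypothetical in-memory stand-in for a vector store client."""

    def __init__(self) -> None:
        self.collections: Set[str] = set()
        self.recreate_calls = 0

    def collection_exists(self, name: str) -> bool:
        return name in self.collections

    def recreate_collection(self, name: str) -> None:
        # In a real store this would drop existing data and create an empty collection.
        self.collections.add(name)
        self.recreate_calls += 1


def ensure_collection(
    store: FakeCollectionStore,
    name: str,
    recreate_collection: bool = True,
    init_from: Optional[str] = None,
) -> bool:
    """Return True if the collection was (re)created, False if it was reused."""
    reuse = (
        not recreate_collection
        and init_from is None
        and store.collection_exists(name)
    )
    if reuse:
        return False
    # Default path, identical to the behavior before the PR.
    store.recreate_collection(name)
    return True
```

Calling `ensure_collection(store, "docs", recreate_collection=False)` twice on an empty store creates the collection on the first call and reuses it on the second.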

VectorstoreIndexCreator code example (after PR):

        VectorstoreIndexCreator(
            vectorstore_cls=Qdrant,
            text_splitter=RecursiveCharacterTextSplitter(
                chunk_size=int(os.environ.get('CHUNK_SIZE_TOKENS', 400)),
                separators=sepList,
                chunk_overlap=0,
                length_function=get_tokens,
            ),
            vectorstore_kwargs=dict(
                host=os.environ.get('QDRANT_HOST', 'qdrant'),
                port=int(os.environ.get('QDRANT_PORT', '6333')),
                grpc_port=int(os.environ.get('QDRANT_GRPC_PORT', '6334')),
                # bool() of any non-empty string is True, so compare the text instead
                prefer_grpc=os.environ.get('QDRANT_PREFER_GRPC', 'True').lower() == 'true',
                collection_name=self.request.collection_name,
                recreate_collection=False, # <-- this is new, default value is True
                vector_size=int(os.environ.get('VECTOR_DIMENSIONS', '1536')), # <-- this is new, default value is None
            ),
        ).from_documents(self.documents)

--

Maintainer responsibilities:

  • DataLoaders / VectorStores / Retrievers: @rlancemartin, @eyurtsev

hopkins385 avatar Jul 07 '23 09:07 hopkins385

hey @hopkins385, thanks for pr! looks like some of this functionality was added in #7530. should we update to only add what's not already included in there? cc @kacperlukawski

baskaryan avatar Jul 12 '23 01:07 baskaryan

@hopkins385 Why do you want to set the vector size externally? It should be derived from the embeddings, as far as I know.

kacperlukawski avatar Jul 12 '23 08:07 kacperlukawski

@baskaryan OK, it seems the pain of collection recreation was high enough that two people submitted PRs to get rid of it. Let me check in detail.

@kacperlukawski The current implementation makes an additional API call on each request to determine the vector size before saving the vectors. But in many cases the vector size (= dimensions) is known before the embedding process even starts. For OpenAI embeddings, the vector dimension is 1536. So why perform an API call to determine the dimensions if they are already known? It causes unnecessary costs and embedding delays.

hopkins385 avatar Jul 12 '23 09:07 hopkins385

Works as expected. Will create a new PR for the vector_size (dimensions).

hopkins385 avatar Aug 21 '23 16:08 hopkins385