VectorstoreIndexCreator questions/suggestions
Hi there,
I've been trying out question answering with docs loaded into a vector DB. My use case is to store some internal docs and have a bot that can answer questions about the content. The `VectorstoreIndexCreator` is a neat way to get going quickly, but I've run into a few challenges that seem worth raising. Hopefully some of these are just me missing things, and the suggestions turn out to be questions that can simply be answered.
The first is that if you already have a vector DB (e.g. a local FAISS DB saved by a prior `save_local` call), then there's no easy way to get back to using the abstraction. To work around this I made `VectorStoreIndexWrapper` importable and just loaded it up from an existing FAISS instance, but maybe some more `from_x` methods on `VectorstoreIndexCreator` would be helpful for different scenarios.
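For reference, my workaround looks roughly like this (a sketch; the index path and query string are placeholders, and the wrapper has to be imported from its defining module since it isn't re-exported at the package level):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()

# Reload a FAISS index written earlier with vectorstore.save_local("faiss_index")
vectorstore = FAISS.load_local("faiss_index", embeddings)

# Wrap it so the high-level query helpers are available again
index = VectorStoreIndexWrapper(vectorstore=vectorstore)
response = index.query("What do the internal docs say about onboarding?")
```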
The other thing I've run into is not being able to pass through a `k` value to the `query` or `query_with_sources` methods on `VectorStoreIndexWrapper`. If you follow the setup down, it calls `as_retriever`, but I don't see that it passes through `search_kwargs` to be able to configure that (or pydantic blocks it, at least).
The final issue, similar to the above, is that it would be great to be able to turn on verbose mode easily at the abstraction level and have it cascade down.
If there are better ways to do all of the above I'd love to hear them!
You can use this syntax to pass through `k`:

```python
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=docsearch.as_retriever(search_kwargs={"k": 1}),
)
```
Thank you; however, you cannot do this via `VectorstoreIndexCreator`/`VectorStoreIndexWrapper`. That setup is done automatically, as you can see here.
I know we can set these things up separately, but it would be nice to simply be able to reuse the existing high-level options for very slightly different use cases.
@EAYoshi hi,

```python
index_creator = VectorstoreIndexCreator(
    vectorstore_cls=FAISS,
    embedding=llama_embeddings,
    text_splitter=CharacterTextSplitter(separator="\n", chunk_size=500, chunk_overlap=100),
).from_loaders(PDFs_loader)
```

I need to save the vectors of multiple PDFs in my use case, so I am saving them with `index_creator.vectorstore.save_local('loca')`. Is this correct?

Second, regarding `FAISS.similarity_search(query, k=4)`: how can I use this `k` value with `VectorstoreIndexCreator`? Does `VectorstoreIndexCreator` automatically use FAISS similarity search?
Yeah, this is a bummer. You would think that you would get a vector store you could use as a retriever when using `VectorstoreIndexCreator`.

The way I work around this is to just use the `VectorstoreIndexCreator` to build the `VectorStore` in some out-of-band process. That makes sense, as building a `VectorStore` can be really time consuming when processing a lot of documents. The key is to persist the `VectorStore` to disk.
```python
# Just for creating the vector store. It can't actually be used as a retriever.
VectorstoreIndexCreator(
    vectorstore_cls=Chroma,
    embedding=embeddings,
    vectorstore_kwargs={"persist_directory": "/persistance/directory"},
).from_loaders([loader])
```
Later, in your actual "chain", you just need to load the `VectorStore`:

```python
db = Chroma(persist_directory="/persistance/directory", embedding_function=embeddings)
```
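From there, the loaded store works as a retriever directly; something like this (a sketch, assuming an OpenAI key is configured, along the lines of the `RetrievalQA` snippet above):

```python
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Build a QA chain on top of the reloaded store; k controls how many
# chunks are retrieved per query.
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 4}),
)
answer = qa.run("What do these documents say about X?")
```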
But I agree. These interfaces are very confusing.
@Freyert Kindly, can you explain how to retrieve the `VectorStoreIndexWrapper` from the db?
I stored the vector store in the persist directory:

```python
index = VectorstoreIndexCreator(
    vectorstore_cls=Chroma,
    vectorstore_kwargs={"persist_directory": "/persistance/directory"},
).from_loaders(loaders)
```

Then I accessed the Chroma db:

```python
db = Chroma(persist_directory="/persistance/directory")
```

I want to know how I can retrieve the `VectorStoreIndexWrapper` from the db. The type of the object I want to retrieve is:

```
vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x000001C495717790> <class 'langchain.indexes.vectorstore.VectorStoreIndexWrapper'>
```
I am stuck on this problem as well. I looked at the LangChain code for the vectorstore index here: https://github.com/hwchase17/langchain/blob/master/langchain/indexes/vectorstore.py

I would like to use the `VectorStoreIndexWrapper.query_with_sources` method. Ideally, we could use `vectorstore = Chroma(persist_directory="/persistance/directory")` to load from the persistent index folder after saving, then create `VectorStoreIndexWrapper(vectorstore=vectorstore)`, and everything would be good.

The problem is that `VectorStoreIndexWrapper` cannot be imported (look at `__init__.py` in the indexes folder; you can only import `VectorstoreIndexCreator`). So even with `Chroma` it's useless, because there's no way to create `VectorStoreIndexWrapper` without `VectorstoreIndexCreator`, and `VectorstoreIndexCreator` does not take a `Chroma` instance as an argument in its constructor.

Conclusion: there's no way to load a persistent index saved by `VectorStoreIndexWrapper`, because `VectorstoreIndexCreator` doesn't allow it.
====== Edit ======
My workaround is to copy everything in `vectorstore.py` into a file in your own project, add a `from_persistent_index` method, and call it with the path where you saved the persistent data:

```python
class VectorstoreIndexCreator(BaseModel):
    # Existing code ...

    def from_persistent_index(self, path: str) -> VectorStoreIndexWrapper:
        """Load a vectorstore index from a persistent index."""
        vectorstore = self.vectorstore_cls(persist_directory=path, embedding_function=self.embedding)
        return VectorStoreIndexWrapper(vectorstore=vectorstore)
```
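Note that since it's an instance method, it has to be called on an instance, e.g. `VectorstoreIndexCreator().from_persistent_index("/persistance/directory")`.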
Hello, I was facing the same issues. I was creating the index via `VectorstoreIndexCreator` with this code:

```python
index = VectorstoreIndexCreator(vectorstore_kwargs={"persist_directory": f'{root_dir}/db'}).from_loaders(loaders)
```

After running the above script, it generates some new files with this hierarchy:

```
index/
├── uuid_to_id_f44e0dbd-90b3-467b-9dcb-710ffde8f79b.pkl
├── id_to_uuid_f44e0dbd-90b3-467b-9dcb-710ffde8f79b.pkl
├── index_f44e0dbd-90b3-467b-9dcb-710ffde8f79b.bin
└── index_metadata_f44e0dbd-90b3-467b-9dcb-710ffde8f79b.pkl
```

I already created the custom `vectorstore.py` file that @eduongAZ proposed, but the result is still the same error:

```
NoIndexException: Index not found, please create an instance before querying
```

Am I doing something wrong? Is my index folder structure correct? Thanks!!
Facing the same issue as well. I would like to access the vector store index after persisting it in a vector store like ChromaDB.
You have to store it once first:

```python
index.vectorstore.persist()
```

You could also just monkey patch the `VectorstoreIndexCreator` with the code from @eduongAZ:

```python
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

def from_persistent_index(self, path: str) -> VectorStoreIndexWrapper:
    """Load a vectorstore index from a persistent index."""
    vectorstore = self.vectorstore_cls(persist_directory=path, embedding_function=self.embedding)
    return VectorStoreIndexWrapper(vectorstore=vectorstore)

VectorstoreIndexCreator.from_persistent_index = from_persistent_index
```
Thanks @eduongAZ / @Alf42, worked great for me! I had been stuck on that issue all weekend.
Thanks @eduongAZ and @Alf42 for your help/work on this. However, when I update `vectorstore.py` with the additional method provided under `VectorstoreIndexCreator` and import/attempt to use it as follows:

```python
index = VectorstoreIndexCreator.from_persistent_index('path/to/my/db')
index
```

I get the following error:

```
TypeError: from_persistent_index() missing 1 required positional argument: 'path'
```

Is there something I'm missing in instantiating the class as a proper class?
@briantfriederich It has been a while since I worked on this, so I don't know why we need to explicitly provide the positional argument (it may simply be that `from_persistent_index` is an instance method, so it has to be called on an instance rather than on the class). It would be great if someone could enlighten me. Anyway, this works for me:

```python
persistent_index_path = "path/to/my/db"
index = VectorstoreIndexCreator().from_persistent_index(path=persistent_index_path)
query = "What is something in this document?"
response = index.query_with_sources(query)
```

As @Alf42 has pointed out, you need to have already saved the persistent index before you can load from it. Here's an example of how you can create a `VectorStoreIndexWrapper` and save a persistent index:

```python
persistent_index_path = "path/to/my/db"

# Use whatever loader you want. I am using ObsidianLoader.
obsidian_vault_path = "path/to/my/obsidian/vault"
loader = ObsidianLoader(obsidian_vault_path)

index = VectorstoreIndexCreator(
    vectorstore_kwargs={"persist_directory": persistent_index_path}
).from_loaders([loader])

query = "What is something in this document?"
response = index.query_with_sources(query)
```
@agyson I encountered this problem a couple of times, and I worked around it by simply deleting the whole index folder created by `VectorstoreIndexCreator` and generating the ChromaDB index again with:

```python
index = VectorstoreIndexCreator(
    vectorstore_kwargs={"persist_directory": persistent_index_path}
).from_loaders([loader])
```

Then things work after that.

It's not clear to me what the issues might be. From my intuition, two things stand out as likely causes:

- The index is stored in pickle (`.pkl`) format, which is version dependent and tends to break as things change throughout development. You might need to delete the persistent index and regenerate it after updating LangChain.
- When you try to load the index, you might have provided the wrong path. I don't know how you call the `from_persistent_index` method, so I am not sure what path you are giving it. What works for me is using the same path that you gave `VectorstoreIndexCreator` when you created the persistent index for the first time, so use `f'{root_dir}/db'` as the path you load the persistent index from.
```python
persistent_index_path = "path/to/my/db"
index = VectorstoreIndexCreator().from_persistent_index(path=persistent_index_path)
query = "What is something in this document?"
response = index.query_with_sources(query)
```
I tried this with a multiple-document-query app I built, and I have an interesting observation. The documents it is trained on are annual reports of companies. So, for example, I can ask it about Apple, and since its universe is only the annual report documents, it should not know that apple is a fruit, right?
That's what I get with the normal/unsaved indexing, and I get the same response when I make the "A" of "Apple" lowercase. However, when I save the index and load it, it says "Apple is a fruit" when the input query contains "apple" with a lowercase "a". It only gives the expected response when the "A" is capitalized.
That's quite interesting! Beyond why unsaved vs. saved indexes give different responses, would you have any idea how the saved index is getting access to information from outside the universe of documents it was provided? Because with the unsaved index, if you ask it anything that's not present in the documents it was trained on, it responds with "I don't know" (which is how it should be).
My guess is that perhaps it is not taking the index from the saved folder, and instead maybe it is connecting to the internet (somehow) to answer the queries.
What is the correct way to call it? I tried:

```python
index_saved = VectorstoreIndexCreator().from_persistent_index(path=".")
```

as well as:

```python
index_saved = VectorstoreIndexCreator().from_persistent_index(path="Index/")
```

and:

```python
index_saved = VectorstoreIndexCreator().from_persistent_index(path="index/index_65b17e9f-1705-43dc-b6f7-9336d05fc3b4.bin")
```

But none of them fixes the issue of apple being a fruit.
I think I found the solution with a bit of tinkering. We need to call the `persist` method on the created index once. Then we can load the saved index and it works perfectly:

```python
index = VectorstoreIndexCreator(
    vectorstore_cls=Chroma,
    vectorstore_kwargs={"persist_directory": "/persistance/directory"},
).from_loaders(loaders)
index.vectorstore.persist()

index_saved = VectorstoreIndexCreator().from_persistent_index("/persistance/directory")
query = "What is something in this document?"
response = index_saved.query_with_sources(query)
```
I'm encountering a related issue when using `VectorstoreIndexCreator`. The underlying `ChatOpenAI` model has a default `max_tokens` limit of 256 (the number of output tokens), which is limiting the size of my output. How can I change this so that the output can be larger? I don't see any parameter that would be passed all the way down to the underlying LLM.
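One thing that may work, depending on your LangChain version (I'm assuming here that `query`/`query_with_sources` accept an `llm` argument that is forwarded to the underlying chain; check your installed version):

```python
from langchain.chat_models import ChatOpenAI

# Assumption: newer versions let you pass the LLM at query time instead of
# relying on the default OpenAI model with its 256-token output budget.
llm = ChatOpenAI(max_tokens=1024)
response = index.query("my question", llm=llm)
```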
I don't use `VectorstoreIndexCreator`; instead I use `RetrievalQA` directly and build the vector DB myself (as @nasirus mentioned above): `Chroma(collection_name='name', client=chroma_client, embedding_function=embeddings)`. But my next OpenAI call failed with the common 4097 max-token issue. Since `max_tokens` (in `ChatOpenAI`, @prashbhat, is this what you want?) only controls the output token length, I don't have a good solution for the token-length issue right now. It is maybe a common problem: since we fetch documents from the vector DB, in most cases the documents are pretty large. Any good ideas about this?
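Not a complete fix, but two common mitigations (a sketch; `db` stands for your Chroma store, and the `k` value is something to tune):

```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# 1. Retrieve fewer chunks so the stuffed prompt fits the context window.
retriever = db.as_retriever(search_kwargs={"k": 2})

# 2. Or switch to a chain type that processes chunks in stages instead of
#    stuffing them all into a single prompt.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type="map_reduce",
    retriever=retriever,
)
```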
While using index search in this method:

```python
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

with get_openai_callback() as cb:
    response = index.query(query + conversation)
    print(response)
```
Let's say it gave some response. Now, how do I get the actual input prompt that is sent to the LLM? For example, in this scenario I gave `query + conversation` as input. This method will fetch some references from the embeddings stored in the database, so the actual input will be `{query + conversation + reference}`. How do I get this actual prompt that is sent to the LLM?
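One way to see it (a sketch, assuming a LangChain version that exposes the global `langchain.debug` flag): turning on debug logging prints the fully rendered prompts sent to the LLM.

```python
import langchain

# Assumption: recent versions expose this global flag; it logs the complete
# prompts (including the retrieved context) to stdout as chains run.
langchain.debug = True

response = index.query(query + conversation)
```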
Hi, @EAYoshi.
I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, you opened this issue to discuss challenges with the `VectorstoreIndexCreator` in exploring question answering with documents in a VectorDB. There have been suggestions to add more `from_x` methods for different scenarios, allow passing a `k` value to the query methods, and enable verbose mode at the abstraction level. Several users have shared their experiences and workarounds, including using the `from_persistent_index` method, persisting the index, and modifying the `VectorstoreIndexCreator` class. Other related issues have been raised, such as adjusting the `max_tokens` limit and retrieving the actual input prompt sent to the LLM.
Could you please confirm whether this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!
Why is this package not agnostic? Why is this package dependent on OpenAI? What would be an alternative? The token limit is pushing me away from OpenAI.

Input: `VectorstoreIndexCreator()`
Output: `ValidationError: 1 validation error for OpenAIEmbeddings. Did not find openai_api_key`
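The OpenAI dependency is only the default: `VectorstoreIndexCreator` falls back to `OpenAIEmbeddings` when no embedding is given, which is where that `ValidationError` comes from. Passing your own embedding avoids it; a minimal sketch, assuming `sentence-transformers` is installed:

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.indexes import VectorstoreIndexCreator

# Swap in a local, non-OpenAI embedding model so no openai_api_key is needed
# for indexing. (Querying still defaults to an OpenAI LLM, so you'd also pass
# a different llm at query time, depending on your version.)
index_creator = VectorstoreIndexCreator(embedding=HuggingFaceEmbeddings())
```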