SupabaseVectorStore.from_documents is not working

Open venurawarnasooriya opened this issue 2 years ago • 26 comments

System Info

langchain==0.0.300
supabase==1.1.1

Who can help?

@hwaking @eyurtsev @agola11 @hwchase17

Information

  • [X] The official example notebooks/scripts
  • [X] My own modified scripts

Related Components

  • [ ] LLMs/Chat Models
  • [ ] Embedding Models
  • [ ] Prompts / Prompt Templates / Prompt Selectors
  • [ ] Output Parsers
  • [ ] Document Loaders
  • [X] Vector Stores / Retrievers
  • [ ] Memory
  • [ ] Agents / Agent Executors
  • [ ] Tools / Toolkits
  • [ ] Chains
  • [ ] Callbacks/Tracing
  • [ ] Async

Reproduction

Creation of Supabase client

supabase_url: str = os.environ.get("SUPABASE_URL")  # type: ignore
supabase_key: str = os.environ.get("SUPABASE_SERVICE_KEY")  # type: ignore
supabase_client = create_client(supabase_url, supabase_key)

Text Splitter creation

text_splitter = CharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=0,
)

Embeddings

embeddings = OpenAIEmbeddings()

Loading the document

loader = PyPDFLoader("Alice_in_wonderland2.pdf")
pages = loader.load_and_split()
docs = text_splitter.split_documents(pages)

Save values to Supabase

vector_store = SupabaseVectorStore.from_documents(documents=docs, embedding=embeddings, client=supabase_client)

Error encountered

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\VSCode\Python\langchain project\supabase-try\test.py", line 34, in <module>
    vector_store = SupabaseVectorStore.from_documents(
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\langchain\vectorstores\base.py", line 417, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\langchain\vectorstores\supabase.py", line 147, in from_texts
    cls._add_vectors(client, table_name, embeddings, docs, ids)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\langchain\vectorstores\supabase.py", line 323, in _add_vectors
    result = client.from_(table_name).upsert(chunk).execute()  # type: ignore
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\postgrest\_sync\request_builder.py", line 57, in execute
    r = self.session.request(
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_client.py", line 814, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_client.py", line 901, in send
    response = self._send_handling_auth(
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_client.py", line 929, in _send_handling_auth
    response = self._send_handling_redirects(
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_client.py", line 966, in _send_handling_redirects
    response = self._send_single_request(request)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_client.py", line 1002, in _send_single_request
    response = transport.handle_request(request)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_transports\default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\contextlib.py", line 135, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_transports\default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.WriteTimeout: The write operation timed out

I tried changing the code according to the langchain docs as

vector_store = SupabaseVectorStore.from_documents(
    docs,
    embeddings,
    client=supabase_client,
    table_name="documents",
    query_name="match_documents",
)

Then I encountered the following error

2023-10-05 10:33:29,879:INFO - HTTP Request: POST https://scptrclvtrvcwjdunlrn.supabase.co/rest/v1/documents "HTTP/1.1 404 Not Found"
Traceback (most recent call last):
  File "D:\VSCode\Python\langchain project\supabase-try\test.py", line 34, in <module>
    vector_store = SupabaseVectorStore.from_documents(

I didn't create the documents table in Supabase manually, as I need it to be created automatically by the code. If I do need to create it manually, I need to know the steps for creating it and how to integrate it as well. Please help me immediately.

Expected behavior

SupabaseVectorStore.from_documents works fine and stores all the embeddings in the vector store.

venurawarnasooriya avatar Oct 05 '23 05:10 venurawarnasooriya

🤖

Hello,

Thank you for providing detailed information about the issue you're facing. Based on the context provided, there are a few potential solutions that might help resolve the issue.

Firstly, it seems like you're using an older version of LangChain (langchain==0.0.300). The latest version is v0.0.308. It's possible that the issue you're facing has been resolved in a newer version. I would recommend updating LangChain to the latest version to see if the issue persists.

Secondly, the error message httpx.WriteTimeout: The write operation timed out suggests that the data you're trying to write is too large, causing the operation to time out. You could try reducing the size of the data you're writing to the database. This can be done by modifying the chunk size in the _add_vectors method of the SupabaseVectorStore class. Here's how you can do it:

# According to the SupabaseVectorStore JS implementation, the best chunk size
# is 500
chunk_size = 500  # You can modify this value to a smaller number
id_list: List[str] = []
for i in range(0, len(rows), chunk_size):
    chunk = rows[i : i + chunk_size]

    result = client.from_(table_name).upsert(chunk).execute()  # type: ignore

    if len(result.data) == 0:
        raise Exception("Error inserting: No rows added")

    # VectorStore.add_vectors returns ids as strings
    ids = [str(i.get("id")) for i in result.data if i.get("id")]

    id_list.extend(ids)

By reducing the chunk_size value, you can decrease the amount of data processed at once, which might help avoid the timeout error. However, please note that this will also increase the number of requests made to the Supabase server, which might affect performance. This solution was suggested in a similar issue #7823.

Lastly, the error message HTTP/1.1 404 Not Found suggests that the table you're trying to write to does not exist in your Supabase database. Please ensure that the table name you're trying to write to exists in your Supabase database. If the table doesn't exist, it could lead to a 404 Not Found error. In your case, the table name should be "documents".

I hope this helps! If you're still facing issues, please provide more details about your implementation and the exact error message you're receiving.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

dosubot[bot] avatar Oct 05 '23 05:10 dosubot[bot]

Actually, this bot's answer works for me, along with this document which was recommended by the bot. The code inside the document works well, but when I try to store text chunk embeddings of a PDF it keeps on giving me the issue httpx.WriteTimeout: The write operation timed out.

I updated my langchain version to v0.0.308.

My code is as follows, after some corrections.

Supabase table creation

create table
  documents (
    id uuid primary key,
    content text, 
    metadata jsonb, 
    embedding vector (1536) 
  );
create function match_documents (
  query_embedding vector (1536),
  filter jsonb default '{}'
) returns table (
  id uuid,
  content text,
  metadata jsonb,
  similarity float
) language plpgsql as $$
#variable_conflict use_column
begin
  return query
  select
    id,
    content,
    metadata,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where metadata @> filter
  order by documents.embedding <=> query_embedding;
end;
$$;

Client

supabase_url: str = os.environ.get("SUPABASE_URL")  # type: ignore
supabase_key: str = os.environ.get("SUPABASE_SERVICE_KEY")  # type: ignore
supabase_client = create_client(supabase_url, supabase_key)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0
)
loader = PyPDFLoader('Alice_in_wonderland2.pdf')
data = loader.load_and_split()
docs = text_splitter.split_documents(data)
vectorstore = SupabaseVectorStore.from_documents(docs, embeddings, client=supabase_client, table_name="documents", query_name="match_documents")

But I encountered the same problem as before, because it is not allowing me to upload all the embeddings made from the PDF. I tried reducing the chunk size as well, like this:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=0
)

But I still got the same issue, like this:

Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpcore\_exceptions.py", line 10, in map_exceptions
    yield
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpcore\_backends\sync.py", line 38, in write
    n = self._sock.send(buffer)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\ssl.py", line 1173, in send
    return self._sslobj.write(data)
socket.timeout: The write operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_transports\default.py", line 60, in map_httpcore_exceptions 
    yield
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_transports\default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpcore\_sync\connection_pool.py", line 262, in handle_request    
    raise exc
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpcore\_sync\connection_pool.py", line 245, in handle_request    
    response = connection.handle_request(request)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpcore\_sync\connection.py", line 96, in handle_request
    return self._connection.handle_request(request)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpcore\_sync\http11.py", line 121, in handle_request
    raise exc
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpcore\_sync\http11.py", line 90, in handle_request
    self._send_request_body(**kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpcore\_sync\http11.py", line 144, in _send_request_body
    self._send_event(event, timeout=timeout)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpcore\_sync\http11.py", line 153, in _send_event
    self._network_stream.write(bytes_to_send, timeout=timeout)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpcore\_backends\sync.py", line 39, in write
    buffer = buffer[n:]
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\contextlib.py", line 135, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpcore\_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.WriteTimeout: The write operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\VSCode\Python\langchain project\supabase-try\test2.py", line 82, in <module>
    vectorstore = SupabaseVectorStore.from_documents(
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\langchain\schema\vectorstore.py", line 422, in from_documents      
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\langchain\vectorstores\supabase.py", line 147, in from_texts       
    cls._add_vectors(client, table_name, embeddings, docs, ids)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\langchain\vectorstores\supabase.py", line 323, in _add_vectors     
    result = client.from_(table_name).upsert(chunk).execute()  # type: ignore
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\postgrest\_sync\request_builder.py", line 57, in execute
    r = self.session.request(
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_client.py", line 814, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_client.py", line 901, in send
    response = self._send_handling_auth(
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_client.py", line 929, in _send_handling_auth
    response = self._send_handling_redirects(
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_client.py", line 966, in _send_handling_redirects
    response = self._send_single_request(request)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_client.py", line 1002, in _send_single_request
    response = transport.handle_request(request)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_transports\default.py", line 218, in handle_request
    resp = self._pool.handle_request(req)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\contextlib.py", line 135, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\Users\User\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_transports\default.py", line 77, in map_httpcore_exceptions 
    raise mapped_exc(message) from exc
httpx.WriteTimeout: The write operation timed out

line 82 is:

vectorstore = SupabaseVectorStore.from_documents(docs, embeddings, client=supabase_client, table_name="documents", query_name="match_documents")

@hwaking @eyurtsev @agola11 @hwchase17 Please help me to solve this problem. Please give me a proper solution to upload all the PDF text as embeddings to the Supabase vector store.

venurawarnasooriya avatar Oct 05 '23 14:10 venurawarnasooriya

I have the same problem

Salama1429 avatar Oct 06 '23 05:10 Salama1429

I was seeing the same issue with all sorts of different doc lengths, and after coming across this post I updated everything to the below versions and now it is working: python=3.11.6, supabase=1.2.0, langchain=0.0.314

j1philli avatar Oct 14 '23 10:10 j1philli

I was seeing the same issue with all sorts of different doc lengths, and after coming across this post I updated everything to the below versions and now it is working: python=3.11.6, supabase=1.2.0, langchain=0.0.314

Thanks for replying! I've tried with these versions as well, but I didn't make any progress yet!

venurawarnasooriya avatar Oct 14 '23 10:10 venurawarnasooriya

I updated everything to the below versions and now it is working: python=3.11.6, supabase=1.2.0, langchain=0.0.314

Interestingly enough, I came across one just now that failed with the same error again, but not all of them do!

j1philli avatar Oct 14 '23 21:10 j1philli

I updated everything to the below versions and now it is working: python=3.11.6, supabase=1.2.0, langchain=0.0.314

Interestingly enough, I came across one just now that failed with the same error again, but not all of them do!

Hello @j1philli, I am trying to reproduce this bug. Did you happen to find any patterns in the one that failed versus the ones that did not fail?

JoshuaConcon avatar Oct 19 '23 19:10 JoshuaConcon

The one that is failing is massive. I run into rate limiting from OpenAI on the one that fails vs. the ones that don't. I am taking a massive JSON file, making documents out of each object (29,000-30,000), then running the code shown in the attached screenshots (CleanShot 2023-10-19 at 15 08 46, CleanShot 2023-10-19 at 15 09 24). Let me know how else I can help!
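
Since those screenshots don't survive in text form, here is a minimal sketch of the workflow described above. It is only an illustration: the records.json file name, the JSON field names, and the table_name/query_name values are assumptions, with the client setup matching what's shown earlier in the thread.

import json
import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import SupabaseVectorStore
from supabase import create_client

supabase_client = create_client(
    os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"]
)

# Hypothetical input: one large JSON array with ~29,000-30,000 objects.
with open("records.json") as f:
    records = json.load(f)

# Turn each object into a Document; the field names here are illustrative.
docs = [
    Document(page_content=record["text"], metadata={"source": record.get("id")})
    for record in records
]

embeddings = OpenAIEmbeddings()
vectorstore = SupabaseVectorStore.from_documents(
    docs,
    embeddings,
    client=supabase_client,
    table_name="documents",
    query_name="match_documents",
)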

j1philli avatar Oct 19 '23 21:10 j1philli

Hi, I followed the instructions in https://python.langchain.com/docs/integrations/vectorstores/supabase and did not encounter the error. To create the documents table, you need to run the SQL queries provided in the docs in the SQL Editor on https://supabase.com/, assuming you have already built the database and connected to it. You also need to enable the pgvector extension in the query.

33William33 avatar Oct 20 '23 23:10 33William33

I have also followed these steps, and it works on other tables set up the same way.

j1philli avatar Oct 21 '23 00:10 j1philli

I encountered the same write timeout exception as initially reported by the OP. In my case, I narrowed the problem down to the write calls to Supabase vs. the embedding API: I cached the embeddings and directly invoked add_vectors to try to isolate it. I noticed there's a comment in the code indicating that internally add_vectors chunks by 500 items (based on what worked for JS).

What worked in my case was to send even smaller chunks of data, small being relative. This is not an accurate estimate, but sending around 1.5 MB of total data worked fine, while going past 2 MB fails with a write timeout exception. I hit the write timeout at the ~2 MB mark when I chunk my docs 200 items at a time, with 200 vectors + 200 (smallish) docs being shipped in one call to Supabase.

So my workaround for now is to ship only 100 chunks and 100 vectors in each call to add_vectors.
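
For reference, here is a minimal sketch of that workaround. It is only a sketch: it assumes the supabase_client and docs built earlier in the thread, an add_vectors(vectors, documents, ids) instance method as described above, a SupabaseVectorStore constructor taking client/embedding/table_name/query_name (an assumption based on the from_documents calls in this thread), and an illustrative batch size of 100.

import uuid

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import SupabaseVectorStore

embeddings = OpenAIEmbeddings()

vector_store = SupabaseVectorStore(
    client=supabase_client,
    embedding=embeddings,
    table_name="documents",
    query_name="match_documents",
)

# Cache the embeddings once up front, so retries only re-send data to Supabase.
texts = [doc.page_content for doc in docs]
vectors = embeddings.embed_documents(texts)

# Upsert in small batches to stay under the write-timeout threshold.
BATCH_SIZE = 100  # assumed batch size, per the workaround above
for i in range(0, len(docs), BATCH_SIZE):
    batch_docs = docs[i : i + BATCH_SIZE]
    batch_vectors = vectors[i : i + BATCH_SIZE]
    # Illustrative ids; the documents table above uses uuid primary keys.
    batch_ids = [str(uuid.uuid4()) for _ in batch_docs]
    vector_store.add_vectors(batch_vectors, batch_docs, batch_ids)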

vasanth-asokan avatar Oct 23 '23 22:10 vasanth-asokan

@danielchalef any ideas regarding what we are seeing?

j1philli avatar Oct 25 '23 00:10 j1philli

So my workaround for now is to ship only 100 chunks and 100 vectors in each call to add_vectors.

Put up a PR! I don't know how to use a local Python package in my code to test it out; if someone can point me in the right direction, that would help.

j1philli avatar Oct 25 '23 00:10 j1philli

I no longer use Supabase and, unfortunately, don't have time to dig in to optimize this. The VectorStore can likely be simplified and improved significantly by refactoring to use the Supabase vec package:

https://supabase.com/vector

danielchalef avatar Oct 25 '23 01:10 danielchalef

I no longer use Supabase and, unfortunately, don't have time to dig in to optimize this. The VectorStore can likely be simplified and improved significantly by refactoring to use the Supabase vec package:

https://supabase.com/vector

No worries, just thought a tag was worth a shot. I'll see if I can comprehend this and integrate vec

j1philli avatar Oct 25 '23 02:10 j1philli

Setting the chunk size smaller worked on my large dataset. Working on a PR.

j1philli avatar Oct 25 '23 07:10 j1philli

PR is up! Made it so chunk_size is configurable in the SupabaseVectorStore call! Hopefully it gets in soon so folks can tinker to get their databases back up and running!

vector_store = SupabaseVectorStore.from_documents(documents=docs, embedding=embeddings, client=supabase_client)

Like this: vector_store = SupabaseVectorStore.from_documents(documents=docs, embedding=embeddings, client=supabase_client, chunk_size=100)

j1philli avatar Oct 26 '23 05:10 j1philli

@venurawarnasooriya @Salama1429 @vasanth-asokan the PR is in the latest langchain version! @venurawarnasooriya make sure to test it out and close the issue if it is resolved for you. For my large dataset that was failing, I set chunk_size to 100.

j1philli avatar Oct 28 '23 18:10 j1philli

Thank you so much @j1philli!!

vasanth-asokan avatar Oct 29 '23 01:10 vasanth-asokan

PR is up! Made it so chunk_size is configurable in the SupabaseVectorStore call! Hopefully it gets in soon so folks can tinker to get their databases back up and running!

vector_store = SupabaseVectorStore.from_documents(documents=docs, embedding=embeddings, client=supabase_client)

Like this: vector_store = SupabaseVectorStore.from_documents(documents=docs, embedding=embeddings, client=supabase_client, chunk_size=100)

In the JS SDK, this option is called upsertBatchSize if I'm not mistaken. Could we change one or the other to make the naming consistent?

pmespresso avatar Nov 05 '23 10:11 pmespresso

In the JS SDK, this option is called upsertBatchSize if I'm not mistaken. Could we change one or the other to make the naming consistent?

That is a good call. Didn't think to check if this was already possible in JS. Probably close enough to release to just straight change it. Will do that today.

j1philli avatar Nov 05 '23 19:11 j1philli

@pmespresso do you have a link to where it is mentioned in JS SDK?

j1philli avatar Nov 05 '23 19:11 j1philli

@j1philli sure thing :)

static async fromDocuments(

/**
     * Creates a new SupabaseVectorStore instance from an array of documents.
     * @param docs The documents to create the instance from.
     * @param embeddings The embeddings to use.
     * @param dbConfig The configuration for the Supabase database.
     * @returns A promise that resolves with a new SupabaseVectorStore instance when the instance has been created.
     */
    static fromDocuments(docs: Document[], embeddings: Embeddings, dbConfig: SupabaseLibArgs): Promise<SupabaseVectorStore>;

https://github.com/langchain-ai/langchainjs/blob/893a45057f2949f5a8ed9da80b4f5cacaf63023e/langchain/src/vectorstores/supabase.ts#L37

/**
 * Interface for the arguments required to initialize a Supabase library.
 */
export interface SupabaseLibArgs {
    client: SupabaseClient;
    tableName?: string;
    queryName?: string;
    filter?: SupabaseMetadata | SupabaseFilterRPCCall;
    upsertBatchSize?: number;
}

the docs: https://js.langchain.com/docs/api/vectorstores_supabase/classes/SupabaseVectorStore

pmespresso avatar Nov 06 '23 06:11 pmespresso

By the way, another thing I just noticed is that the QARetrieval-type operations kept timing out on read operations even with the smaller chunk size (setting a smaller chunk size did help with writes).

In my case, what I'd forgotten to do was actually create an index on my pgvector table. That fixed it instantly (duh). Hope this helps OP, or more likely myself in the future when I inevitably forget the same thing.

pmespresso avatar Nov 06 '23 12:11 pmespresso

Followed up in the PR with the LangChain folks to get their opinion on the rename.

j1philli avatar Nov 08 '23 06:11 j1philli

I encountered the same WriteTimeout issue with SupabaseVectorStore.from_documents. Reducing the chunk size didn't help. However, setting postgrest_client_timeout to None in ClientOptions while creating the Supabase client resolved it. Here's the snippet:

from supabase.lib.client_options import ClientOptions
client_options = ClientOptions(postgrest_client_timeout=None)
supabase = create_client(supabase_url, supabase_key, options=client_options)

Hope it helps

HasnainKhanNiazi avatar Jan 24 '24 12:01 HasnainKhanNiazi

I encountered the same WriteTimeout issue with SupabaseVectorStore.from_documents. Reducing the chunk size didn't help. However, setting postgrest_client_timeout to None in ClientOptions while creating the Supabase client resolved it. Here's the snippet:

from supabase.lib.client_options import ClientOptions
client_options = ClientOptions(postgrest_client_timeout=None)
supabase = create_client(supabase_url, supabase_key, options=client_options)

Hope it helps

This worked for me, thanks!!

Erybor avatar Feb 08 '24 09:02 Erybor

I encountered the same WriteTimeout issue with SupabaseVectorStore.from_documents. Reducing the chunk size didn't help. However, setting postgrest_client_timeout to None in ClientOptions while creating the Supabase client resolved it. Here's the snippet:

from supabase.lib.client_options import ClientOptions
client_options = ClientOptions(postgrest_client_timeout=None)
supabase = create_client(supabase_url, supabase_key, options=client_options)

Hope it helps

Solved my issue.

ar-nadeem avatar Mar 28 '24 15:03 ar-nadeem