
Implement from_documents class method in weaviate VectorStore

Open hsm207 opened this issue 2 years ago • 7 comments

We should implement all abstract methods in VectorStore so that users can use weaviate as the vector store for any use case.

Context:

https://github.com/hwchase17/langchain/blob/763f87953686a69897d1f4d2260388b88eb8d670/langchain/vectorstores/base.py#L104-L113

hsm207 avatar Apr 13 '23 21:04 hsm207

Hi @hsm207 ,

I don't think it needs to be overridden in the subclass. The parent's from_documents only calls the from_texts method, which is already implemented. The page_content and metadata are already Document attributes. It is the case for all other vectorstores (e.g., faiss, milvus, etc.). Is it possible for you to test it and let me know?
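To make the delegation concrete, here is a minimal, self-contained sketch of the pattern. It does not import langchain: `Document`, `VectorStore`, and `InMemoryStore` below are simplified stand-ins written for illustration, and the real base class may differ in its details.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Document:
    """Simplified stand-in for a document: text plus metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class VectorStore(ABC):
    """Simplified stand-in for the abstract base class."""

    @classmethod
    def from_documents(
        cls, documents: List[Document], embedding: Any, **kwargs: Any
    ) -> "VectorStore":
        # Unpack each Document and hand everything to from_texts.
        texts = [d.page_content for d in documents]
        metadatas = [d.metadata for d in documents]
        return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)

    @classmethod
    @abstractmethod
    def from_texts(
        cls,
        texts: List[str],
        embedding: Any,
        metadatas: Optional[List[dict]] = None,
        **kwargs: Any,
    ) -> "VectorStore":
        ...


class InMemoryStore(VectorStore):
    """Hypothetical subclass that only implements from_texts."""

    def __init__(self, texts: List[str], metadatas: List[dict]) -> None:
        self.texts = texts
        self.metadatas = metadatas

    @classmethod
    def from_texts(cls, texts, embedding, metadatas=None, **kwargs):
        return cls(texts, metadatas or [{} for _ in texts])


docs = [Document("hello", {"source": "a.txt"}), Document("world")]
store = InMemoryStore.from_documents(docs, embedding=None)
```

Because the subclass inherits `from_documents`, it works without any override as long as `from_texts` accepts the same keyword arguments.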

Best, Pouyan

Pouyanpi avatar Apr 18 '23 11:04 Pouyanpi

Looks like in from_texts here, we are only feeding the URL into the weaviate client. However, this is how weaviate recommends initializing the client when using their hosted version:

import os

import weaviate

# WEAVIATE_API_KEY and WEAVIATE_URL are your own cluster credentials.
auth_config = weaviate.auth.AuthApiKey(api_key=WEAVIATE_API_KEY)
weaviate_client = weaviate.Client(
    url=WEAVIATE_URL,
    additional_headers={
        'X-OpenAI-Api-Key': os.environ["OPENAI_API_KEY"]
    },
    auth_client_secret=auth_config
)

Not sure what the best way of differentiating kwargs passed here is, maybe doing some popping from the kwargs so they don't get passed downstream once the client is created.
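A rough sketch of the popping idea, assuming the client-related keys are known up front. The key names in `CLIENT_KEYS` and the helper `split_client_kwargs` are hypothetical and only illustrate the approach; they are not part of langchain or the weaviate client.

```python
from typing import Any, Dict, Tuple

# Hypothetical set of keyword arguments that belong to the client.
CLIENT_KEYS = ("weaviate_url", "weaviate_api_key", "additional_headers")


def split_client_kwargs(
    kwargs: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate client options from everything else.

    Works on a copy, so the caller's dict is left untouched.
    """
    remaining = dict(kwargs)
    client_kwargs = {
        key: remaining.pop(key) for key in CLIENT_KEYS if key in remaining
    }
    return client_kwargs, remaining


client_kwargs, downstream_kwargs = split_client_kwargs(
    {"weaviate_url": "http://localhost:8080", "by_text": False}
)
```

Only `downstream_kwargs` would then be forwarded after the client has been created.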

diego-escobedo avatar Apr 19 '23 20:04 diego-escobedo

> Not sure what the best way of differentiating kwargs passed here is, maybe doing some popping from the kwargs so they don't get passed downstream once the client is created.

I suggest creating a dataclass, called something like WeaviateConfig, that contains the possible configs for the weaviate client instead of relying on kwargs. This way, it helps with type checking and intellisense.
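As a sketch of what such a dataclass could look like: the fields and the `client_kwargs` method below are assumptions based on the client arguments shown earlier in this thread, not an existing langchain or weaviate API. Building the auth object from `api_key` is deliberately left out so the example runs without the weaviate package.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class WeaviateConfig:
    """Hypothetical typed container for client settings."""

    url: str
    api_key: Optional[str] = None
    additional_headers: Dict[str, str] = field(default_factory=dict)

    def client_kwargs(self) -> Dict[str, Any]:
        """Return keyword arguments for the client constructor.

        The API key is not converted here; the caller would turn it
        into whatever auth object the client expects.
        """
        kwargs: Dict[str, Any] = {"url": self.url}
        if self.additional_headers:
            kwargs["additional_headers"] = dict(self.additional_headers)
        return kwargs


config = WeaviateConfig(
    url="https://example.invalid",  # placeholder URL
    additional_headers={"X-OpenAI-Api-Key": "placeholder"},
)
```

A typed object like this gives editors something to autocomplete against, and a misspelled field fails at construction time instead of being silently swallowed by `**kwargs`.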

hsm207 avatar Apr 19 '23 20:04 hsm207

@diego-escobedo @hsm207 the weaviate client repo has just merged a config dataclass PR which I am happy to add support for here if there are no objections.

cs0lar avatar Apr 29 '23 10:04 cs0lar

@cs0lar in my opinion, let's wait until that feature makes it to weaviate's official docs to give time for the developers to finalise the interface.

hsm207 avatar May 02 '23 10:05 hsm207

@hsm207 agreed!

cs0lar avatar May 02 '23 11:05 cs0lar

@Pouyanpi

> It is the case for all other vectorstores (e.g., faiss, milvus, etc.). Is it possible for you to test it and let me know?

chroma, pgvector and tair do implement their own from_documents method. Do you know why this is needed?

hsm207 avatar May 15 '23 12:05 hsm207

can confirm that from_documents works fine now.

hsm207 avatar May 31 '23 02:05 hsm207

from_documents doesn't work for me, unfortunately. I double-checked the url and api key. Any ideas?

db = Weaviate.from_documents(
    documents=docs,
    embedding=embeddings,
    weaviate_url=weaviate_cluster_url,
    weaviate_api_key=weaviate_api_key,
    by_text=False,
)

[ERROR] Batch ConnectionError Exception occurred! Retrying in 2s. [1/3]
[ERROR] Batch ConnectionError Exception occurred! Retrying in 4s. [2/3]
[ERROR] Batch ConnectionError Exception occurred! Retrying in 6s. [3/3]
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1283, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1329, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 1077, in _send_output
    self.send(chunk)
  File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/http/client.py", line 999, in send
    self.sock.sendall(data)
  File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 1241, in sendall
    v = self.send(byte_view[count:])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ssl.py", line 1210, in send
    return self._sslobj.write(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: The write operation timed out

widike avatar Jun 08 '23 10:06 widike

from the stack trace, it looks like you have some issue with your machine's ssl config

hsm207 avatar Jun 08 '23 10:06 hsm207

I found out that the error is caused by one specific pdf in docs. If I leave it out it works nicely.
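For anyone hitting something similar, one way to narrow it down is to ingest the inputs one at a time and record which ones fail. The sketch below is generic: `find_failing_items` and `fake_ingest` are hypothetical helpers, and in practice the `ingest` callable would wrap whatever upload call you are using.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def find_failing_items(
    items: Iterable[T], ingest: Callable[[T], None]
) -> List[Tuple[T, Exception]]:
    """Run ingest on each item separately and collect the failures."""
    failures: List[Tuple[T, Exception]] = []
    for item in items:
        try:
            ingest(item)
        except Exception as exc:  # keep going so every bad item is found
            failures.append((item, exc))
    return failures


def fake_ingest(name: str) -> None:
    # Stand-in for a real upload; pretends one file always times out.
    if name == "bad.pdf":
        raise TimeoutError("The write operation timed out")


failures = find_failing_items(["a.pdf", "bad.pdf", "c.pdf"], fake_ingest)
```

Each entry in `failures` pairs the offending item with the exception it raised, which makes it easier to tell one problematic file from a general connection problem.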

widike avatar Jun 08 '23 11:06 widike

whoa, that is strange

hsm207 avatar Jun 08 '23 12:06 hsm207