
from_documents in OpenSearch may not be working as expected.

Open engineer-matsuo opened this issue 1 year ago • 3 comments

Name: langchain Version: 0.0.146

Name: opensearch-py Version: 2.2.0

I started OpenSearch in Docker and followed LangChain's official documentation, but each run creates a new index with a random name and writes the data there. I am not sure whether this is the intended behavior; I had expected multiple documents to be added to a single index.

Also, I get the following error when I specify index_name.

File "/usr/local/lib/python3.10/site-packages/opensearchpy/connection/base.py", line 301, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
opensearchpy.exceptions.RequestError: RequestError(400, 'resource_already_exists_exception', 'index [test_index/GEdIKgfrRO24XoRbcJCeVg ] already exists')

The cause appears to be that client.indices.create(index=index_name, body=mapping) in the from_texts function of opensearch_vector_search.py is always executed, so the call fails whenever the index already exists.
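A minimal sketch of the kind of guard that would avoid this error, assuming the client is an opensearchpy.OpenSearch instance. The helper name ensure_index is hypothetical and is not part of LangChain; it only illustrates checking for the index before creating it, using the indices.exists and indices.create calls from opensearch-py.

```python
def ensure_index(client, index_name, mapping):
    """Create index_name only if it does not exist yet.

    `client` is assumed to be an opensearchpy.OpenSearch instance (or any
    object exposing the same `indices.exists` / `indices.create` methods).
    Returns True if the index was created, False if it already existed.
    """
    if client.indices.exists(index=index_name):
        # Index is already there: skip creation so that new documents
        # can be added to it instead of raising a 400 error.
        return False
    client.indices.create(index=index_name, body=mapping)
    return True
```

With a guard like this, a second ingestion into the same index_name would skip the create call and go straight to adding documents. Note that the check and the create are two separate requests, so two concurrent writers could still race.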

Screenshot 2023-04-25 9 07 45

engineer-matsuo avatar Apr 25 '23 00:04 engineer-matsuo

As a temporary fix, I copied and customized opensearch_vector_search. I removed the step in the from_texts function that tries to create a new index on every call, and modified the _bulk_ingest_embeddings function with reference to elastic_vector_search.py. The problem is probably specific to the OpenSearch integration and seems to have already been fixed in elastic_vector_search.

engineer-matsuo avatar Apr 25 '23 00:04 engineer-matsuo

I ran into the same issue. I cannot push new documents into a previously created index because from_texts always tries to create the index, which throws an exception if it already exists.

Reference:

https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/opensearch_vector_search.py#L535

russellballestrini avatar Apr 25 '23 14:04 russellballestrini

@russellballestrini That is exactly what I am seeing as well. In addition to that problem, the data stored in OpenSearch is typed as float rather than as a vector type, so it is not searchable. Now that I know the cause, I am considering submitting a pull request if possible.
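For illustration, here is a sketch of an index mapping that declares the embedding field as knn_vector, the vector type used by the OpenSearch k-NN plugin. When documents are written to an index that has no explicit mapping, OpenSearch's dynamic mapping infers float for arrays of numbers, which would match the symptom described here, although the thread does not confirm that this is the exact cause. The helper name knn_mapping and the field name vector_field are assumptions made for this example.

```python
def knn_mapping(dimension, vector_field="vector_field"):
    """Build an index body whose embedding field is a k-NN vector.

    `dimension` must match the length of the embedding vectors.
    `vector_field` is a placeholder field name chosen for this sketch.
    """
    return {
        # Enable k-NN search on the index.
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                vector_field: {
                    "type": "knn_vector",
                    "dimension": dimension,
                }
            }
        },
    }
```

The resulting dictionary would be passed as the body of client.indices.create before any documents are ingested, so that the field is not left to dynamic mapping.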

engineer-matsuo avatar Apr 26 '23 14:04 engineer-matsuo

Hi, @engineer-matsuo! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding of the issue, you are experiencing problems with the from_documents function in the OpenSearch vector store: indexes are created with random names, and specifying index_name leads to a duplicate index creation error. The comments discuss potential fixes and workarounds, including modifying the _bulk_ingest_embeddings function. Another user encountered the same issue and linked to the relevant line in opensearch_vector_search.py. You mentioned that you are considering submitting a pull request to address the issue.

Before we proceed, we would like to confirm if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!

dosubot[bot] avatar Sep 17 '23 17:09 dosubot[bot]