[Feature Request]: elasticsearch retriever wanted

Open dl942702882 opened this issue 1 year ago • 10 comments

Feature Description

I want to use an Elasticsearch retriever for keyword retrieval. What do I need? Is there an existing retriever you would suggest?

Reason

Is there an existing retriever you would suggest?

Value of Feature

No response

dl942702882 avatar Apr 03 '24 02:04 dl942702882

Hey @dl942702882, great to see you back! 🚀

Based on the context provided, the LlamaIndex library has a class called ElasticsearchStore that can be used for keyword retrieval in Elasticsearch. This class is designed to connect to an Elasticsearch instance and supports both vector and keyword search.

Here are the key points and requirements for implementing keyword retrieval using this approach:

  1. Elasticsearch Version: The ElasticsearchStore class requires Elasticsearch 8.9.0 or higher. This is crucial as the features and API compatibility might differ in earlier versions.

  2. Dependencies: You need to install the llama-index-vector-stores-elasticsearch and llama-index packages. These can be installed via pip as shown below:

    %pip install llama-index-vector-stores-elasticsearch
    !pip install llama-index
    
  3. Elasticsearch Setup: You can set up Elasticsearch either locally using Docker or through Elastic Cloud. For local setup, the provided Docker command simplifies the process. For Elastic Cloud, you need to sign up for a free trial and configure the connection using the cloud ID, and optionally, an API key, username, and password.

  4. Connecting to Elasticsearch: The ElasticsearchStore class requires parameters such as index_name and optionally es_url, es_cloud_id, es_api_key, es_user, and es_password for connecting to an Elasticsearch instance. Examples for both local and cloud connections are provided.

  5. Keyword Search Implementation: Keyword search runs Elasticsearch's query DSL against text fields. The ElasticsearchStore class abstracts these details behind its query and aquery methods when the query mode is set to VectorStoreQueryMode.TEXT_SEARCH.

  6. Custom Filters and Queries: The context mentions the ability to use custom filters and override the query function for more advanced search capabilities, including keyword searches. This suggests that for keyword retrieval, one might need to customize the query sent to Elasticsearch to match specific keywords.

For keyword search, the ElasticsearchStore class provides two methods: query for synchronous keyword search and aquery for asynchronous keyword search. Here are examples of how to use these methods:

  • For query:

    res = es_store.query(
        VectorStoreQuery(
            query_str="lorem",
            mode=VectorStoreQueryMode.TEXT_SEARCH,
            similarity_top_k=1,
        )
    )
    
  • For aquery:

    res = await es_store.aquery(
        VectorStoreQuery(
            query_str="lorem",
            mode=VectorStoreQueryMode.TEXT_SEARCH,
            similarity_top_k=1,
        )
    )
    

You can find more details about these methods in the test_vector_stores_elasticsearch.py file.

Please note that the ElasticsearchStore class is part of the llama-index-vector-stores-elasticsearch package, which is a separate package from the main llama-index package. Make sure to install both packages to use this class.

I hope this information helps you in implementing keyword retrieval in your project. If you have any more questions or need further clarification, feel free to ask.

dosubot[bot] avatar Apr 03 '24 02:04 dosubot[bot]

I want to use Elasticsearch not for vector search, but for keyword retrieval through its inverted index.
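
For context, a keyword lookup against Elasticsearch's inverted index is usually expressed as a `match` query in the query DSL. The sketch below only builds the request body as a plain dictionary. The helper name `build_keyword_query` and the field name `content` are assumptions for illustration (use whatever field your index maps as analyzed text), and sending the body to a cluster is left to the client of your choice.

```python
import json


def build_keyword_query(query_text: str, field: str = "content", top_k: int = 10) -> dict:
    """Build an Elasticsearch search body for a full-text `match` query.

    `field` is an assumed name for the analyzed text field in the index.
    """
    return {
        "size": top_k,  # number of hits to return
        "query": {
            "match": {
                field: {"query": query_text},
            }
        },
    }


body = build_keyword_query("lorem ipsum", top_k=3)
print(json.dumps(body, indent=2))
```

A body like this can be passed to any Elasticsearch client's search call. No embeddings are involved at query time.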

dl942702882 avatar Apr 04 '24 13:04 dl942702882

How about setting alpha=0.0?

query_engine = index.as_query_engine(
    vector_store_query_mode="hybrid", similarity_top_k=2, alpha=0.0
)

adrian539859 avatar Apr 11 '24 07:04 adrian539859

@dl942702882 From version 0.2.0 of llama-index-vector-stores-elasticsearch you can do keyword retrieval, see https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/ElasticsearchIndexDemo.ipynb
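
A minimal sketch of keyword-only retrieval along the lines of that notebook, assuming `llama-index-vector-stores-elasticsearch` 0.2.0 or later and a reachable Elasticsearch instance. The helper name `build_bm25_retriever`, the index name, and the URL are placeholders rather than part of the library, and `nodes` stands for whatever list of nodes you have already parsed.

```python
def build_bm25_retriever(nodes, index_name="my_bm25_index", es_url="http://localhost:9200", top_k=10):
    """Index `nodes` into Elasticsearch and return a keyword-only (BM25) retriever."""
    # Imported lazily so this module loads even if the packages are not installed.
    from llama_index.core import StorageContext, VectorStoreIndex
    from llama_index.vector_stores.elasticsearch import (
        AsyncBM25Strategy,
        ElasticsearchStore,
    )

    # The retrieval strategy tells the store to rank with BM25 instead of dense vectors.
    store = ElasticsearchStore(
        index_name=index_name,
        es_url=es_url,
        retrieval_strategy=AsyncBM25Strategy(),
    )
    storage_context = StorageContext.from_defaults(vector_store=store)
    index = VectorStoreIndex(nodes, storage_context=storage_context)
    return index.as_retriever(similarity_top_k=top_k)
```

Usage would look like `build_bm25_retriever(nodes).retrieve("joker")`. Depending on your settings, indexing may still invoke the configured embedding model, so check the notebook for the exact setup it uses.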

maxjakob avatar May 16 '24 09:05 maxjakob

@maxjakob Hello, does this require a paid Elasticsearch license? I updated llama-index-vector-stores-elasticsearch, and hybrid search keeps failing with "elasticsearch.AuthorizationException: AuthorizationException(403, 'security_exception', 'current license is non-compliant for [Reciprocal Rank Fusion (RRF)]')"

yzgrfsy avatar May 22 '24 08:05 yzgrfsy

You can also use a 30-day trial version of ES with xpack.license.self_generated.type=trial. See https://docs.llamaindex.ai/en/stable/examples/vector_stores/ElasticsearchIndexDemo/

adrian539859 avatar May 22 '24 08:05 adrian539859

@yzgrfsy Keyword-only retrieval (BM25) is a regular feature of Elasticsearch. Hybrid search (dense vector + BM25 retrieval, combining the results using RRF) is a licensed feature that you can trial with the parameter that @adrian539859 shared or with a trial in Elastic Cloud.
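
To illustrate what "combining the results using RRF" means, here is a small standalone sketch of Reciprocal Rank Fusion in plain Python. It is not Elasticsearch's implementation: the function name, the made-up document IDs, and the constant `k=60` (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["doc_a", "doc_b", "doc_c"]    # keyword ranking (made-up IDs)
vector_hits = ["doc_b", "doc_d", "doc_a"]  # dense-vector ranking (made-up IDs)
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, which is why RRF needs no score normalization between BM25 and vector similarity.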

maxjakob avatar May 22 '24 08:05 maxjakob

@dl942702882 Thank you for your answer.

yzgrfsy avatar May 22 '24 08:05 yzgrfsy

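@maxjakob Thank you for your answer. I have already created the index in ES. Now, following the official LlamaIndex guide at https://docs.llamaindex.ai/en/stable/examples/retrievers/bm25_retriever/, I am creating a custom retriever (combining a vector retriever and a BM25 retriever) to implement a Hybrid Retriever: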

from llama_index.core.retrievers import BaseRetriever


class HybridRetriever(BaseRetriever):
    def __init__(self, vector_retriever, bm25_retriever):
        self.vector_retriever = vector_retriever
        self.bm25_retriever = bm25_retriever
        super().__init__()

    def _retrieve(self, query, **kwargs):
        bm25_nodes = self.bm25_retriever.retrieve(query, **kwargs)
        vector_nodes = self.vector_retriever.retrieve(query, **kwargs)

        # Combine the two lists of nodes, keeping the first occurrence of each node ID.
        all_nodes = []
        node_ids = set()
        for n in bm25_nodes + vector_nodes:
            if n.node.node_id not in node_ids:
                all_nodes.append(n)
                node_ids.add(n.node.node_id)
        return all_nodes

But I have run into a problem: I need to instantiate two retrievers. The vector_retriever is instantiated like this:

vector_store = ElasticsearchStore(index_name=es_index_name, es_url=CONS_ELASTIC_SEARCH_URL)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_vector_store(
    embed_model=Settings.embed_model,
    vector_store=vector_store,
    storage_context=storage_context,
)
vector_retriever = index.as_retriever(similarity_top_k=10)

However, instantiating the bm25_retriever fails with both of the following:

bm25_retriever = BM25Retriever.from_defaults(index=index, similarity_top_k=10)
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=10)

Do you know how to get the nodes from an index that is already saved in ES?

yzgrfsy avatar May 22 '24 09:05 yzgrfsy

@yzgrfsy There is a simpler way to do hybrid retrieval, see the documentation:

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.elasticsearch import (
    AsyncDenseVectorStrategy,
    ElasticsearchStore,
)

hybrid_store = ElasticsearchStore(
    es_url="http://localhost:9200",
    index_name="xyz",
    retrieval_strategy=AsyncDenseVectorStrategy(hybrid=True),
)

storage_context = StorageContext.from_defaults(vector_store=hybrid_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)
retriever = index.as_retriever()
print(retriever.retrieve(query))

This will use dense vector retrieval and BM25 and combine the results using RRF.

maxjakob avatar May 23 '24 14:05 maxjakob