llama_index
[Question]: How to create indexes using local LLM and embedding from a local server path
Question Validation
- [X] I have searched both the documentation and discord for an answer.
Question
I have a server that is not connected to the internet, so I cannot run llama_index in a way that downloads an LLM or embedding model from OpenAI or huggingface.co. All of the provided sample code seems to require an internet connection to download the LLM and embedding files, and there are no examples that read LLM and embedding files from a local server path.
Can anyone tell me how to create indexes using a local LLM and embedding model loaded from a local server path?
🤖
To create indexes using local LLM and embedding files from a local server path with LlamaIndex, you can use the `OpenLLM` class provided in `openllm.py`. The `OpenLLM` class has a `model_id` field which can be either a pretrained ID or a local path; this is synonymous with the first argument of HuggingFace's `.from_pretrained`.
Here is an example of how you can do it:
```python
from llama_index.llms.openllm import OpenLLM

# Initialize the OpenLLM class with the local path to your model
llm = OpenLLM(model_id="/path/to/your/local/model")

# Now you can use the `llm` object to create indexes and perform other operations
```
In the above code, replace "/path/to/your/local/model" with the actual path to your local model. The `OpenLLM` class will load the model from this path, and you can use it to create indexes and perform other operations.
Please note that the server not being connected to the internet will not affect this process, as long as the model files are available locally.
The `OpenLLM` class handles the loading of local models in its constructor (`__init__` method). It uses the `openllm.LLM` class from the `openllm` library to load the model. The model ID, which can be a pretrained ID or a local path, is passed as an argument to the `openllm.LLM` constructor. Other optional parameters such as model version, model tag, prompt template, backend, quantization method, serialization method, and the trust-remote-code flag can also be passed.
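For example, a hedged sketch of passing some of these options (the values below are placeholders chosen from the allowed literals in the signature shown next):

```python
from llama_index.llms.openllm import OpenLLM

# All keyword values are illustrative; pick the ones that match your local model
llm = OpenLLM(
    model_id="/path/to/your/local/model",
    backend="pt",                 # use the PyTorch backend instead of vLLM
    quantize="int4",              # optional quantization method
    serialization="safetensors",  # serialization format of the local weights
    trust_remote_code=True,       # required by some models with custom code
)
```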
Here is the relevant code:
```python
def __init__(
    self,
    model_id: str,
    model_version: Optional[str] = None,
    model_tag: Optional[str] = None,
    prompt_template: Optional[str] = None,
    backend: Optional[Literal["vllm", "pt"]] = None,
    *args: Any,
    quantize: Optional[Literal["awq", "gptq", "int8", "int4", "squeezellm"]] = None,
    serialization: Literal["safetensors", "legacy"] = "safetensors",
    trust_remote_code: bool = False,
    callback_manager: Optional[CallbackManager] = None,
    system_prompt: Optional[str] = None,
    messages_to_prompt: Optional[Callable[[Sequence[ChatMessage]], str]] = None,
    completion_to_prompt: Optional[Callable[[str], str]] = None,
    pydantic_program_mode: PydanticProgramMode = PydanticProgramMode.DEFAULT,
    **attrs: Any,
):
    try:
        import openllm
    except ImportError:
        raise ImportError(
            "OpenLLM is not installed. Please install OpenLLM via `pip install openllm`"
        )
    self._llm = openllm.LLM[Any, Any](
        model_id,
        model_version=model_version,
        model_tag=model_tag,
        prompt_template=prompt_template,
        system_message=system_prompt,
        backend=backend,
        quantize=quantize,
        serialisation=serialization,
        trust_remote_code=trust_remote_code,
        embedded=True,
        **attrs,
    )
```
In this code, `self._llm` is an instance of the `openllm.LLM` class, which represents the loaded model.
You can find more details in the llama_index/llms/openllm.py file in the LlamaIndex repository.
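To tie this back to index creation, here is a rough sketch of wiring a local LLM and a local embedding model into an index. It assumes an older, `ServiceContext`-based release of LlamaIndex (newer releases use the global `Settings` object instead), and all paths are placeholders for files already present on the offline server:

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms.openllm import OpenLLM

# Placeholder paths: model files already present on the offline server
llm = OpenLLM(model_id="/path/to/your/local/llm")
embed_model = HuggingFaceEmbedding(model_name="/path/to/your/local/embedding-model")

# Wire the local models into a ServiceContext so no default (remote) models are used
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

# Build an index entirely offline
documents = SimpleDirectoryReader("/path/to/your/documents").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
```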
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
@Gzj369 dosu's reco on using OpenLLM seems like a good one. Does that at least cover the LLM aspect of your use case? For embeddings, would you be able to use Ollama?
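If Ollama is an option for embeddings, a minimal sketch might look like the following. It assumes an Ollama server is already running on the offline machine with an embedding model (e.g. `nomic-embed-text`) pulled in advance; the exact import path varies between llama_index versions:

```python
from llama_index.embeddings.ollama import OllamaEmbedding

# Assumes a local Ollama server with the embedding model already pulled
embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    base_url="http://localhost:11434",
)

# Quick smoke test: embed a single string
vector = embed_model.get_text_embedding("hello world")
print(len(vector))
```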
Glad to see your answer. What I most want is to use the LLM and embeddings on my local Linux server, because I have already downloaded the corresponding model bin files offline and placed them in a local path on my server.
I have many documents (more than 10,000) to initialize a DocumentSummaryIndex. Currently I use an API to call the LLM, but during initialization an API connection timeout always occurs. What should I do? @nerdai
🤖
To create indexes using local LLM and embeddings from a local server path for more than 10,000 documents without experiencing API connection timeouts, you can use the `ElasticsearchEmbedding` class provided in the LlamaIndex framework.
You can instantiate the `ElasticsearchEmbedding` class using either an existing Elasticsearch connection or Elasticsearch credentials.
Here is an example of how you can do this:
```python
from elasticsearch import Elasticsearch
from llama_index.embeddings import ElasticsearchEmbedding

# Define the model ID and input field name (if different from default)
model_id = "your_model_id"
# Optional, only if different from 'text_field'
input_field = "your_input_field"

# Create Elasticsearch connection
es_connection = Elasticsearch(hosts=["localhost:9200"], basic_auth=("user", "password"))

# Instantiate ElasticsearchEmbedding using the existing connection
embeddings = ElasticsearchEmbedding.from_es_connection(
    model_id,
    es_connection,
    input_field=input_field,
)
```
Or using Elasticsearch credentials:
```python
from llama_index.embeddings import ElasticsearchEmbedding

# Define the model ID and input field name (if different from default)
model_id = "your_model_id"
# Optional, only if different from 'text_field'
input_field = "your_input_field"

embeddings = ElasticsearchEmbedding.from_credentials(
    model_id,
    input_field=input_field,
    es_url="foo",
    es_username="bar",
    es_password="baz",
)
```
To avoid API connection timeouts, you can use Python's asynchronous programming model. This allows you to send multiple requests to the server without waiting for each response, for example via the `_aget_query_embedding` method in the `ElasticsearchEmbedding` class.
However, the context does not provide enough information on how to handle more than 10,000 documents. You might need to split your documents into smaller batches and process each batch separately to avoid overloading the server and experiencing API connection timeouts.
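As an illustrative sketch of that batching idea for a `DocumentSummaryIndex` (hypothetical batch size and placeholder paths; it relies on the generic `insert()` method of LlamaIndex indexes and on the local models discussed above):

```python
import time

from llama_index import DocumentSummaryIndex, ServiceContext, SimpleDirectoryReader
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms.openllm import OpenLLM

# Local models loaded from placeholder paths on the offline server
llm = OpenLLM(model_id="/path/to/your/local/llm")
embed_model = HuggingFaceEmbedding(model_name="/path/to/your/local/embedding-model")
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

documents = SimpleDirectoryReader("/path/to/your/documents").load_data()

batch_size = 100  # hypothetical; tune to what your LLM endpoint can tolerate

# Build the index from the first batch, then insert the remaining documents batch
# by batch so a single timeout does not force a restart of the whole job
index = DocumentSummaryIndex.from_documents(
    documents[:batch_size], service_context=service_context
)
for start in range(batch_size, len(documents), batch_size):
    for doc in documents[start : start + batch_size]:
        index.insert(doc)
    time.sleep(1)  # brief pause between batches to avoid overloading the server
```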
To configure the LlamaIndex framework to use local LLM and embedding files instead of downloading them from the internet, you can specify the `cache_folder` parameter when initializing the `HuggingFaceEmbedding` class. This parameter should point to the directory where your local files are stored.
Here is an example of how to do this:
```python
from llama_index.embeddings import HuggingFaceEmbedding

hugging_face_embedding = HuggingFaceEmbedding(
    model_name='your_model_name',
    tokenizer_name='your_tokenizer_name',
    cache_folder='/path/to/your/local/files'
)
```
In this code, replace 'your_model_name' and 'your_tokenizer_name' with the names of your local model and tokenizer, and replace '/path/to/your/local/files' with the path to the directory where these files are stored.
This works because the `HuggingFaceEmbedding` class uses the `AutoModel.from_pretrained` and `AutoTokenizer.from_pretrained` methods from the `transformers` library to load the model and tokenizer. These methods first check the `cache_dir` parameter (which is set to the value of `cache_folder`) for the specified files, and only download the files from the internet if they are not found in the cache directory.
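Since `from_pretrained` also accepts a local directory directly, an alternative sketch for a fully offline setup is to point `model_name` at the model folder itself (paths are placeholders; the offline environment variables are an optional safety net and must be set before the libraries are imported):

```python
import os

# Optional safety net: forbid transformers / huggingface_hub from attempting downloads.
# Set these before the libraries are imported (or export them in the shell).
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_HUB_OFFLINE"] = "1"

from llama_index.embeddings import HuggingFaceEmbedding

# Point model_name at the directory that already contains config.json,
# the model weights, and the tokenizer files on the offline server
embed_model = HuggingFaceEmbedding(model_name="/path/to/your/local/embedding-model")

print(len(embed_model.get_text_embedding("hello world")))
```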
You can find more information about these classes and methods in the llama_index/embeddings/elasticsearch.py and llama_index/embeddings/huggingface.py files in the LlamaIndex repository.