llm-graph-builder
Optimization for loading big embedding models onto the GPU
The backend code in ~/src/shared/common_fn.py is written so that load_embedding_model() ends up creating two instances of the embedding model in each worker, which means heavy memory usage. An all-MiniLM-L6-v2 instance takes about 90 MB, but for big embedding models such as BAAI/bge-m3, which is about 3 GB (the fp16 Ollama version is about 1 GB), this is a real problem. So I load the embedding model with Ollama instead: a single instance running on the GPU outside the backend container. In the backend container, the IP 172.17.0.1 (the default docker0 bridge gateway, i.e. the host) is mapped to host.docker.internal; for reasons I don't fully understand, I can reach Ollama through the IP but not through the hostname when a proxy is set in the container (possibly the hostname is routed through the proxy; adding host.docker.internal to NO_PROXY might help).
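As a quick sanity check, the snippet below (run from inside the backend container) hits Ollama's /api/embeddings endpoint directly through the bridge IP; the model must already be pulled on the host with "ollama pull bge-m3". This is just a connectivity probe, not part of the patch.

import requests

OLLAMA_URL = "http://172.17.0.1:11434"

resp = requests.post(
    f"{OLLAMA_URL}/api/embeddings",
    json={"model": "bge-m3", "prompt": "connectivity test"},
    timeout=30,
    # If a proxy is configured in the container, either add 172.17.0.1 to
    # NO_PROXY or bypass the environment proxies explicitly:
    # proxies={"http": None, "https": None},
)
resp.raise_for_status()
vector = resp.json()["embedding"]
print(f"OK, got a {len(vector)}-dimensional embedding")  # 1024 for bge-m3

With Ollama reachable, load_embedding_model() in common_fn.py becomes: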
from langchain_ollama import OllamaEmbeddings
# logging, OpenAIEmbeddings, VertexAIEmbeddings and HuggingFaceEmbeddings
# are already imported at the top of common_fn.py

def load_embedding_model(embedding_model_name: str):
    if embedding_model_name == "openai":
        embeddings = OpenAIEmbeddings()
        dimension = 1536
        logging.info(f"Embedding: Using OpenAI Embeddings, Dimension:{dimension}")
    elif embedding_model_name == "vertexai":
        embeddings = VertexAIEmbeddings(
            model="textembedding-gecko@003"
        )
        dimension = 768
        logging.info(f"Embedding: Using Vertex AI Embeddings, Dimension:{dimension}")
    # Added by Jean 2025/01/26: serve bge-m3 from a single Ollama instance
    # on the host GPU instead of loading a copy in every worker
    elif embedding_model_name == "BAAI/bge-m3":
        embeddings = OllamaEmbeddings(model="bge-m3", base_url="http://172.17.0.1:11434")
        dimension = 1024
        logging.info(f"Embedding: Using Ollama BAAI/bge-m3, Dimension:{dimension}")
    else:
        embeddings = HuggingFaceEmbeddings(
            model_name="all-MiniLM-L6-v2"  # , cache_folder="/embedding_model"
        )
        dimension = 384
        logging.info(f"Embedding: Using Langchain HuggingFaceEmbeddings, Dimension:{dimension}")
    return embeddings, dimension
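For reference, a quick way to exercise the new branch from a Python shell in the backend container; nothing else in the callers needs to change, since they only see the (embeddings, dimension) pair:

embeddings, dimension = load_embedding_model("BAAI/bge-m3")
vector = embeddings.embed_query("What is a knowledge graph?")
assert len(vector) == dimension  # 1024 for bge-m3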
These two packages need to be added to the backend's ~/requirements.txt:
langchain-ollama==0.2.1
datasets==3.1.0
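Finally, the new branch is selected through the backend's EMBEDDING_MODEL environment variable; as far as I can tell this is the stock wiring in llm-graph-builder, so treat the default value below as an assumption and check your own .env.

import os

# Sketch of how the backend picks the model: setting
# EMBEDDING_MODEL=BAAI/bge-m3 in the backend .env activates the Ollama branch.
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")  # default is an assumption
embeddings, dimension = load_embedding_model(EMBEDDING_MODEL)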
Best regards, Jean