llm-graph-builder
Optimization for loading big embedding models onto the GPU
The backend code in ~/src/shared/common_fn.py is written so that load_embedding_model() ends up creating two instances of the embedding model in each worker, which means heavy memory usage. An all-MiniLM-L6-v2 instance takes about 90 MB, but for big embedding models such as BAAI/bge-m3, which is about 3 GB (the fp16 Ollama version is about 1 GB), this is a real problem. So I load the embedding model with Ollama instead: a single instance running on the GPU outside the backend container. In the backend container, the IP 172.17.0.1 (the default docker0 bridge gateway, i.e. the host) is mapped to host.docker.internal; for reasons I don't fully understand, I can reach Ollama through the IP but not through the hostname when a proxy is set in the container (possibly the hostname is routed through the proxy; adding host.docker.internal to NO_PROXY might help).
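As a quick sanity check, the snippet below (run from inside the backend container) hits Ollama's /api/embeddings endpoint directly through the bridge IP; the model must already be pulled on the host with "ollama pull bge-m3". This is just a connectivity probe, not part of the patch.

import requests

OLLAMA_URL = "http://172.17.0.1:11434"

resp = requests.post(
    f"{OLLAMA_URL}/api/embeddings",
    json={"model": "bge-m3", "prompt": "connectivity test"},
    timeout=30,
    # If a proxy is configured in the container, either add 172.17.0.1 to
    # NO_PROXY or bypass the environment proxies explicitly:
    # proxies={"http": None, "https": None},
)
resp.raise_for_status()
vector = resp.json()["embedding"]
print(f"OK, got a {len(vector)}-dimensional embedding")  # 1024 for bge-m3

With Ollama reachable, load_embedding_model() in common_fn.py becomes: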
from langchain_ollama import OllamaEmbeddings
# logging, OpenAIEmbeddings, VertexAIEmbeddings and HuggingFaceEmbeddings
# are already imported at the top of common_fn.py

def load_embedding_model(embedding_model_name: str):
    if embedding_model_name == "openai":
        embeddings = OpenAIEmbeddings()
        dimension = 1536
        logging.info(f"Embedding: Using OpenAI Embeddings, Dimension:{dimension}")
    elif embedding_model_name == "vertexai":
        embeddings = VertexAIEmbeddings(
            model="textembedding-gecko@003"
        )
        dimension = 768
        logging.info(f"Embedding: Using Vertex AI Embeddings, Dimension:{dimension}")
    # Added by Jean 2025/01/26: serve bge-m3 from a single Ollama instance
    # on the host GPU instead of loading a copy in every worker
    elif embedding_model_name == "BAAI/bge-m3":
        embeddings = OllamaEmbeddings(model="bge-m3", base_url="http://172.17.0.1:11434")
        dimension = 1024
        logging.info(f"Embedding: Using Ollama BAAI/bge-m3, Dimension:{dimension}")
    else:
        embeddings = HuggingFaceEmbeddings(
            model_name="all-MiniLM-L6-v2"  # , cache_folder="/embedding_model"
        )
        dimension = 384
        logging.info(f"Embedding: Using Langchain HuggingFaceEmbeddings, Dimension:{dimension}")
    return embeddings, dimension
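For reference, a quick way to exercise the new branch from a Python shell in the backend container; nothing else in the callers needs to change, since they only see the (embeddings, dimension) pair:

embeddings, dimension = load_embedding_model("BAAI/bge-m3")
vector = embeddings.embed_query("What is a knowledge graph?")
assert len(vector) == dimension  # 1024 for bge-m3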
These two packages need to be added to the backend's ~/requirements.txt:
langchain-ollama==0.2.1
datasets==3.1.0
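Finally, the new branch is selected through the backend's EMBEDDING_MODEL environment variable; as far as I can tell this is the stock wiring in llm-graph-builder, so treat the default value below as an assumption and check your own .env.

import os

# Sketch of how the backend picks the model: setting
# EMBEDDING_MODEL=BAAI/bge-m3 in the backend .env activates the Ollama branch.
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "all-MiniLM-L6-v2")  # default is an assumption
embeddings, dimension = load_embedding_model(EMBEDDING_MODEL)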
Best regards, Jean