[Bug]: Parallel Embedding is not working on Windows Servers
What happened?
I am trying to encode my dataset with multiple CUDA GPUs, but only one GPU is working.
What is the expected behaviour?
All 4 specified GPUs should be used.
A minimal reproducible example
from fastembed import LateInteractionTextEmbedding

embedding_model = LateInteractionTextEmbedding("jinaai/jina-colbert-v2", cuda=True, device_ids=[0, 1, 2, 3])
descriptions_embeddings = list(embedding_model.embed(documents, parallel=4))
What Python version are you on? e.g. python --version
Python 3.11
FastEmbed version
v0.4.2
What OS are you seeing the problem on?
No response
Relevant stack traces and/or logs
No response
hi @abdelkareemkobo,
parallel=4 does not spread the data across all available GPUs by default; you need to initialize your model with the cuda and device_ids params, like:

LateInteractionTextEmbedding(
    model_name=model_name,
    cuda=args.use_cuda,
    device_ids=device_ids,
    lazy_load=lazy_load,
)
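To spell that out end to end, here is a minimal sketch (assuming four visible CUDA GPUs and that documents is a list of strings, as in the original report):

from fastembed import LateInteractionTextEmbedding

model = LateInteractionTextEmbedding(
    model_name="jinaai/jina-colbert-v2",
    cuda=True,                # use the CUDA execution provider
    device_ids=[0, 1, 2, 3],  # GPUs to spread the work across
    lazy_load=True,           # defer model loading to the worker processes
)

# parallel should equal len(device_ids) so that one worker is spawned per GPU
embeddings = list(model.embed(documents, parallel=4))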
Thanks @joein, but I'm running into the same issue, only on Ubuntu. I'm not able to select the CUDA devices for indexing, and the docs don't make it clear how to index on multiple GPUs on the same machine.
Here is a snippet to reproduce, using Python 3.12:
import time
from dataclasses import dataclass
from typing import Any
import os

from qdrant_client import QdrantClient
from datasets import load_dataset
from fastembed import TextEmbedding


@dataclass
class CollectionItem:
    text: str
    metadata: dict[str, Any] = None

    def __post_init__(self):
        if self.metadata is None:
            self.metadata = {'text': self.text}


@dataclass
class CollectionItemPool:
    items: list[CollectionItem]
    docs: list[str] = None
    metadata: list[dict] = None

    def __post_init__(self):
        if self.docs is None:
            self.docs = [i.text for i in self.items]
        if self.metadata is None:
            self.metadata = [i.metadata for i in self.items]


def prepare_dataset(limit: int = None) -> CollectionItemPool:
    en_ds = load_dataset("allenai/c4", "en", split='train', streaming=True)
    if limit is not None:
        assert isinstance(limit, int), (
            f'`limit` has to be an integer, got {type(limit)}')
    ds = en_ds
    # ds = en_ds.select(range(limit))
    items: list[CollectionItem] = []
    for idx, ds_item in enumerate(ds):
        if idx == limit:
            break
        item = CollectionItem(text=ds_item['text'])
        items.append(item)
    return CollectionItemPool(items=items)


if __name__ == '__main__':
    # setting cuda devices
    # os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

    # Initialize the client
    client = QdrantClient(":memory:")  # or QdrantClient(path="path/to/db")

    embedding_model_gpu = TextEmbedding(
        model_name="intfloat/multilingual-e5-large",
        providers=["CUDAExecutionProvider"],
        device_ids=[1, 2, 3],
        cuda=True,
        lazy_load=True,
        # model_name="BAAI/bge-big-en-v1.5", providers=["CUDAExecutionProvider"]
    )
    print('Base Class')
    print(embedding_model_gpu.__class__.__bases__)
    # print(embedding_model_gpu.model.model.get_providers())
    print('Done loading embedding model on GPU')

    print('Loading Dataset')
    items_pool = prepare_dataset(limit=1024)
    print('Done loading dataset')

    start_idx_time = time.time()
    print('Start Indexing ..')
    # every embedding is a numpy array object
    embeds = embedding_model_gpu.embed(items_pool.docs, batch_size=256)
    end_idx_time = time.time()
    for embed in embeds:
        print(type(embed))
        print(embed.shape)
        # print(embed)  # numpy array
        break
    print(f'End Indexing in {end_idx_time - start_idx_time:.4f}')
Here I set device_ids to [1, 2, 3], but fastembed still runs on device 0. If you increase the batch size, you get:
Failed to allocate memory for requested buffer of size 17179869184
@Abdullahaml1 Please ensure that the parallel argument in .embed is == len(device_ids). In your example it's 3.
The reason is that parallel enables multi-GPU support by spawning a child process for each GPU specified in device_ids. To ensure proper utilization, the value of parallel must match the number of GPUs in device_ids. If you are using a single GPU, this parameter is not necessary.
It is also required to use the cuda=True argument when configuring the model, without explicitly specifying providers:
cuda and providers are mutually exclusive parameters.
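Putting that together, a corrected version of the snippet above could look like this (a sketch, assuming GPUs 1-3 are available):

embedding_model_gpu = TextEmbedding(
    model_name="intfloat/multilingual-e5-large",
    cuda=True,             # do not combine with providers=[...]
    device_ids=[1, 2, 3],
    lazy_load=True,        # let each child process load its own copy of the model
)

# parallel must equal len(device_ids), here 3, so all three GPUs are used
embeds = list(embedding_model_gpu.embed(items_pool.docs, batch_size=256, parallel=3))

Note that embed returns a generator, so the computation only happens when you iterate it (here via list()); timing just the embed call itself measures nothing.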