[Bug]: Parallel Embedding is not working on Windows Servers
What happened?
I am trying to encode my dataset with multiple CUDA GPUs, but only one GPU is working.
What is the expected behaviour?
All 4 specified GPUs should be used.
A minimal reproducible example
from fastembed import LateInteractionTextEmbedding

embedding_model = LateInteractionTextEmbedding("jinaai/jina-colbert-v2", cuda=True, device_ids=[0, 1, 2, 3])
descriptions_embeddings = list(embedding_model.embed(documents, parallel=4))
What Python version are you on? e.g. python --version
Python 3.11
FastEmbed version
v0.4.2
What OS are you seeing the problem on?
No response
Relevant stack traces and/or logs
No response
hi @abdelkareemkobo,
parallel=4 does not spread the data across all available GPUs by default; you need to initialize your model with the cuda and device_ids params, like:

LateInteractionTextEmbedding(
    model_name=model_name,
    cuda=args.use_cuda,
    device_ids=device_ids,
    lazy_load=lazy_load,
)
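To spell that out end to end, here is a minimal sketch (assuming four visible CUDA GPUs and that documents is a list of strings, as in the original report):

from fastembed import LateInteractionTextEmbedding

model = LateInteractionTextEmbedding(
    model_name="jinaai/jina-colbert-v2",
    cuda=True,                # use the CUDA execution provider
    device_ids=[0, 1, 2, 3],  # GPUs to spread the work across
    lazy_load=True,           # defer model loading to the worker processes
)

# parallel should equal len(device_ids) so that one worker is spawned per GPU
embeddings = list(model.embed(documents, parallel=4))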
Thanks @joein, but I'm running into the same issue, only on Ubuntu. I'm not able to select the CUDA devices for indexing, and the docs don't make it clear how to index on multiple GPUs on the same machine.
Here is a snippet to reproduce, using Python 3.12:
import time
from dataclasses import dataclass
from typing import Any
import os

from qdrant_client import QdrantClient
from datasets import load_dataset
from fastembed import TextEmbedding


@dataclass
class CollectionItem:
    text: str
    metadata: dict[str, Any] = None

    def __post_init__(self):
        if self.metadata is None:
            self.metadata = {'text': self.text}


@dataclass
class CollectionItemPool:
    items: list[CollectionItem]
    docs: list[str] = None
    metadata: list[dict] = None

    def __post_init__(self):
        if self.docs is None:
            self.docs = [i.text for i in self.items]
        if self.metadata is None:
            self.metadata = [i.metadata for i in self.items]


def prepare_dataset(limit: int = None) -> CollectionItemPool:
    en_ds = load_dataset("allenai/c4", "en", split='train', streaming=True)
    if limit is not None:
        assert isinstance(limit, int), (
            f'`limit` has to be an integer, got {type(limit)}')
    ds = en_ds
    # ds = en_ds.select(range(limit))
    items: list[CollectionItem] = []
    for idx, ds_item in enumerate(ds):
        if idx == limit:
            break
        item = CollectionItem(text=ds_item['text'])
        items.append(item)
    return CollectionItemPool(items=items)


if __name__ == '__main__':
    # setting cuda devices
    # os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

    # Initialize the client
    client = QdrantClient(":memory:")  # or QdrantClient(path="path/to/db")

    embedding_model_gpu = TextEmbedding(
        model_name="intfloat/multilingual-e5-large",
        providers=["CUDAExecutionProvider"],
        device_ids=[1, 2, 3],
        cuda=True,
        lazy_load=True,
        # model_name="BAAI/bge-big-en-v1.5", providers=["CUDAExecutionProvider"]
    )
    print('Base Class')
    print(embedding_model_gpu.__class__.__bases__)
    # print(embedding_model_gpu.model.model.get_providers())
    print('Done loading embedding model on GPU')

    print('Loading Dataset')
    items_pool = prepare_dataset(limit=1024)
    print('Done loading dataset')

    start_idx_time = time.time()
    print('Start Indexing ..')
    # every embedding is a numpy array object
    embeds = embedding_model_gpu.embed(items_pool.docs, batch_size=256)
    end_idx_time = time.time()
    for embed in embeds:
        print(type(embed))
        print(embed.shape)
        # print(embed)  # numpy array
        break
    print(f'End Indexing in {end_idx_time - start_idx_time:.4f}')
Here I set device_ids to [1, 2, 3], but fastembed still runs on device 0. If you increase the batch size, you get:
Failed to allocate memory for requested buffer of size 17179869184
@Abdullahaml1 Please ensure that the parallel argument in .embed is == len(device_ids). In your example it's 3.
The reason is that parallel enables multi-GPU support by spawning a child process for each GPU specified in device_ids. To ensure proper utilization, the value of parallel must match the number of GPUs in device_ids. If you are using a single GPU, this parameter is not necessary.
It is also required to use the cuda=True argument when configuring the model, without explicitly specifying providers:
cuda and providers are mutually exclusive parameters.
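Putting that together, a corrected version of the snippet above could look like this (a sketch, assuming GPUs 1-3 are available):

embedding_model_gpu = TextEmbedding(
    model_name="intfloat/multilingual-e5-large",
    cuda=True,             # do not combine with providers=[...]
    device_ids=[1, 2, 3],
    lazy_load=True,        # let each child process load its own copy of the model
)

# parallel must equal len(device_ids), here 3, so all three GPUs are used
embeds = list(embedding_model_gpu.embed(items_pool.docs, batch_size=256, parallel=3))

Note that embed returns a generator, so the computation only happens when you iterate it (here via list()); timing just the embed call itself measures nothing.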