
Instruction: How to add BAAI/bge-m3 embedder

Open bakongi opened this issue 10 months ago • 15 comments

Hi everyone. This is a working example of how to add the BAAI/bge-m3 embedder to Verba.

  1. Create a copy of MiniLMEmbedder.py in goldenverba/components/embedding and rename it to "BGEM3Embedder.py".
  2. Edit the new file: rename the MiniLMEmbedder class to BGEM3Embedder and update the name, description, and model references to match:
from tqdm import tqdm
from wasabi import msg
from weaviate import Client

from goldenverba.components.embedding.interface import Embedder
from goldenverba.components.reader.document import Document


class BGEM3Embedder(Embedder):
    """
    BGEM3Embedder for Verba.
    """

    def __init__(self):
        super().__init__()
        self.name = "BGEM3Embedder"
        self.requires_library = ["torch", "transformers"]
        self.description = "Embeds and retrieves objects using the BAAI/bge-m3 model via Hugging Face transformers"
        self.vectorizer = "BAAI/bge-m3"
        self.model = None
        self.tokenizer = None
        try:
            import torch
            from transformers import AutoModel, AutoTokenizer

            def get_device():
                if torch.cuda.is_available():
                    return torch.device("cuda")
                elif torch.backends.mps.is_available():
                    return torch.device("mps")
                else:
                    return torch.device("cpu")

            self.device = get_device()

            # Load the weights and tokenizer. A tokenizer is not placed on a
            # device, so only the model is moved to the selected device.
            self.model = AutoModel.from_pretrained("BAAI/bge-m3")
            self.tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
            self.model = self.model.to(self.device)
...
  3. In manager.py in goldenverba/components/embedding, make these changes:
from goldenverba.components.embedding.MiniLMEmbedder import MiniLMEmbedder
from goldenverba.components.embedding.BGEM3Embedder import BGEM3Embedder
from goldenverba.components.reader.document import Document


class EmbeddingManager:
    def __init__(self):
        self.embedders: dict[str, Embedder] = {
            "MiniLMEmbedder": MiniLMEmbedder(),
            "BGEM3Embedder": BGEM3Embedder(),
            "ADAEmbedder": ADAEmbedder(),
            "CohereEmbedder": CohereEmbedder(),
        }

...
  4. Make changes in goldenverba/components/schema/schema_generation.py:
VECTORIZERS = {"text2vec-openai", "text2vec-cohere"}  # Needs to match with Weaviate modules
EMBEDDINGS = {"MiniLM", "BAAI/bge-m3"}  # Custom Vectors
  5. Done! Start Verba!
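
The three edits have to agree with each other: the key registered in manager.py, the name set on the class, and the vectorizer string, which has to appear in EMBEDDINGS. Here is a minimal sketch of that consistency check, runnable without Verba installed. The stub Embedder class and the check_registration helper are hypothetical and not part of Verba, and the rule that the dictionary key should equal the embedder's name is an assumption based on the existing entries.

```python
# Stand-ins for the real Verba objects so this runs on its own.
EMBEDDINGS = {"MiniLM", "BAAI/bge-m3"}  # as in schema_generation.py


class Embedder:
    """Hypothetical stub of goldenverba's Embedder interface."""

    def __init__(self):
        self.name = ""
        self.vectorizer = ""


class BGEM3Embedder(Embedder):
    def __init__(self):
        super().__init__()
        self.name = "BGEM3Embedder"
        self.vectorizer = "BAAI/bge-m3"


def check_registration(embedders, embeddings):
    """Return a list of problems; an empty list means everything lines up."""
    problems = []
    for key, embedder in embedders.items():
        # Assumption: the registry key mirrors the embedder's own name.
        if key != embedder.name:
            problems.append(f"{key}: key differs from name {embedder.name!r}")
        # Custom-vector embedders must be listed in EMBEDDINGS.
        if embedder.vectorizer not in embeddings:
            problems.append(f"{key}: {embedder.vectorizer!r} missing from EMBEDDINGS")
    return problems


embedders = {"BGEM3Embedder": BGEM3Embedder()}  # as in manager.py
print(check_registration(embedders, EMBEDDINGS))
```

If the vectorizer string in the class and the entry in EMBEDDINGS are spelled differently, the helper reports it instead of returning an empty list.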

P.S. If you want to use an English-specific model such as "BAAI/bge-large-en", just use "BAAI/bge-large-en" instead of "BAAI/bge-m3" and name the file and class accordingly.
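
To make the swap concrete, here is a small sketch that lists every place the model name shows up for a given model. It is illustrative only: EmbedderSpec, touch_points, and the class name BGELargeENEmbedder are hypothetical names I picked, not something Verba provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbedderSpec:
    model_id: str    # Hugging Face model ID, e.g. "BAAI/bge-m3"
    class_name: str  # name chosen for the Embedder subclass (hypothetical)


def touch_points(spec):
    """List the files to edit and what to put there for one model."""
    base = "goldenverba/components"
    return [
        f"{base}/embedding/{spec.class_name}.py: class {spec.class_name}, "
        f"vectorizer and from_pretrained use {spec.model_id!r}",
        f"{base}/embedding/manager.py: register {spec.class_name!r}: {spec.class_name}()",
        f"{base}/schema/schema_generation.py: add {spec.model_id!r} to EMBEDDINGS",
    ]


spec = EmbedderSpec("BAAI/bge-large-en", "BGELargeENEmbedder")
for line in touch_points(spec):
    print(line)
```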

bakongi, Mar 30 '24 07:03