Instruction: How to add BAAI/bge-m3 embedder

Open bakongi opened this issue 1 year ago • 15 comments

Hi everyone. Here is a working example of how to add the BAAI/bge-m3 embedder to Verba.

  1. Copy the MiniLMEmbedder.py file in goldenverba/components/embedding and name the copy "BGEM3Embedder.py".
  2. Edit the new file: rename the MiniLMEmbedder class to BGEM3Embedder and update the name, description, and model identifiers to match:
from tqdm import tqdm
from wasabi import msg
from weaviate import Client

from goldenverba.components.embedding.interface import Embedder
from goldenverba.components.reader.document import Document


class BGEM3Embedder(Embedder):
    """
    BGEM3Embedder for Verba.
    """

    def __init__(self):
        super().__init__()
        self.name = "BGEM3Embedder"
        self.requires_library = ["torch", "transformers"]
        self.description = "Embeds and retrieves objects using SentenceTransformer's BAAI/bge-m3 model"
        self.vectorizer = "BAAI/bge-m3"
        self.model = None
        self.tokenizer = None
        try:
            import torch
            from transformers import AutoModel, AutoTokenizer

            def get_device():
                if torch.cuda.is_available():
                    return torch.device("cuda")
                elif torch.backends.mps.is_available():
                    return torch.device("mps")
                else:
                    return torch.device("cpu")

            self.device = get_device()

            self.model = AutoModel.from_pretrained(
                "BAAI/bge-m3", device_map=self.device
            )
            self.tokenizer = AutoTokenizer.from_pretrained(
                "BAAI/bge-m3", device_map=self.device
            )
            self.model = self.model.to(self.device)
...
  3. In manager.py in goldenverba/components/embedding, import the new class and register it:
from goldenverba.components.embedding.MiniLMEmbedder import MiniLMEmbedder
from goldenverba.components.embedding.BGEM3Embedder import BGEM3Embedder
from goldenverba.components.reader.document import Document


class EmbeddingManager:
    def __init__(self):
        self.embedders: dict[str, Embedder] = {
            "MiniLMEmbedder": MiniLMEmbedder(),
            "BGEM3Embedder": BGEM3Embedder(),
            "ADAEmbedder": ADAEmbedder(),
            "CohereEmbedder": CohereEmbedder(),
        }

...
  4. In goldenverba/components/schema/schema_generation.py, add the model name to EMBEDDINGS:
VECTORIZERS = {"text2vec-openai", "text2vec-cohere"}  # Needs to match with Weaviate modules
EMBEDDINGS = {"MiniLM", "BAAI/bge-m3"}  # Custom Vectors
  5. Done! Start Verba!

P.S. If you want to use an English-specific model such as "BAAI/bge-large-en", use "BAAI/bge-large-en" instead of "BAAI/bge-m3" and name the file and class accordingly.
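
The three edits above (the embedder class, manager.py, and schema_generation.py) have to agree with each other. The following is a minimal sketch of a consistency check. It uses stand-in objects rather than Verba's real classes: `StubEmbedder` and `inconsistent_entries` are hypothetical names, the "MiniLM" vectorizer value is an assumption, and how Verba itself reacts to a mismatch is not shown here.

```python
from dataclasses import dataclass

# Values copied from the schema_generation.py snippet above.
VECTORIZERS = {"text2vec-openai", "text2vec-cohere"}
EMBEDDINGS = {"MiniLM", "BAAI/bge-m3"}


@dataclass
class StubEmbedder:
    """Hypothetical stand-in exposing only the two attributes checked here."""
    name: str
    vectorizer: str


def inconsistent_entries(embedders: dict[str, StubEmbedder]) -> list[str]:
    """Return registry keys whose key differs from the embedder's name,
    or whose vectorizer string appears in neither set."""
    known = VECTORIZERS | EMBEDDINGS
    problems = []
    for key, embedder in embedders.items():
        if key != embedder.name or embedder.vectorizer not in known:
            problems.append(key)
    return problems


# Stand-in for the dict built in EmbeddingManager.__init__.
registry = {
    "MiniLMEmbedder": StubEmbedder("MiniLMEmbedder", "MiniLM"),  # assumed value
    "BGEM3Embedder": StubEmbedder("BGEM3Embedder", "BAAI/bge-m3"),
}
print(inconsistent_entries(registry))
```

An empty list means every registered entry has a matching name and a listed vectorizer string. Swapping in "BAAI/bge-large-en" without also adding it to EMBEDDINGS would make the check report that entry.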

bakongi avatar Mar 30 '24 07:03 bakongi

Great work! We'll look into this for the next update

thomashacker avatar Apr 11 '24 08:04 thomashacker

@bakongi I've done the same as you but I can't figure out where to choose this custom embedder in the frontend of Verba. Any suggestions please?

moncefarajdal avatar May 16 '24 09:05 moncefarajdal

> @bakongi I've done the same as you but I can't figure out where to choose this custom embedder in the frontend of Verba. Any suggestions please?

How did you install Verba: with pip or from source?

bakongi avatar May 16 '24 10:05 bakongi

@bakongi I installed Verba using pip install goldenverba, as shown in the documentation.

moncefarajdal avatar May 16 '24 10:05 moncefarajdal

> @bakongi I installed Verba using pip install goldenverba like shown in the documentation

OK. Where did you make the changes (which folder path)? I think you should make the changes in the Python shared library folder (site-packages) where Verba is installed.

bakongi avatar May 16 '24 11:05 bakongi

@bakongi I made the changes in exactly the files that you mentioned. "I think you should make changes in python shared library folder where verba is installed": can you please elaborate?

moncefarajdal avatar May 16 '24 13:05 moncefarajdal

@bakongi One more thing: the new embedding model that I added doesn't seem to be downloaded from Hugging Face. My guess is that an API key needs to be configured, or does sentence_transformers do the whole job? Thank you.

moncefarajdal avatar May 16 '24 13:05 moncefarajdal

> @bakongi One more thing, the new embedding model that I added doesn't seem to be downloaded from Hugging Face my guess is an api key should be configured or does sentence_transformers do the whole job? Thank you

The location of the Python shared library folder where installed libraries are stored depends on your operating system and the environment in which Python is running. Here are the typical locations for different environments:

On Unix-like systems (Linux, macOS):

  • System-wide installations: Libraries are generally stored in:

    • /usr/lib/pythonX.Y/site-packages or /usr/local/lib/pythonX.Y/site-packages (where X.Y is your Python version, e.g., python3.9).
  • User-specific installations: If you've installed libraries using pip with the --user option:

    • ~/.local/lib/pythonX.Y/site-packages
  • Virtual environments: If you're using a virtual environment (created with venv or virtualenv), libraries are stored within the virtual environment directory:

    • <virtualenv_path>/lib/pythonX.Y/site-packages

On Windows:

  • System-wide installations: Libraries are typically found in:

    • C:\PythonXY\Lib\site-packages (where XY is your Python version, e.g., Python39).
  • User-specific installations: If you've installed libraries using pip with the --user option:

    • C:\Users\<YourUsername>\AppData\Roaming\Python\PythonXY\site-packages
  • Virtual environments: If you're using a virtual environment, libraries are stored within the virtual environment directory:

    • <virtualenv_path>\Lib\site-packages

Checking the location programmatically:

You can also check the location of installed libraries programmatically using Python:

import site
import sys

# List all site-packages directories
print(site.getsitepackages())

# List user-specific site-packages directory
print(site.getusersitepackages())

# List all paths where Python looks for packages
print(sys.path)

This code will print the paths where Python searches for libraries, including the site-packages directories.
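
To find the exact folder that needs editing, it can also help to ask Python where it would import a given package from. This is a minimal sketch using only the standard library. `package_dir` is a hypothetical helper name, and it returns None when the package is not installed in the environment that runs the script.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def package_dir(name: str) -> Optional[Path]:
    """Return the directory a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    # Plain modules and built-ins have no submodule search locations,
    # so only real packages get past this check.
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


print(package_dir("goldenverba"))
```

Run this with the same interpreter that starts Verba. The printed folder is the one whose "components" subfolder holds the files being edited.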

bakongi avatar May 16 '24 13:05 bakongi

I see. I installed Verba with pip install goldenverba in a virtual environment created using Python's venv, and it's located in the project directory. Is this correct?

moncefarajdal avatar May 16 '24 13:05 moncefarajdal

> I see. I've installed Verba pip install goldenverba on a virtual environment created using python venv and it's located in the project directory. Is this correct?

When you install a Python package in a virtual environment, the package is installed within the directory structure of the virtual environment itself. This ensures that the package dependencies are isolated from the global Python environment and any other virtual environments you might have.

Here's a typical structure of a virtual environment:

<project_directory>/
├── <venv_name>/
│   ├── bin/                  # Executables and scripts (Linux/macOS) or Scripts/ (Windows)
│   └── lib/                  # Libraries (Linux/macOS) or Lib/ (Windows)
│       └── pythonX.Y/
│           └── site-packages/
│               └── goldenverba/
├── your_project_files/
└── ...
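
Whether the running interpreter really belongs to that virtual environment can be checked from Python itself. This is a small sketch under the assumption that the environment was created with venv or a recent virtualenv. `in_virtualenv` and `site_packages_dir` are hypothetical helper names.

```python
import sys
import sysconfig
from pathlib import Path


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


def site_packages_dir() -> Path:
    """Default install location for pure-Python packages of this interpreter."""
    return Path(sysconfig.get_paths()["purelib"])


print("virtual environment:", in_virtualenv())
print("expected location:", site_packages_dir() / "goldenverba")
```

If the first line prints False, the script was started with a Python outside the virtual environment, and edits made inside the environment's goldenverba folder will not be picked up by that interpreter.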

bakongi avatar May 16 '24 14:05 bakongi

So what should I do in this case for the project to run correctly?

moncefarajdal avatar May 16 '24 14:05 moncefarajdal

> So what should I do in this case for the project to run correctly?

Go to <venv_name>\Lib\site-packages\goldenverba (on Linux/macOS: <venv_name>/lib/pythonX.Y/site-packages/goldenverba) and make the necessary changes to the files in the "components" folder and its subfolders.

Or, if you downloaded the source files and made the changes there, just run

pip install -e .

in your virtual env.

bakongi avatar May 16 '24 14:05 bakongi

Not sure if this is your problem, @moncefarajdal, but I think you need to run pip install goldenverba[huggingface]
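
The embedder in the original post declares requires_library = ["torch", "transformers"], so it is worth confirming that both can be imported in the environment that runs Verba. This is a minimal sketch. `missing_libraries` is a hypothetical helper, not part of Verba, and how Verba itself uses requires_library is not shown in this thread.

```python
import importlib.util


def missing_libraries(names: list[str]) -> list[str]:
    """Return the names that cannot be found as importable top-level modules."""
    return [n for n in names if importlib.util.find_spec(n) is None]


print(missing_libraries(["torch", "transformers"]))
```

An empty list means both libraries are installed for this interpreter. Any name that is printed still has to be installed in that environment.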

luc42ei avatar Jun 05 '24 14:06 luc42ei

> Hi everyone. This is working sample how to add BAAI/bge-m3 embedder to Verba. …

For this to show up in Verba, you also need to adjust goldenverba/components/embedding/manager.py accordingly.

luc42ei avatar Jun 05 '24 14:06 luc42ei

unsubscribe

From: luc42ei Date: 2024-06-05 22:33 To: weaviate/Verba Subject: Re: [weaviate/Verba] Instruction: How to add BAAI/bge-m3 embedder (Issue #128)

13777469818 avatar Jun 06 '24 23:06 13777469818

We added the model to the newest release 🚀

thomashacker avatar Sep 03 '24 12:09 thomashacker