
Model Request for BAAI/bge-m3 (XLMRoberta-based Multilingual Embedding Model)

Open mofanke opened this issue 11 months ago • 1 comment

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Support for the multilingual embedding model BAAI/bge-m3: https://huggingface.co/BAAI/bge-m3

Motivation

There are some differences between multilingual embedding models and BERT.

Possible Implementation

Sorry, no idea. I tried it; the model architecture seems to be the same as BERT, but the tokenizer is XLMRobertaTokenizer, not BertTokenizer.

mofanke avatar Mar 12 '24 06:03 mofanke

Also requesting support for this model.

RoggeOhta avatar Apr 23 '24 01:04 RoggeOhta

Tried to support it using BertModel and an SPM tokenizer: https://huggingface.co/vonjack/bge-m3-gguf

Tested cosine similarity between "中国" ("China") and "中华人民共和国" ("People's Republic of China"):

  • bge-m3-f16: 0.9993230772798457
  • mxbai-embed-large-v1-f16: 0.7287733321223814
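
The comparison above is just cosine similarity over the two sentence-embedding vectors. A minimal stdlib-only sketch (the toy vectors here are placeholders; in the real test the vectors would come from the embedding model):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for the model's embeddings of the two strings.
print(cosine_similarity([0.2, 0.9, 0.4], [0.21, 0.88, 0.41]))
```

Near-identical vectors score close to 1.0, which is what a good multilingual model should produce for the two semantically related Chinese strings.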

vonjackustc avatar May 04 '24 03:05 vonjackustc

I got an error when using it with langchain: "terminate called after throwing an instance of 'std::out_of_range'"

vuminhquang avatar May 12 '24 12:05 vuminhquang

Same here with llama.cpp; the full error:

libc++abi: terminating due to uncaught exception of type std::out_of_range: unordered_map::at: key not found

ciekawy avatar May 21 '24 14:05 ciekawy

The _bert version does not crash, but the embeddings do not seem to make any sense...

ciekawy avatar May 21 '24 14:05 ciekawy

Also tried to follow the instructions at https://github.com/PrithivirajDamodaran/blitz-embed, but after converting to GGUF I get the error:

llama_model_quantize: failed to quantize: key not found in model: bert.context_length

ciekawy avatar May 21 '24 15:05 ciekawy

@vonjackustc can you share the params you used with llama.cpp?

ciekawy avatar May 22 '24 17:05 ciekawy

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Jul 07 '24 01:07 github-actions[bot]

@vonjackustc Same issue as @vuminhquang and @ciekawy when running it using Ollama.

It appears that embedding text containing \n (a newline character) results in the following error:

terminate called after throwing an instance of 'std::out_of_range'
  what():  _Map_base::at

This issue is also brought up here: https://huggingface.co/vonjack/bge-m3-gguf/discussions/3.

BTW, as an alternative, I am using Text Embeddings Inference to run BAAI/bge-m3 now.

theta-lin avatar Jul 13 '24 07:07 theta-lin

For embeddings, I'd say it's usually safe, if not desirable, to remove newlines. This may be less obvious for longer texts, but still...
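
A minimal sketch of that workaround (the helper name is hypothetical): collapse newlines, and any other runs of whitespace, to single spaces before passing text to the embedding model, so the tokenizer never sees the "\n" that triggers the reported crash.

```python
def sanitize_for_embedding(text: str) -> str:
    # str.split() with no argument splits on any whitespace run
    # (including "\n") and drops empties; rejoining with single
    # spaces yields newline-free text to send to the embedder.
    return " ".join(text.split())

print(sanitize_for_embedding("first line\nsecond line"))  # first line second line
```

Note this also collapses double spaces and tabs, which is usually harmless for embedding but does alter the exact input text.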

ciekawy avatar Jul 13 '24 09:07 ciekawy

> Tried to support it using BertModel and an SPM tokenizer: https://huggingface.co/vonjack/bge-m3-gguf
>
> Tested cosine similarity between "中国" and "中华人民共和国": bge-m3-f16: 0.9993230772798457, mxbai-embed-large-v1-f16: 0.7287733321223814

May I ask how exactly this is accomplished?

Huoxu69 avatar Jul 26 '24 01:07 Huoxu69