
Add support for XLMRoberta embedding models

Open iamlemec opened this issue 1 year ago • 8 comments

This adds support for XLMRoberta embedding models, such as BAAI/bge-m3. The inference is done entirely through regular BERT, and tokenization uses the new T5 Unigram work. There is some modification necessary at conversion time:

  • Slightly reshuffle the special token order (as is done in HF)
  • Shift the position embedding matrix down by 1+pad_token_id slots

The position embedding shift has the added bonus that it usually makes the embedding matrix a nice even power of 2 again. It also means we don't have to add complexity to the inference code just to handle a pretty unusual design choice.
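
For illustration, here is a minimal Python sketch of what that shift can look like at conversion time (the constant and function name are illustrative, not the exact code in this PR; in practice pad_token_id comes from the tokenizer config):

```python
import torch

PAD_TOKEN_ID = 1  # assumed XLM-RoBERTa convention; read from the tokenizer config in practice

def shift_position_embeddings(pos_emb: torch.Tensor) -> torch.Tensor:
    # XLM-RoBERTa indexes real positions starting at pad_token_id + 1, so
    # dropping the first 1 + pad_token_id rows lets position ids start at 0
    # and plain BERT inference code can be used unchanged.
    return pos_emb[PAD_TOKEN_ID + 1:, :].contiguous()
```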

Also needed to tweak the Unigram tokenizer very slightly. Now the Unigram tokenize method only reverses the tokens it has just added, rather than the entire output list. This didn't matter before, but it's needed if you're adding a BOS token first.
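
As a rough illustration of that tweak (in Python for brevity; the actual change is in the C++ tokenizer, and `backtrack` here is a stand-in for the Unigram Viterbi backtracking step, which emits a fragment's tokens back-to-front):

```python
from typing import Callable

def append_tokens(output: list[int], fragment: str,
                  backtrack: Callable[[str], list[int]]) -> None:
    start = len(output)
    output.extend(backtrack(fragment))
    # Reverse only the newly appended slice, not the whole list, so a
    # BOS token already sitting at the front of `output` stays in place.
    output[start:] = reversed(output[start:])
```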

@Oliver-Y I just saw that you're working on this as well. If you have the time/interest, please take a look!

iamlemec avatar Jul 23 '24 20:07 iamlemec

Sounds good will take a look and get back to you!

Oliver-Y avatar Jul 23 '24 21:07 Oliver-Y

Yea, I definitely misconfigured the tokenizer in the conversion since I was largely following along with the BERT implementation without realizing the tokenizer changed. The T5 Unigram tokenizer fixed the tokenizer bug I was previously dealing with. This might be a simple question, but I keep seeing Unigram, BPE, SPM, and WPM everywhere, and I was under the impression that Unigram and BPE were the underlying algorithms that SPM and WPM use, so what exactly are the distinctions here?

For the position encodings, I added the shift into the llama.cpp inference under the llama_set_inputs function. Would this still be needed if modify_tensors in the conversion script accounts for it?

I did a preliminary test with this new conversion script on my local version but couldn't get the embeddings out of llama-embeddings to match up with the HF implementation. How did it fare with the bge-m3 model?

Oliver-Y avatar Jul 24 '24 01:07 Oliver-Y

@Oliver-Y Yeah, it's super confusing. They're all kind of similar, and some of the differences are only relevant at training time anyway. It looks like they really hashed it out in this Reddit thread: https://www.reddit.com/r/MachineLearning/comments/rprmq3/d_sentencepiece_wordpiece_bpe_which_tokenizer_is/

If you modify the tensors at conversion time, there's no need to change llama_set_inputs since the position IDs actually start from zero, just like usual. It also means you don't need to make a new architecture and can just use plain BERT.

I'm getting the numbers to match up between HF and here for both bge-m3 and multilingual-e5-large. In either case, make sure you're running on the same device and precision, since the numbers can diverge a good bit. Also, make sure you're doing mean pooling and normalizing for multilingual-e5-large.
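
For reference, the HF side of such a comparison could look roughly like this (a sketch assuming the intfloat/multilingual-e5-large checkpoint and the "query: " prefix from its model card, not the exact script used for the numbers above):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "intfloat/multilingual-e5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

batch = tokenizer(["query: hello world"], return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**batch).last_hidden_state            # [batch, seq, dim]

mask = batch["attention_mask"].unsqueeze(-1).float()      # [batch, seq, 1]
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)     # mean over real tokens
embeddings = F.normalize(pooled, p=2, dim=1)              # L2-normalize
```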

iamlemec avatar Jul 24 '24 07:07 iamlemec

Probably for another PR, but performance right now is not ideal due to tokenization speed. Specifically, recomputing token_matcher and user_defined_token_matcher on each call is quite costly. It might be a good idea to pre-compute these in llama_load_vocab. Does that seem reasonable @fairydreaming?

iamlemec avatar Jul 24 '24 07:07 iamlemec

What is the performance when comparing Python transformers to a llama.cpp run?

ExtReMLapin avatar Jul 24 '24 12:07 ExtReMLapin

@ExtReMLapin Doing 128 sequences with an average length of 75 tokens, I'm getting 180ms with transformers and 25s with llama.cpp. If you do all the tokens in one big sequence (with llama-tokenize) it only takes 500ms with llama.cpp.

iamlemec avatar Jul 24 '24 17:07 iamlemec

Thanks for the answer. Unless there is a typo somewhere, I expected it to be faster on llama.cpp.

ExtReMLapin avatar Jul 25 '24 04:07 ExtReMLapin

Probably for another PR, but performance right now is not ideal due to tokenization speed. Specifically, recomputing token_matcher and user_defined_token_matcher on each call is quite costly. It might be a good idea to pre-compute these in llama_load_vocab. Does that seem reasonable @fairydreaming?

@iamlemec I agree, this is definitely one of the bottlenecks that should be eliminated. Another is the naive trie implementation, which should at some point be replaced with something more efficient.
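
A rough sketch of the precompute idea (in Python for brevity; the real change would live in the C++ vocab-loading code, and the nested-dict trie below is just a stand-in for whatever matcher structure is eventually used):

```python
class UnigramVocab:
    """Sketch: build the expensive token matchers once when the vocab is
    loaded, rather than inside every tokenize() call."""

    def __init__(self, pieces: list[str], user_defined: list[str]):
        self.pieces = pieces
        # Precomputed once here; the per-call cost discussed above comes
        # from rebuilding these structures on every tokenization.
        self.token_matcher = self._build_trie(pieces)
        self.user_defined_token_matcher = self._build_trie(user_defined)

    @staticmethod
    def _build_trie(entries: list[str]) -> dict:
        # Minimal nested-dict trie, purely for illustration.
        root: dict = {}
        for entry in entries:
            node = root
            for ch in entry:
                node = node.setdefault(ch, {})
            node["<end>"] = entry
        return root
```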

fairydreaming avatar Jul 25 '24 07:07 fairydreaming

Is there anything preventing this from being merged?

fairydreaming avatar Aug 05 '24 20:08 fairydreaming