llama.cpp
XLMRoberta support
Added support for the XLMRoberta model, tested with the Multilingual E5 embeddings model. The tokenizer.json of E5 specifies a preprocessor, but since llama.cpp doesn't support SPM preprocessors yet, I added a simple workaround right before the SPM tokenizer call.
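For context, a minimal sketch of the kind of workaround meant here, assuming the goal is to approximate the Metaspace-style pre-tokenization that XLM-R's tokenizer.json describes (prefix space plus space-to-"▁" replacement) before handing the text to the SPM tokenizer. The function name `spm_preprocess` is illustrative and not part of llama.cpp's API, and this is not the actual patch:

```cpp
#include <cstdio>
#include <string>

// Replace spaces with the SentencePiece metaspace character U+2581 ("▁")
// and prepend one to the whole string (add_prefix_space behaviour).
// Hypothetical helper for illustration only.
static std::string spm_preprocess(const std::string & text) {
    const std::string metaspace = "\xe2\x96\x81"; // UTF-8 encoding of U+2581
    std::string out;
    out.reserve(text.size() + metaspace.size());
    out += metaspace; // leading prefix space
    for (char c : text) {
        if (c == ' ') {
            out += metaspace;
        } else {
            out += c;
        }
    }
    return out;
}

int main() {
    // The preprocessed string would then be passed to the SPM tokenizer
    // instead of the raw input.
    std::string s = spm_preprocess("Hello world");
    printf("%s\n", s.c_str());
    return 0;
}
```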
This is my first time contributing, so I would love feedback of any kind!
- [x] I have read the contributing guidelines
- Self-reported review complexity: Low-Medium
- [x] Low
- [x] Medium
- [ ] High
There might be a bug with tokenization; going to take a look first.
Redundant with #8658, so closing.