Support for EuroLLM encoder

Open dmar1n opened this issue 8 months ago • 4 comments

🚀 Feature

Support for more recent encoders, such as the one used for EuroLLM.

Motivation

First, thanks a lot for the great work on this framework! We have been using it to train our own UnifiedMetric. However, with only 24 GB of VRAM, our options are somewhat limited. We get the best results with XLM-RoBERTa as the encoder and pretrained models like InfoXLM or RoBERTa. Unfortunately, these models are not very recent and do not cover all the languages we use. So I wonder whether it would be possible to add support for other encoders, such as the one used in EuroLLM.

Many thanks in any case!

dmar1n, May 02 '25 16:05

EuroLLM is a causal LLM, so using it would require some continued training as a bidirectional encoder. EuroBERT should be plug-and-play, but unfortunately the "Euro" in EuroBERT does not really mean "European languages".

vince62s, May 02 '25 17:05

I see, thanks! I initially mentioned EuroBERT but then realized it only covers 15 languages, which is indeed far from the 24 official EU languages supported by EuroLLM. This remains the main limitation we're facing, so it would still be great to have more encoders supporting more recent pretrained models.

dmar1n, May 02 '25 17:05

I have actually experimented with EuroBERT, and I do have the code for it. I can open a PR.

ricardorei, May 07 '25 14:05

Yep, EuroBERT does not support many languages. Btw, a causal decoder can also work within COMET, and I have done something similar in the past. It complicates some features, but for regression it is totally fine to use the EOS token instead of the BOS token (which is the default for bidirectional models).
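
To illustrate the EOS-versus-BOS point: in a causal decoder each token attends only to earlier positions, so the first token's hidden state carries no information about the rest of the sentence, while the last real token has seen the whole input. The snippet below is a minimal, framework-free sketch of that pooling choice. It is not COMET's actual code; the function names `pool_first_token` and `pool_last_token` and the toy data are hypothetical, and it assumes the tokenizer appends EOS as the last non-padded token.

```python
from typing import List

Vector = List[float]


def _real_positions(mask: List[int]) -> List[int]:
    """Indices of non-padding tokens (mask value 1) in one sequence."""
    positions = [i for i, m in enumerate(mask) if m == 1]
    if not positions:
        raise ValueError("sequence has no non-padding tokens")
    return positions


def pool_first_token(hidden_states: List[List[Vector]],
                     attention_mask: List[List[int]]) -> List[Vector]:
    """BOS/CLS-style pooling: take the first real token of each sequence.

    Suitable for bidirectional encoders, where the first token sees everything.
    """
    return [seq[_real_positions(mask)[0]]
            for seq, mask in zip(hidden_states, attention_mask)]


def pool_last_token(hidden_states: List[List[Vector]],
                    attention_mask: List[List[int]]) -> List[Vector]:
    """EOS-style pooling: take the last real token of each sequence.

    Suitable for causal decoders, where only the last token sees everything.
    Using the mask (not index -1) keeps this correct with left or right padding.
    """
    return [seq[_real_positions(mask)[-1]]
            for seq, mask in zip(hidden_states, attention_mask)]


# Toy batch: 2 sequences, 4 positions, hidden size 2 (values are made up).
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]],  # last position is padding
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5], [1.6, 1.7]],
]
mask = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]

sentence_embeddings = pool_last_token(hidden, mask)
```

The pooled vectors would then be fed to the regression head in place of the first-token embedding; how that is wired into a real encoder class is left out here.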

ricardorei, May 07 '25 14:05