Support for EuroLLM encoder
🚀 Feature
Support for more recent encoders, such as the one used for EuroLLM.
Motivation
First, thanks a lot for the great work on this framework! We have been using it to train our own UnifiedMetric. However, with the limitation of 24 GB of VRAM, our options are somewhat limited. We get the best results with XLM-RoBERTa as the encoder and pretrained models like InfoXLM or RoBERTa. Unfortunately, these models are not very recent and do not cover all the languages we use. So, I wonder if it would be possible to add support for other encoders, such as the one used in EuroLLM.
Many thanks in any case!
EuroLLM is a causal LLM, so supporting it would require some continued training to turn it into a bidirectional encoder. EuroBERT should be plug-and-play, but unfortunately the "Euro" in EuroBERT does not mean broad coverage of European languages.
I see, thanks! I initially mentioned EuroBERT but then realized it only covers 15 languages, which is indeed far from the 24 official EU languages supported by EuroLLM. This remains the main limitation we're facing, so it would still be great to have support for encoders based on more recent pretrained models.
I have actually experimented with EuroBERT and I do have the code for it. I can make the PR.
Yep, EuroBERT does not support many languages. Btw, a causal decoder can also work within COMET, and I have done something similar in the past. It complicates some features, but for regression it is totally fine to use the EOS token instead of the BOS token (which is the default for bidirectional models).
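In case it is useful as a reference, here is a minimal sketch of that pooling change outside of COMET's own abstractions, assuming a Hugging Face causal checkpoint; the model name and the `sentence_embedding` helper are placeholders, not the actual COMET implementation:

```python
# Minimal sketch: with a causal decoder, pool the hidden state at the last
# non-padded token (the EOS position when the tokenizer appends EOS) instead
# of the BOS/CLS position used with bidirectional encoders like XLM-R.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "utter-project/EuroLLM-1.7B"  # placeholder causal checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Many causal-LM tokenizers have no pad token; reuse EOS for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def sentence_embedding(texts):
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)
    # Index of the last non-padded token in each sequence (EOS if the
    # tokenizer appends it; otherwise simply the final real token).
    last_idx = batch["attention_mask"].sum(dim=1) - 1
    return hidden[torch.arange(hidden.size(0)), last_idx]

# A regression head on top of this embedding would replace the usual
# BOS/CLS pooling used with bidirectional encoders.
```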