Passing language codes to model when encoding
I am planning to evaluate a few multilingual encoders with MTEB. Some of them (SONAR, for example) require explicitly specifying the input language when encoding a text. For monolingual tasks, this is not a problem: I can specify the language in advance, before running the task. However, cross-lingual tasks such as BUCC and Tatoeba alternate between several languages within a single run, so I cannot fix a single language before running the task.
So I wonder: is it possible to pass the language code, along with the other inputs, to the model's `encode` method?
This would enable evaluating language-dependent models correctly.
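To make the question concrete, here is roughly the call pattern I have in mind (the `language` kwarg is hypothetical and does not exist in any current `encode` signature):

```python
sentences = ["Bonjour le monde", "Comment ça va ?"]

# Hypothetical: MTEB would supply the language code of the current
# batch alongside the sentences, so a language-dependent model like
# SONAR could route the inputs to the right encoder.
embeddings = model.encode(sentences, language="fra_Latn")
```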
I probably could do this in my internal clone of MTEB, but I hesitate to commit my code to the public version, because the `encode` method of `SentenceTransformer` (the main type of model evaluated here) does not support extra arguments (https://github.com/UKPLab/sentence-transformers/blob/v2.3-release/sentence_transformers/SentenceTransformer.py#L220). However, for reproducibility purposes, it would still be nice to have my evaluation work with the public MTEB code.
What would you recommend?
I think it would be great to have that possibility in MTEB! I see three options (a rough sketch follows the list):

- We could use the Python standard library `inspect` to check the `encode` signature and, if it has a language kwarg, provide it.
- We could have an additional kwarg in the `.evaluate` func, something like `pass_language`, that, if activated, will pass that kwarg.
- We could check if the model has a method like `encode_with_lang` and, if so, use that instead of the regular `encode` function and pass it the lang kwarg. Similar to how `encode_query` is detected for retrieval.
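A rough sketch of how the first and third options could be combined (the `encode_with_lang` method and `language` kwarg are placeholder names, not an existing API):

```python
import inspect


def encode_with_optional_language(model, sentences, language, **kwargs):
    """Dispatch to a language-aware encode if the model supports one."""
    # Option 3: prefer a dedicated method if the model defines one,
    # similar to how a separate query-encoding method is detected
    # for retrieval.
    if hasattr(model, "encode_with_lang"):
        return model.encode_with_lang(sentences, language=language, **kwargs)

    # Option 1: inspect the regular encode signature for a language kwarg.
    if "language" in inspect.signature(model.encode).parameters:
        return model.encode(sentences, language=language, **kwargs)

    # Fall back to a plain encode call (e.g. SentenceTransformer).
    return model.encode(sentences, **kwargs)
```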
If you don't think any of the above is good, feel free to solve it in a different way! I'd be happy to review & merge a PR for this.
@Muennighoff you might be interested in the approach of SEB here, where we allow the model to do dynamic encoding based on the task object (most notably the metadata) e.g. for instruct-type models or for the cohere embedding model.
It's a good point. My main worry is that it comes at the cost of simplicity 🤔
In the current approach in SEB, all of the complexity is loaded onto the model's encode function.
It shouldn't require any more complexity in MTEB than passing the language code would (it is still just passing some information from the task on to the model's encode function). I.e., pass a defined object (e.g. TaskMeta) to the model, from which some models could use the language (SONAR models), the task type (Cohere's embed), or the task description (instruct models).
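To make that concrete, a minimal sketch of the idea (the `TaskMetadata` fields and class names below are illustrative, not SEB's actual API):

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class TaskMetadata:
    """Illustrative task object handed to the model at encode time."""
    name: str
    task_type: str  # e.g. "BitextMining", "Retrieval"
    languages: list[str] = field(default_factory=list)  # e.g. ["eng_Latn"]
    description: str = ""


class SonarStyleModel:
    """A language-dependent model that reads what it needs off the task."""

    def encode(self, sentences: list[str], task: TaskMetadata) -> np.ndarray:
        source_lang = task.languages[0]  # SONAR-style models use the language
        # An instruct-style model would instead read task.description,
        # and a Cohere-style model task.task_type.
        ...  # call the underlying language-dependent encoder here
        return np.zeros((len(sentences), 1024))  # placeholder embeddings
```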