Passing language codes to model when encoding
I am planning to evaluate a few multilingual encoders with MTEB. Some of them (SONAR, for example) require explicitly specifying the input language when encoding a text. For monolingual tasks, this is not a problem: I can specify the language in advance, before running the task. However, cross-lingual tasks such as BUCC and Tatoeba alternate between several languages within a single run, so I cannot fix a single language before running the task.
So I wonder: is it possible to pass the language code, along with the other inputs, to the model's `encode` method?
This would enable evaluating language-dependent models correctly.
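To make the question concrete, here is roughly the call pattern I have in mind (the `language` kwarg is hypothetical and does not exist in any current `encode` signature):

```python
sentences = ["Bonjour le monde", "Comment ça va ?"]

# Hypothetical: MTEB would supply the language code of the current
# batch alongside the sentences, so a language-dependent model like
# SONAR could route the inputs to the right encoder.
embeddings = model.encode(sentences, language="fra_Latn")
```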
I probably could do this in my internal clone of MTEB, but I hesitate to commit my code to the public version, because the `encode` method of `SentenceTransformer` (the main type of model evaluated here) does not support extra arguments (https://github.com/UKPLab/sentence-transformers/blob/v2.3-release/sentence_transformers/SentenceTransformer.py#L220). However, for reproducibility purposes, it would still be nice to have my evaluation work with the public MTEB code.
What would you recommend?
I think it would be great to have that possibility in MTEB! I see three options (a rough sketch follows the list):

- We could use the Python standard library `inspect` to check the `encode` signature and, if it has a language kwarg, provide it.
- We could have an additional kwarg in the `.evaluate` func, something like `pass_language`, that, if activated, will pass that kwarg.
- We could check if the model has a method like `encode_with_lang` and, if so, use that instead of the regular `encode` function and pass it the lang kwarg. Similar to how `encode_query` is detected for retrieval.
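A rough sketch of how the first and third options could be combined (the `encode_with_lang` method and `language` kwarg are placeholder names, not an existing API):

```python
import inspect


def encode_with_optional_language(model, sentences, language, **kwargs):
    """Dispatch to a language-aware encode if the model supports one."""
    # Option 3: prefer a dedicated method if the model defines one,
    # similar to how a separate query-encoding method is detected
    # for retrieval.
    if hasattr(model, "encode_with_lang"):
        return model.encode_with_lang(sentences, language=language, **kwargs)

    # Option 1: inspect the regular encode signature for a language kwarg.
    if "language" in inspect.signature(model.encode).parameters:
        return model.encode(sentences, language=language, **kwargs)

    # Fall back to a plain encode call (e.g. SentenceTransformer).
    return model.encode(sentences, **kwargs)
```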
If you don't think any of the above is good, feel free to solve it in a different way! I'd be happy to review & merge a PR for this.
@Muennighoff you might be interested in the approach of SEB here, where we allow the model to do dynamic encoding based on the task object (most notably the metadata) e.g. for instruct-type models or for the cohere embedding model.
It's a good point. My main worry is that it comes at the cost of simplicity 🤔
In the current approach in SEB, all of the complexity is loaded onto the model's encode function.
It shouldn't require any more complexity in MTEB than passing the language code would (it is still just passing some information from the task on to the model's encode function). I.e., pass a defined object (e.g. TaskMeta) to the model, from which some models could use the language (SONAR models), the task type (Cohere's embed), or the task description (instruct models).
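To make that concrete, a minimal sketch of the idea (the `TaskMetadata` fields and class names below are illustrative, not SEB's actual API):

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class TaskMetadata:
    """Illustrative task object handed to the model at encode time."""
    name: str
    task_type: str  # e.g. "BitextMining", "Retrieval"
    languages: list[str] = field(default_factory=list)  # e.g. ["eng_Latn"]
    description: str = ""


class SonarStyleModel:
    """A language-dependent model that reads what it needs off the task."""

    def encode(self, sentences: list[str], task: TaskMetadata) -> np.ndarray:
        source_lang = task.languages[0]  # SONAR-style models use the language
        # An instruct-style model would instead read task.description,
        # and a Cohere-style model task.task_type.
        ...  # call the underlying language-dependent encoder here
        return np.zeros((len(sentences), 1024))  # placeholder embeddings
```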