mteb Add support for saving embeddings in evals

@gmittal Currently, the save_predictions flag allows for the saving of query similarity predictions to a json file. However, I wish to have a separate flag to save the embeddings computed, say as a torch tensor, pkl, or json file for example. I believe this feature would allow for greater flexibility.

https://github.com/embeddings-benchmark/mteb/blob/main/mteb/abstasks/AbsTaskRetrieval.py#L303-L306

I can work on this issue myself if needed, but I wanted to verify such a feature was within scope for mteb.

(My work focuses on retrieval. I am less familiar with the AbsTask setup for other tasks, but from what I can tell they are similar enough that saving embeddings should be possible there too.)

May 26 '24 07:05 dhruvbpai

@dhruvbpai a simple solution might be to wrap the encoder:

my_encoder = ...

class EncoderWrapper:
  def __init__(self, encoder):
    # keep a reference to the wrapped encoder
    self.encoder = encoder

  def encode(self, sentences, **kwargs):
    # potentially load from disk here
    emb = self.encoder.encode(sentences, **kwargs)
    # save to disk here
    return emb

A setup like this is fully supported within MTEB.
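To make the wrapper idea concrete, here is a minimal sketch that caches embeddings on disk with pickle (one of the formats mentioned in the request). The names `DummyEncoder`, `CachingEncoder`, and `cache_dir`, as well as the hash-based file naming, are assumptions for illustration and not part of MTEB. The dummy encoder only stands in for a real model.

```python
import hashlib
import pickle
from pathlib import Path


class DummyEncoder:
    """Hypothetical stand-in for a real model; counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def encode(self, sentences, **kwargs):
        self.calls += 1
        # One tiny deterministic "embedding" per sentence.
        return [[float(len(s)), float(s.count(" "))] for s in sentences]


class CachingEncoder:
    """Wraps any object with an encode() method and caches its output on disk."""

    def __init__(self, encoder, cache_dir="embeddings_cache"):
        self.encoder = encoder
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path_for(self, sentences):
        # Assumption: the joined input text is a good enough cache key.
        joined = "\x1f".join(sentences).encode("utf-8")
        return self.cache_dir / f"{hashlib.sha256(joined).hexdigest()[:16]}.pkl"

    def encode(self, sentences, **kwargs):
        path = self._path_for(sentences)
        if path.exists():
            # Load from disk instead of re-encoding.
            with path.open("rb") as f:
                return pickle.load(f)
        emb = self.encoder.encode(sentences, **kwargs)
        # Save to disk for later runs.
        with path.open("wb") as f:
            pickle.dump(emb, f)
        return emb
```

Because pickle stores whatever the wrapped encoder returns, the same wrapper should work for lists, NumPy arrays, or tensors, though a real setup may prefer a format such as `.npy` for portability.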

May 27 '24 08:05 KennethEnevoldsen

@KennethEnevoldsen Thank you for your response. My aim was to save corpus and query embeddings separately, with the split and task in the file name. Unfortunately, the encoder doesn't get access to the name of the task it is being evaluated on, which I need in order to name the files correctly. A simple fix would be to make the task name available to the encoder somehow.

May 27 '24 21:05 dhruvbpai

This has actually been discussed as a part of #216 where it was pretty close to a merge but sadly never got finished.

I would suggest re-using the sentence-transformers prompt_name syntax (see sentence transformer encode args).

def encode(
    self, sentences: list[str], prompt_name: str, **kwargs: Any
):
    """
    ...
    prompt_name: Optional argument. MTEB will provide the task name during encode.
        This allows for task-specific prompts or other task-dependent encoding,
        e.g. encoding differently for clustering and retrieval.
    """
    ...

I would be very happy to see a PR on this.

May 28 '24 11:05 KennethEnevoldsen

I'm working on this PR now, and it occurred to me that it may be better to pass the task metadata dict directly instead of a string, since this would maximize flexibility and reduce complexity. What are your thoughts @KennethEnevoldsen?

May 28 '24 22:05 dhruvbpai

Sorry for missing this @dhruvbpai

I would just use the task_name, and then you can fetch the task using:

task = mteb.get_task("name")
meta = task.metadata

This is to keep it consistent with sentence transformers.
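As a rough sketch of how the original request could be met under this proposal, the wrapper below uses the task name to name the saved files. It assumes MTEB passes the task name as a `prompt_name` keyword argument, as suggested in this thread. `TaskAwareSavingEncoder`, `DummyEncoder`, `output_dir`, and the `<task>_<index>.pkl` naming scheme are hypothetical.

```python
import pickle
from pathlib import Path


class DummyEncoder:
    """Hypothetical stand-in for a real model."""

    def encode(self, sentences, **kwargs):
        return [[float(len(s))] for s in sentences]


class TaskAwareSavingEncoder:
    """Saves every encode() call to a file named after the task."""

    def __init__(self, encoder, output_dir):
        self.encoder = encoder
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self._counters = {}  # number of encode() calls seen per task

    def encode(self, sentences, prompt_name=None, **kwargs):
        # prompt_name is deliberately not forwarded to the wrapped encoder here.
        emb = self.encoder.encode(sentences, **kwargs)
        task = prompt_name or "unknown_task"
        idx = self._counters.get(task, 0)
        self._counters[task] = idx + 1
        path = self.output_dir / f"{task}_{idx:04d}.pkl"
        with path.open("wb") as f:
            pickle.dump({"sentences": list(sentences), "embeddings": emb}, f)
        return emb
```

If more than the name is needed, the metadata lookup shown above (`mteb.get_task(...)` followed by `.metadata`) could be called inside `encode`. Note that the task name alone does not tell the encoder whether it is encoding queries or the corpus, so separating those two would need additional information.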

Jun 05 '24 18:06 KennethEnevoldsen

> as a part of https://github.com/embeddings-benchmark/mteb/pull/216 where it was pretty close to a merge but sadly never got finished.

Actually, I would be happy to revive #216, but I would need you, the MTEB maintainers, to agree on the interface to do so before I start re-implementing it.

Jun 28 '24 13:06 avidale

Hi @avidale. We have actually added task-conditional encoding in #888, which allows for the encoding as stated above. This makes it possible to create prompts based on tasks (e.g. for the instruct e5 models). However, you might just as well use it to condition the encoding on other task metadata, such as the task's languages:

def encode(sentences, prompt_name):
  task = mteb.get_task(prompt_name)
  langs = task.metadata.languages
  # encode text based on languages

The one problem here is multilingual tasks (e.g. dan, eng, fra), where a task can have multiple languages: at the moment the model can't know whether it is currently encoding eng, fra, or dan. We could still add this.

Jun 29 '24 11:06 KennethEnevoldsen

Will close this issue for now. Feel free to reopen if required.

Sep 09 '24 16:09 KennethEnevoldsen