
Model registry: A proposal

Open KennethEnevoldsen opened this issue 10 months ago • 20 comments

We have previously suggested registering models to allow for reproducibility. We also include a lot of metadata on the benchmark leaderboard, which would be nice to register along with the model. My suggestion is as follows:

Implement a model class as follows:

from datetime import datetime
from typing import Callable, Literal

class ModelMeta:
    # Encoder and Language are existing mteb types (not defined here)
    loader: Callable[..., Encoder] | None = None  # if None, it will just default to loading the Sentence Transformers model
    name: str  # ideally the name on Hugging Face
    n_parameters: int
    memory_usage: float
    max_tokens: int
    embedding_dimension: int
    revision: str  # e.g. a Hugging Face commit hash
    release_date: datetime  # useful for tracking improvement over time on a given task
    license: str | None  # required if open source
    open_source: bool  # "Proprietary" / "Open"; we could remove this in favor of license
    framework: list[Literal["Sentence Transformers", "PyTorch", ...]]  # not exhaustive
    languages: list[Language]  # languages the model is intended for

I would expect the user interface to look something like this:

model_with_meta = mteb.get_model("intfloat/multilingual-e5-large")
tasks = mteb.get_tasks(languages=["eng"])

benchmark = MTEB(tasks=tasks)
benchmark.run(model_with_meta)
# or
encoder = model_with_meta.load_encoder()
benchmark.run(encoder)  # this is the current interface

What I am looking for is:

  • Do you agree with this approach? (A thumbs up will do.)
  • Any metadata we should add or remove
  • Any changes to the user interface

Related to #314, and a similar approach is already implemented in the Scandinavian Embedding Benchmark. Also previously discussed in #475.

Tagging relevant contributors:

  • @tomaarsen, this likely affects the leaderboard
  • @Muennighoff, @imenelydiaker, @x-tabdeveloping, @orionw @isaac-chung, as we discussed during the meeting.

KennethEnevoldsen avatar May 02 '24 08:05 KennethEnevoldsen

That's really interesting! Some suggestions here:

  • We could replace implementation with framework.
  • For languages, can multilingual be an option? We sometimes don't know how many languages a model handles.

imenelydiaker avatar May 02 '24 08:05 imenelydiaker

We could replace implementation with framework.

Agree, updated.

For languages, can multilingual be an option? We sometimes don't know how many languages a model handles.

I wonder if people might overuse the multilingual tag (e.g. using it even though a model is only trained on Indo-European languages). A solution might be to have predefined lists of languages, such as Indo-European, which people can use. Multilingual would then just be a list of all languages in MTEB.
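Roughly, I imagine something like this (a rough sketch; the codes and the helper are illustrative only):

INDO_EUROPEAN = ["eng", "deu", "fra", "spa", "hin"]  # illustrative subset of ISO 639-3 codes
MULTILINGUAL = all_mteb_languages()  # hypothetical helper returning every language covered by MTEB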

KennethEnevoldsen avatar May 02 '24 11:05 KennethEnevoldsen

I think this looks good. Just wanted to clarify the "register" bit. Curious whether only some of the model metadata would be stored for each eval run (e.g. only the model revision), or whether the entire filled class would be stored, e.g. as a separate file. I assume we'd try to read as many fields as possible from the HF model card as well.

isaac-chung avatar May 02 '24 11:05 isaac-chung

So by a registry I simply mean that you are able to fetch the ModelMeta for a source, and users are allowed to update that source (by registering their own models). If the ModelMeta isn't registered in MTEB, it would default to extracting as much metadata as possible, if toggled on:

modelmeta = mteb.get_model("my_custom_model", estimate_metadata=True)
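
Under the hood, the registry could be as simple as the following sketch (the names are illustrative, not a final API):

model_registry: dict[str, ModelMeta] = {}

def register_model(meta: ModelMeta) -> None:
    model_registry[meta.name] = meta

def get_model(name: str, estimate_metadata: bool = False) -> ModelMeta:
    if name in model_registry:
        return model_registry[name]
    if estimate_metadata:
        # hypothetical fallback: read whatever fields can be extracted from the model card
        return estimate_meta_from_model_card(name)
    raise KeyError(f"{name!r} is not registered; pass estimate_metadata=True to infer what is possible")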

For a model in the results folder, I would have a structure somewhat like:

results
|-- {model_name}
|   |-- model_meta.json
|   |-- {dataset1}.json
|   |-- {dataset2}.json
|   |-- ...
|-- ...

I am not sure what happens if the model revision is updated. Any suggestions? Potentially solved by saving results using "{model_name}-{revision}".
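
For example (a sketch of the "{model_name}-{revision}" idea; the helper name is made up):

from pathlib import Path

def results_dir(model_name: str, revision: str) -> Path:
    # slashes in Hugging Face names ("org/model") are not valid in a single folder name
    return Path("results") / f"{model_name.replace('/', '__')}-{revision}"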

KennethEnevoldsen avatar May 02 '24 12:05 KennethEnevoldsen

To make sure I understand: this would be like the results metadata that currently exists for many models, but with model-specific data? Seems like a good idea.

We'd still have the same issue of a lot of models not having it filled out, but we could create some repo that stores extra details for ones that don't give it (APIs, old models that people won't change, etc.)

orionw avatar May 02 '24 13:05 orionw

Once we have it in, it should be much easier to require it for result PRs as well, if that is what we want to do.

Looks like most people agree. I will see if I can find time to add this tomorrow; otherwise I will do it next week.

KennethEnevoldsen avatar May 02 '24 15:05 KennethEnevoldsen

I think versioning the models would make a lot of sense. Also, requiring people to specify what has changed between versions would be very useful (a changelog of sorts). I found it quite confusing that models can sometimes jump several places up or down on the leaderboard without any clarification or indication that these were different versions. (I remember this happening to Mistral E5.)

x-tabdeveloping avatar May 03 '24 09:05 x-tabdeveloping

Hmm, I don't believe the changelog should live on our end; as long as it is only the revision that changes, it should be clear from the model repo (and if it is a change in implementation, it should be clear from the loader).

KennethEnevoldsen avatar May 03 '24 09:05 KennethEnevoldsen

Really great idea! What do you think about making this a new repository? It could become something like a library of embedding models (including APIs). Afaict there is no such library atm, as SentenceTransformers only covers a specific kind of model. I think that would be better along the axes of:

a) Scalability: MTEB stays more lightweight and thus easier to scale. That library would also not be limited by mteb and could scale to potentially become even more impactful than mteb itself.

b) Usability: People could use that library on its own if they just want to load various models. Similarly, people would have fewer requirements to install when installing mteb, as many models will likely require new packages.

Keeping things modular has been a big part of the success of HF, I think: making HF/transformers, HF/datasets, and HF/evaluate a single library would have massively limited their scalability.

Similarly, having one folder/file for each model, at the cost of some duplicate code, is why HF/transformers has scaled so well, I think. Reusing components saves lines of code but can hinder both scalability and usability, as e.g. in this codebase or even partly in Sentence Transformers.

Curious about your thoughts; Happy to be disagreed with! 😊

Muennighoff avatar May 03 '24 16:05 Muennighoff

Just to start off with: I completely agree with the notion of keeping libraries lightweight, and that this allows for better scaling.

The goal of the proposed solution seems to fall in line with the existing model hub which HF provides (though with a few standard interfaces).

The primary intention of the model registry within MTEB is, however, to document how code is run specifically on MTEB, and I worry that a lot of the code will be benchmark-specific (e.g. prompts for tasks). A solution could be to start it off within MTEB (simply intended for documenting models to begin with), but keep it as a separate module and then potentially split it out in the future if the need arises.

Similarly, having one folder/file for each model, at the cost of some duplicate code, is why HF/transformers has scaled so well, I think. [...]

I believe, for example, that implementing models in this way would be a very reasonable approach here as well.

KennethEnevoldsen avatar May 06 '24 12:05 KennethEnevoldsen

Sounds good to me! We can split it out later if it makes sense and is sufficiently separate (e.g. one could also just store the kwargs inside MTEB to exactly reproduce runs with models from the embedding library)

Muennighoff avatar May 06 '24 14:05 Muennighoff

I totally agree with @KennethEnevoldsen here. I think keeping things nice and separate would amount to us making the most stupidly simple implementations of everything here in MTEB and abstracting as little behaviour as humanly possible. I think, especially now with our current pace, another library would be a liability, as MTEB would be coupled to an external code base that needs to be maintained independently.

I believe if we want to create a one-stop-shop for embedding models we should perhaps contribute to SentenceTransformers.

x-tabdeveloping avatar May 08 '24 13:05 x-tabdeveloping

Thinking about this more, I think our goal here is to keep track of which model version yielded which result on which dataset revision.

Could this proposal be simplified to the following? In the results folder, use the model revision as an intermediate layer between model names and result files.

results
|-- {model_name}
|   |-- {model_revision}
|   |   |-- datasetA.json

However, the model revision is not currently exposed on the SentenceTransformer model, so for this to work we would need some changes there. Not sure how much of a lift that would be; maybe @tomaarsen can help answer that.
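
In code, a lookup in this layout could be as simple as (a sketch; the helper name is made up):

from pathlib import Path

def result_file(model_name: str, model_revision: str, dataset: str) -> Path:
    # note: a name like "org/model" adds one extra nesting level here;
    # it could instead be escaped as in the sketch further up in the thread
    return Path("results") / model_name / model_revision / f"{dataset}.json"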

isaac-chung avatar May 11 '24 11:05 isaac-chung

Model revision should be accessible in Sentence Transformers v3.0 via model.model_card_data.base_model_revision (ETA: ~2 weeks, only need to update the docs). However, not all models will have a revision. In particular, local ones won't.

Another point of note: the name of the loaded model should also become accessible via model.model_card_data.base_model, again only if the model is loaded from Hugging Face.

  • Tom Aarsen

tomaarsen avatar May 11 '24 12:05 tomaarsen

Also, what if the model is not usable in SentenceTransformers? We would ideally still have some version or revision on those. With proprietary embedding models it might also be reasonable to record the date, as those might change at any point without notice from the company.

x-tabdeveloping avatar May 11 '24 12:05 x-tabdeveloping

Thanks @tomaarsen and @x-tabdeveloping. So in terms of date, I'd say that's more or less covered by the date at which the results were committed. The exact date of the model could be part of the model repository to avoid bloating MTEB.

As for models not from Sentence Transformers or HF (those that still match the encoder interface), maybe the solution would be to require the user to supply the revision manually, instead of reading it automatically as in my proposal above.

I'd like to see if there is any other feedback on the simpler proposal, as it would remove the need for the ModelMeta class and for tracking a folder/file per model.

isaac-chung avatar May 11 '24 12:05 isaac-chung

Though I'm not sure how this would fit our timeline, or whether this is a blocker for model runs for the submission. cc @KennethEnevoldsen

isaac-chung avatar May 11 '24 12:05 isaac-chung

As for models not from Sentence Transformers or HF (those that still match the encoder interface), maybe the solution would be to require the user to supply the revision manually, instead of reading it automatically as in my proposal above.

Generally, my favorite solution is to check for the few cases where you know that you can automatically gather information, e.g. by checking if a model is a SentenceTransformer instance, and if those automated solutions fail: raise an error that the user must provide that information if it cannot be gathered automatically (or leave it as None with a warning if it's not sufficiently important).

That removes some of the burden from the user, so they don't have to provide extra information that can be gathered automatically in 70% of cases.
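
Something like this sketch, using the v3.0 attribute mentioned above (the helper name and the warning behaviour are just one option):

import warnings

from sentence_transformers import SentenceTransformer

def infer_revision(model, revision: str | None = None) -> str | None:
    if revision is not None:  # an explicitly supplied revision always wins
        return revision
    if isinstance(model, SentenceTransformer):
        # available from Sentence Transformers v3.0; None for e.g. local models
        inferred = getattr(model.model_card_data, "base_model_revision", None)
        if inferred is not None:
            return inferred
    warnings.warn("Could not determine the model revision automatically; please supply it manually.")
    return None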

  • Tom Aarsen

tomaarsen avatar May 11 '24 13:05 tomaarsen

I agree. If we can construct it from Sentence Transformers, that is the ideal solution.

I am currently working on standardizing the results format (#639), so this issue is open for takers if someone wants to give it a go.

KennethEnevoldsen avatar May 11 '24 15:05 KennethEnevoldsen

Oh yeah, I meant to say that the revision would be manually supplied only when it does not exist on the SentenceTransformer model during the automatic checks 👍

I can take a look when v3.0 is released. Thanks again

isaac-chung avatar May 12 '24 11:05 isaac-chung

FYI all: v3.0 is released (btw, awesome work @tomaarsen and team!). I will take a look at this this week.

At a glance, if we were to follow this suggested results structure, the work is in two parts:

  1. Use the new structure for new results: will this affect any of the leaderboard scripts?
  2. Convert existing results into this structure: use the model revisions based on their result addition dates. E.g. the most recent model file changes for intfloat/multilingual-e5-small are from 10 months ago, so all results added within the last 10 months can use that revision (see the sketch after this list).
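
For part 2, a minimal migration could look like the sketch below (assuming the revision for each model has already been determined; the helper name is made up):

import shutil
from pathlib import Path

def migrate_model_results(model_dir: Path, revision: str) -> None:
    # move existing flat result files under a new revision subfolder
    target = model_dir / revision
    target.mkdir(exist_ok=True)
    for result_file in model_dir.glob("*.json"):
        shutil.move(str(result_file), str(target / result_file.name))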

isaac-chung avatar May 29 '24 10:05 isaac-chung

I believe this issue is now resolved. Thanks for all the comments on this!

KennethEnevoldsen avatar Jun 05 '24 18:06 KennethEnevoldsen