
Model registry: A proposal

Open KennethEnevoldsen opened this issue 10 months ago • 20 comments

We have previously suggested registering models to allow for reproducibility. We also include a lot of metadata on the benchmark leaderboard, which would be nice to register along with the model. My suggestion is as follows:

Implement a model class as follows:

from datetime import datetime
from typing import Callable, Literal

class ModelMeta:
    # Encoder and Language are existing mteb types (not defined here)
    loader: Callable[..., Encoder] | None = None  # if None, it will just default to loading the Sentence Transformers model
    name: str  # ideally the name on Hugging Face
    n_parameters: int
    memory_usage: float
    max_tokens: int
    embedding_dimension: int
    revision: str  # e.g. a Hugging Face commit hash
    release_date: datetime  # useful for tracking improvement over time on a given task
    license: str | None  # required if open source
    open_source: bool  # "Proprietary" / "Open"; we could remove this in favor of license
    framework: list[Literal["Sentence Transformers", "PyTorch", ...]]  # not exhaustive
    languages: list[Language]  # languages the model is intended for

I would expect the user interface to look something like this:

model_with_meta = mteb.get_model("intfloat/multilingual-e5-large")
tasks = mteb.get_tasks(languages=["eng"])

benchmark = MTEB(tasks=tasks)
benchmark.run(model_with_meta)
# or
encoder = model_with_meta.load_encoder()
benchmark.run(encoder)  # this is the current interface

What I am looking for is:

  • Do you agree with this approach? (A thumbs up will do.)
  • Any metadata we should add or remove
  • Any changes to the user interface

Related to #314, and a similar approach is already implemented in the Scandinavian Embedding Benchmark. Also previously discussed in #475.

Tagging relevant contributors:

  • @tomaarsen, this likely affects the leaderboard
  • @Muennighoff, @imenelydiaker, @x-tabdeveloping, @orionw @isaac-chung, as we discussed during the meeting.

KennethEnevoldsen avatar May 02 '24 08:05 KennethEnevoldsen

That's really interesting! Some suggestions here:

  • We could replace implementation with framework.
  • For languages, can multilingual be an option? We sometimes don't know how many languages a model handles.

imenelydiaker avatar May 02 '24 08:05 imenelydiaker

We could replace implementation with framework.

Agree, updated.

For languages, can multilingual be an option? We sometimes don't know how many languages a model handles.

I wonder if people might overuse the multilingual tag (e.g. using it even though a model is only trained on Indo-European languages). A solution might be to have predefined lists of languages, such as Indo-European, which people can use. Multilingual would then just be a list of all languages in MTEB.
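Roughly, I imagine something like this (a rough sketch; the codes and the helper are illustrative only):

INDO_EUROPEAN = ["eng", "deu", "fra", "spa", "hin"]  # illustrative subset of ISO 639-3 codes
MULTILINGUAL = all_mteb_languages()  # hypothetical helper returning every language covered by MTEB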

KennethEnevoldsen avatar May 02 '24 11:05 KennethEnevoldsen

I think this looks good. Just wanted to clarify the "register" bit. Curious whether only some of the model metadata would be stored for each eval run (e.g. only the model revision), or whether the entire filled class would be stored, e.g. as a separate file. I assume we'd try to read as many fields as possible from the HF model card as well.

isaac-chung avatar May 02 '24 11:05 isaac-chung

So by a registry I simply mean that you are able to fetch the ModelMeta for a source, and users are allowed to update that source (by registering their own models). If the ModelMeta isn't registered in MTEB, it would default to extracting as much metadata as possible, if toggled on:

modelmeta = mteb.get_model("my_custom_model", estimate_metadata=True)
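
Under the hood, the registry could be as simple as the following sketch (the names are illustrative, not a final API):

model_registry: dict[str, ModelMeta] = {}

def register_model(meta: ModelMeta) -> None:
    model_registry[meta.name] = meta

def get_model(name: str, estimate_metadata: bool = False) -> ModelMeta:
    if name in model_registry:
        return model_registry[name]
    if estimate_metadata:
        # hypothetical fallback: read whatever fields can be extracted from the model card
        return estimate_meta_from_model_card(name)
    raise KeyError(f"{name!r} is not registered; pass estimate_metadata=True to infer what is possible")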

For a model in the results folder, I would have a structure somewhat like:

results
|-- {model_name}
|   |-- model_meta.json
|   |-- {dataset1}.json
|   |-- {dataset2}.json
|   |-- ...
|-- ...

I am not sure what happens if the model revision is updated. Any suggestions? Potentially solved by saving results using "{model_name}-{revision}".
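
For example (a sketch of the "{model_name}-{revision}" idea; the helper name is made up):

from pathlib import Path

def results_dir(model_name: str, revision: str) -> Path:
    # slashes in Hugging Face names ("org/model") are not valid in a single folder name
    return Path("results") / f"{model_name.replace('/', '__')}-{revision}"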

KennethEnevoldsen avatar May 02 '24 12:05 KennethEnevoldsen

To make sure I understand: this would be like the results metadata that currently exists for many models, but with model-specific data? Seems like a good idea.

We'd still have the same issue of a lot of models not having it filled out, but we could create some repo that stores extra details for ones that don't give it (APIs, old models that people won't change, etc.)

orionw avatar May 02 '24 13:05 orionw

Once we have it in, it should be much easier to require it for result PRs as well, if that is what we want to do.

Looks like most people agree. I will see if I can find time to add this tomorrow; otherwise I will do it next week.

KennethEnevoldsen avatar May 02 '24 15:05 KennethEnevoldsen

I think versioning the models would make a lot of sense. Also, requiring people to specify what has changed between versions would be very useful (a changelog of sorts). I found it quite confusing that models can sometimes jump several places up or down on the leaderboard without any clarification or indication that these were different versions. (I remember this happening to Mistral E5.)

x-tabdeveloping avatar May 03 '24 09:05 x-tabdeveloping

Hmm, I don't believe the changelog should live on our end; as long as it is only the revision that changes, it should be clear from the model repo (and if it is a change in implementation, it should be clear from the loader).

KennethEnevoldsen avatar May 03 '24 09:05 KennethEnevoldsen

Really great idea! What do you think about making this a new repository? It could become something like a library of embedding models (including APIs). Afaict there is no such library atm, as SentenceTransformers only covers a specific kind of model. I think that would be better along the axes of:

a) Scalability: MTEB stays more lightweight and thus easier to scale. That library would also not be limited by mteb and could scale to potentially become even more impactful than mteb itself.

b) Usability: People could use that library on its own if they just want to load various models. Similarly, people would have fewer requirements to install when installing mteb, as many models will likely require new packages.

Keeping things modular has been a big part of the success of HF, I think: making HF/transformers, HF/datasets, and HF/evaluate a single library would have massively limited their scalability.

Similarly, having one folder/file for each model, at the cost of some duplicate code, is why HF/transformers has scaled so well, I think. Reusing components saves lines of code but can hinder both scalability and usability, as e.g. in this codebase or even partly in Sentence Transformers.

Curious about your thoughts; Happy to be disagreed with! 😊

Muennighoff avatar May 03 '24 16:05 Muennighoff

Just to start off with: I completely agree with the notion of keeping libraries lightweight, and that this allows for better scaling.

The goal of the proposed solution seems to fall in line with the existing model hub which HF provides (though with a few standard interfaces).

The primary intention of the model registry within MTEB is, however, to document how code is run specifically on MTEB, and I worry that a lot of the code will be benchmark-specific (e.g. prompts for tasks). A solution could be to start it off within MTEB (simply intended for documenting models to begin with), but keep it as a separate module and then potentially split it out in the future if the need arises.

Similarly, having one folder/file for each model, at the cost of some duplicate code, is why HF/transformers has scaled so well, I think. [...]

I believe, for example, that implementing models in this way would be a very reasonable approach here as well.

KennethEnevoldsen avatar May 06 '24 12:05 KennethEnevoldsen

Sounds good to me! We can split it out later if it makes sense and is sufficiently separate (e.g. one could also just store the kwargs inside MTEB to exactly reproduce runs with models from the embedding library)

Muennighoff avatar May 06 '24 14:05 Muennighoff

I totally agree with @KennethEnevoldsen here. I think keeping things nice and separate would amount to us making the most stupidly simple implementations of everything here in MTEB and abstracting as little behaviour as humanly possible. I think, especially now with our current pace, another library would be a liability, as MTEB would be coupled to an external code base that needs to be maintained independently.

I believe if we want to create a one-stop-shop for embedding models we should perhaps contribute to SentenceTransformers.

x-tabdeveloping avatar May 08 '24 13:05 x-tabdeveloping

Thinking about this more, I think our goal here is to keep track of which model version yielded which result on which dataset revision.

Could this proposal be simplified to the following? In the results folder, use the model revision as an intermediate layer between model names and result files.

results
|-- {model_name}
|   |-- {model_revision}
|   |   |-- datasetA.json

However, the model revision is not currently exposed on the SentenceTransformer model, so for this to work we would need some changes there. Not sure how much of a lift that would be; maybe @tomaarsen can help answer that.
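
In code, a lookup in this layout could be as simple as (a sketch; the helper name is made up):

from pathlib import Path

def result_file(model_name: str, model_revision: str, dataset: str) -> Path:
    # note: a name like "org/model" adds one extra nesting level here;
    # it could instead be escaped as in the sketch further up in the thread
    return Path("results") / model_name / model_revision / f"{dataset}.json"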

isaac-chung avatar May 11 '24 11:05 isaac-chung

Model revision should be accessible in Sentence Transformers v3.0 via model.model_card_data.base_model_revision (ETA: ~2 weeks, only need to update the docs). However, not all models will have a revision. In particular, local ones won't.

Another point of note: the name of the loaded model should also become accessible via model.model_card_data.base_model, again only if the model is loaded from Hugging Face.

  • Tom Aarsen

tomaarsen avatar May 11 '24 12:05 tomaarsen

Also, what if the model is not usable in SentenceTransformers? We would ideally still have some version or revision on those. With proprietary embedding models it might also be reasonable to record the date, as those might change at any point without notice from the company.

x-tabdeveloping avatar May 11 '24 12:05 x-tabdeveloping

Thanks @tomaarsen and @x-tabdeveloping. So in terms of date, I'd say that's more or less covered by the date at which the results were committed. The exact date of the model could be part of the model repository to avoid bloating MTEB.

As for models not from Sentence Transformers or HF (those that still match the encoder interface), maybe the solution would be to require the user to supply the revision manually, instead of reading it automatically as in my proposal above.

I'd like to see if there is any other feedback on the simpler proposal, as it would remove the need for the ModelMeta class and for tracking a folder/file per model.

isaac-chung avatar May 11 '24 12:05 isaac-chung

Though I'm not sure how this would fit our timeline, or whether this is a blocker for model runs for the submission. cc @KennethEnevoldsen

isaac-chung avatar May 11 '24 12:05 isaac-chung

As for models not from Sentence Transformers or HF (those that still match the encoder interface), maybe the solution would be to require the user to supply the revision manually, instead of reading it automatically as in my proposal above.

Generally, my favorite solution is to check for the few cases where you know that you can automatically gather information, e.g. by checking if a model is a SentenceTransformer instance, and if those automated solutions fail: raise an error that the user must provide that information if it cannot be gathered automatically (or leave it as None with a warning if it's not sufficiently important).

That removes some of the burden from the user, so they don't have to provide extra information that can be gathered automatically in 70% of cases.
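
Something like this sketch, using the v3.0 attribute mentioned above (the helper name and the warning behaviour are just one option):

import warnings

from sentence_transformers import SentenceTransformer

def infer_revision(model, revision: str | None = None) -> str | None:
    if revision is not None:  # an explicitly supplied revision always wins
        return revision
    if isinstance(model, SentenceTransformer):
        # available from Sentence Transformers v3.0; None for e.g. local models
        inferred = getattr(model.model_card_data, "base_model_revision", None)
        if inferred is not None:
            return inferred
    warnings.warn("Could not determine the model revision automatically; please supply it manually.")
    return None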

  • Tom Aarsen

tomaarsen avatar May 11 '24 13:05 tomaarsen

I agree. If we can construct it from Sentence Transformers, that is the ideal solution.

I am currently working on standardizing the results format (#639), so this issue is open for takers if someone wants to give it a go.

KennethEnevoldsen avatar May 11 '24 15:05 KennethEnevoldsen

Oh yeah, I meant to say that the revision would be manually supplied only when it does not exist on the SentenceTransformer model during the automatic checks 👍

I can take a look when v3.0 is released. Thanks again

isaac-chung avatar May 12 '24 11:05 isaac-chung

FYI all: v3.0 is released (btw, awesome work @tomaarsen and team!). I will take a look at this this week.

At a glance, if we were to follow this suggested results structure, the work is in two parts:

  1. Use the new structure for new results: will this affect any of the leaderboard scripts?
  2. Convert existing results into this structure: use the model revisions based on their result addition dates. E.g. the most recent model file changes for intfloat/multilingual-e5-small are from 10 months ago, so all results added within the last 10 months can use that revision (see the sketch after this list).
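
For part 2, a minimal migration could look like the sketch below (assuming the revision for each model has already been determined; the helper name is made up):

import shutil
from pathlib import Path

def migrate_model_results(model_dir: Path, revision: str) -> None:
    # move existing flat result files under a new revision subfolder
    target = model_dir / revision
    target.mkdir(exist_ok=True)
    for result_file in model_dir.glob("*.json"):
        shutil.move(str(result_file), str(target / result_file.name))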

isaac-chung avatar May 29 '24 10:05 isaac-chung

I believe this issue is now resolved. Thanks for all the comments on this!

KennethEnevoldsen avatar Jun 05 '24 18:06 KennethEnevoldsen