Exploratory Analysis of Models on the Hub
Using the huggingface_hub library, I was able to collect some statistics on the 9,984 models that are currently hosted on the Hub. The main goal of this exercise was to find answers to the following questions:
- How many model architectures can be mapped to tasks that we wish to evaluate? For example, a model with the BertForSequenceClassification architecture is likely to be about text classification; similarly for the other ModelNameForXxx architectures.
- How many models have an architecture, dataset, and metric in their metadata?
- Which tasks are most common?
Number of models per dimension
Without applying any filters on the architecture names, the number of models per criterion is shown in the table below:
| Has architecture | Has dataset | Has metric | Number of models | 
|---|---|---|---|
| ✅ | ❌ | ❌ | 8129 | 
| ✅ | ✅ | ❌ | 1241 | 
| ✅ | ✅ | ✅ | 359 | 
These numbers include models for which a task may not be easily inferred from the architecture alone. For example, BertModel would presumably be associated with a feature-extraction task, but such models are not simple to evaluate.
By filtering for architecture names that contain any of "For", "MarianMTModel" (translation), or "LMHeadModel" (language modelling), we arrive at the following table:
| Has task | Has dataset | Has metric | Number of models | 
|---|---|---|---|
| ✅ | ❌ | ❌ | 7452 | 
| ✅ | ✅ | ❌ | 1150 | 
| ✅ | ✅ | ✅ | 337 | 
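As a rough illustration of this filter (a sketch only; the architecture names below are made-up examples and df is a hypothetical DataFrame with one architecture per model):

```python
import pandas as pd

# Hypothetical examples, one architecture name per model
df = pd.DataFrame(
    {"architecture": ["BertForSequenceClassification", "MarianMTModel", "GPT2LMHeadModel", "BertModel"]}
)

# Keep models whose architecture name suggests an evaluable task
mask = df["architecture"].str.contains("For|MarianMTModel|LMHeadModel", regex=True)
evaluable_models = df[mask]
```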
Architecture frequencies
Some models either have no architecture (e.g. the info is missing from the config.json file or the model belongs to another library like Flair), or multiple ones:
| Number of architectures | Number of models | 
|---|---|
| 0 | 1755 | 
| 1 | 8125 | 
| 2 | 1 | 
| 3 | 3 | 
Based on these counts, it makes sense to focus only on models with a single architecture.
Number of models per task
For models with a single architecture, I extract the task name from the architecture name according to the following mappings (a rough sketch of this extraction follows the list):
- "MarianMTModel" => "Translation"
- architectures containing "LMHeadModel", "LMHead", "MaskedLM", "CausalLM" => "LanguageModeling"
- architectures containing "Model", "DPR", "Encoder" => "Model"
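A rough sketch of this mapping, assuming one architecture string per model (not the exact code used in the analysis):

```python
def architecture_to_task(architecture: str) -> str:
    # Rough reimplementation of the mapping described above; the exact rules
    # in the original analysis may differ slightly.
    if architecture == "MarianMTModel":
        return "Translation"
    if any(s in architecture for s in ("LMHeadModel", "LMHead", "MaskedLM", "CausalLM")):
        return "LanguageModeling"
    # e.g. BertForSequenceClassification -> SequenceClassification
    if "For" in architecture:
        return architecture.split("For", 1)[-1]
    if any(s in architecture for s in ("Model", "DPR", "Encoder")):
        return "Model"
    return "unknown"
```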
The resulting frequency counts are shown below:
| Task | Number of models | 
|---|---|
| LanguageModeling | 3250 | 
| Translation | 1354 | 
| SequenceClassification | 829 | 
| ConditionalGeneration | 766 | 
| Model | 655 | 
| QuestionAnswering | 364 | 
| CTC | 318 | 
| TokenClassification | 286 | 
| PreTraining | 163 | 
| MultipleChoice | 37 | 
| MultiLabelSequenceClassification | 17 | 
| ImageClassification | 15 | 
| MultiLabelClassification | 11 | 
| Generation | 7 | 
| ImageClassificationWithTeacher | 4 | 
Fun stuff
We can visualise which tasks are connected to which datasets as a graph. Here we show the top 10 tasks (measured by node connectivity), with the top 20 datasets marked in orange.
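A sketch of how such a graph could be drawn with networkx, assuming a DataFrame df with one row per model and task / dataset columns; this is not the exact code used to produce the figure:

```python
import networkx as nx
import matplotlib.pyplot as plt

def plot_task_dataset_graph(df, n_tasks=10, n_datasets=20):
    # Bipartite graph: one node per task, one per dataset,
    # with an edge whenever some model is tagged with both.
    G = nx.Graph()
    for _, row in df.dropna(subset=["task", "dataset"]).iterrows():
        G.add_edge(f"task:{row['task']}", f"dataset:{row['dataset']}")

    tasks = [n for n in G if n.startswith("task:")]
    datasets = [n for n in G if n.startswith("dataset:")]

    # Rank nodes by degree as a simple proxy for connectivity
    top_tasks = sorted(tasks, key=G.degree, reverse=True)[:n_tasks]
    top_datasets = set(sorted(datasets, key=G.degree, reverse=True)[:n_datasets])

    sub = G.subgraph(top_tasks + [d for t in top_tasks for d in G.neighbors(t)])
    colors = ["orange" if n in top_datasets else "lightblue" for n in sub.nodes]
    nx.draw(sub, node_color=colors, with_labels=True, font_size=6)
    plt.show()
```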

Thanks to a tip from @osanseviero and @julien-c, I can improve the analysis by making use of ModelInfo.pipeline_tag to infer the tasks. I'll update the analysis with this improved mapping.
Using the ModelInfo approach suggested by @osanseviero and @julien-c makes the analysis much simpler :)
Breakdown by task
First, the pipeline_tag already contains the task information and provides a more realistic grouping of the model architectures:
| pipeline_tag | number_of_models | 
|---|---|
| unknown | 2394 | 
| text-generation | 2286 | 
| translation | 1373 | 
| fill-mask | 958 | 
| text-classification | 860 | 
| text2text-generation | 748 | 
| question-answering | 368 | 
| automatic-speech-recognition | 329 | 
| token-classification | 324 | 
| summarization | 228 | 
| conversational | 32 | 
| image-classification | 22 | 
| audio-source-separation | 19 | 
| table-question-answering | 19 | 
| text-to-speech | 17 | 
| zero-shot-classification | 17 | 
| feature-extraction | 8 | 
| object-detection | 5 | 
| voice-activity-detection | 3 | 
| image-segmentation | 3 | 
| Semantic Similarity | 2 | 
| sentence-similarity | 2 | 
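For reference, this breakdown can be reproduced from the metadata DataFrame built by the code snippet at the end of this post; the sketch below assumes df is the DataFrame returned by get_model_metadata():

```python
# df comes from get_model_metadata(), defined in the snippet further below;
# models without a pipeline tag are stored with pipeline_tag == "unknown".
df["pipeline_tag"].value_counts()
```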
We can see there are two similar-looking tasks: Semantic Similarity and sentence-similarity. Looking at the corresponding model IDs in the table below, these appear to be models produced with sentence-transformers.
| model_id | 
|---|
| Sahajtomar/french_semantic | 
| Sahajtomar/sts-GBERT-de | 
| osanseviero/full-sentence-distillroberta2 | 
| osanseviero/full-sentence-distillroberta3 | 
Suggestion: rename Semantic Similarity to sentence-similarity to match the naming convention of the pipeline tags.
Drilling down on the unknown pipeline tags
We can see 2,394 models are currently missing a pipeline tag, which is about 24% of all the models currently on the Hub:
| has_pipeline_tag | num_models | 
|---|---|
| True | 7623 | 
| False | 2394 | 
Of the models without a pipeline tag, we can drill down further by asking how many of them have a config.json file:
| has_config | num_models | 
|---|---|
| True | 1253 | 
| False | 1141 | 
Interestingly, the list of model IDs without a pipeline tag but with a config.json file includes models like distilbert-base-uncased, for which an architecture field probably did not exist when the model was trained.
A list of the model IDs is attached:
2021-06-04_models-without-pipeline-tag-with-config.csv
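For reference, both of the breakdowns above can be computed from the same metadata DataFrame, using the column names from the snippet in the next section (a sketch, assuming df = get_model_metadata()):

```python
# How many models have a pipeline tag at all?
df["has_pipeline_tag"].value_counts()

# Of the models without a pipeline tag, how many ship a config.json?
df.loc[~df["has_pipeline_tag"], "has_config"].value_counts()
```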
Code snippet to pull metadata
```python
import pandas as pd
from huggingface_hub import HfApi


def get_model_metadata():
    """Collect basic metadata (pipeline tag, tags, presence of README/config) for every model on the Hub."""
    all_models = HfApi().list_models(full=True)
    metadata = []
    for model in all_models:
        has_readme = False
        has_config = False
        has_pipeline_tag = False
        pipeline_tag = "unknown"
        if model.pipeline_tag:
            pipeline_tag = model.pipeline_tag
            has_pipeline_tag = True
        # Check the repo's files for a model card and a config.json
        for sibling in model.siblings:
            if sibling.rfilename == "README.md":
                has_readme = True
            if sibling.rfilename == "config.json":
                has_config = True
        metadata.append(
            (
                model.modelId,
                pipeline_tag,
                model.tags,
                has_pipeline_tag,
                has_config,
                has_readme,
            )
        )
    df = pd.DataFrame(
        metadata,
        columns=[
            "model_id",
            "pipeline_tag",
            "tags",
            "has_pipeline_tag",
            "has_config",
            "has_readme",
        ],
    )
    return df
```
Thanks for the analysis!
re: Semantic Similarity. The user overrode the pipeline tag in the metadata some months ago. I agree that these should be sentence-similarity, which is a fairly recent task.
Some brainstorming ideas for further analysis:
re: If I understand correctly, we have 2,394 repos without a pipeline tag and we might want to put some effort into those. Of those repos:
- 1,253 have a config.json. Is there anything we can obtain from the config to understand what they are (see the sketch after this list)? (Unrelated: I think we assume that these are Transformers-based, so maybe it also makes sense to add a transformers tag, which we don't use at the moment.)
- Not necessarily useful, but it might also be worth classifying the ones without a config and without a pipeline tag. We could check whether (1) the repo is empty, or (2) the repo has a file that could correspond to another library, etc. This might be more challenging, though.
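A minimal sketch of how the config.json of such a repo could be inspected (a hypothetical helper, not from this thread; it assumes the repo actually contains a config.json):

```python
import json
from huggingface_hub import hf_hub_download

def get_architectures(model_id: str) -> list:
    """Download a repo's config.json and return its "architectures" field, if any."""
    config_path = hf_hub_download(repo_id=model_id, filename="config.json")
    with open(config_path) as f:
        config = json.load(f)
    return config.get("architectures", [])
```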
We currently assume (see the code in ModelInfo.ts, it's fairly short to read) that a model that has a config.json file and no library name in its tags is a transformers model.
Lots of model repos are empty (or WIPs), so I wouldn't aim to classify all models.
PS/ fixed the metadata for one model in https://huggingface.co/Sahajtomar/french_semantic/commit/2392beb954ae32dafa587e03f278a0158d1da7b5 and the other in https://huggingface.co/Sahajtomar/sts-GBERT-de/commit/935c5217fd8f03de0ccd9e6e3f34e21651573e84. In both cases:
- pytorch is unneeded, as it's inferred from the files
- the task name can be in tags and will be inferred as the pipeline type
- fixed the library name (it can also be in tags)
Finally, cc'ing the model author @Sahajtomar for visibility. Let us know if there's any issue 🙂
> We currently assume (see the code in ModelInfo.ts, it's fairly short to read) that a model that has a config.json file and no library name in its tags is a transformers model.
Yes, I was suggesting that maybe we should make it more explicit and add the transformers tag to those. As we intend to expand our usage to more libraries, longer term I think we should reduce the magic that happens on our side and have transformers as an explicit tag. (Related PR: https://github.com/huggingface/moon-landing/pull/746)
Yes, agreed. Probably not short term, but when we start adding more validation to the yaml block in models, we can 1/ add this rule, 2/ update all updatable models on the Hub.
While checking for suitable datasets for model evaluation, I discovered that several models have typos / non-conventional naming in their datasets: tag.
Using some fuzzy string matching, I compiled a list of (model, dataset, closest_dataset_match) tuples, where the closest match to a canonical dataset on the Hub was determined (arbitrarily) by whether the Levenshtein similarity score is > 85.
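A minimal sketch of this kind of fuzzy matching, using difflib from the standard library as a stand-in for whatever matcher was actually used (the 0.85 cutoff roughly mirrors the > 85 threshold above):

```python
import difflib
from huggingface_hub import HfApi

def find_dataset_typos(cutoff: float = 0.85):
    """Return (model_id, dataset_tag, closest_dataset_match) for dataset tags
    that don't exactly match a dataset on the Hub but are close to one."""
    api = HfApi()
    canonical = [d.id for d in api.list_datasets()]
    rows = []
    for model in api.list_models(full=True):
        for tag in model.tags:
            if not tag.startswith("dataset:"):
                continue
            dataset = tag.split("dataset:", 1)[1]
            if dataset in canonical:
                continue
            match = difflib.get_close_matches(dataset, canonical, n=1, cutoff=cutoff)
            if match:
                rows.append((model.modelId, dataset, match[0]))
    return rows
```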
I wonder whether it would be useful to include some sort of validation in the yaml, along the lines of "did you mean dataset X?", where X is one of the datasets hosted on the Hub?
> I wonder whether it would be useful to include some sort of validation in the yaml, along the lines of "did you mean dataset X?", where X is one of the datasets hosted on the Hub?
Yes 👍
Also check out https://observablehq.com/@huggingface/kaggle-dataset-huggingface-modelhub from @severo, which looks great (cc @gary149).