Add models to Hugging Face Hub
Hey @alexstoken 👋
I am Aritra from Hugging Face, and I stumbled upon your repository from this tweet. While looking into the repository, I saw that all the model weights are hosted at different places by the authors. I propose a unification of model hosting.
We can upload all the models to Hugging Face Hub, under an organisation.
Authors can be part of the org, upload their models, and keep using links (as is done now in the code base) to download them with the Hugging Face Hub Python library. To be clear, we are not asking you to add the models to the transformers code base, which would be a lot of effort, but simply to keep the models in one unified place that is easy to download from with the Python library.
The autonomy stays in your hands, while all the models live in one place. Adding the models to Hugging Face would increase visibility, as we have noticed with a lot of orgs and teams, and also make it easier to track the downloads of each model.
Let me know what you think about it, and how I could help you in the process.
Hi @ariG23498!
Thanks for outlining the process of joining the HF Hub. I'm interested, but have a few more questions.
I was looking at the docs; would end users need to get an access token to download models? I understand that as providers/model uploaders we would need tokens, but it seems like a barrier to ask all users to make a token just to download (especially as they don't need tokens now). Are models on the Hub still available via a fixed URL, or only via the HF_hub downloader?
Second, do you know if there are any limitations with uploading models with various licenses? As we are not the model creators, but simply a unifying interface, I want to be sure we can safely upload most/all models. I think that since the models are already openly available and we are just moving them to a new location, all of this should be permissible, but I'm curious if you have thoughts on this.
The goal is always to make the repo more accessible to the community and increase engagement, and this seems like a good step in that direction.
I am glad that you are interested in this @alexstoken 😄
would end users need to get an access token to download models?
Authentication is recommended but optional for accessing public models or datasets. Only if a model is gated or private would users need a token to access it. This also brings me to another pro: it is possible to gate models (based on some logic, mostly used for controlling geographic access), which is not possible with the current setup of models on Google Drive.
Are models on the hub still available via a fixed url, or only via the HF_hub downloader?
Once a model is uploaded to the Hub, every file gets a fixed URL (the raw resolve links also work with curl or wget), but we generally use the huggingface_hub Python library for uploading and downloading models. As far as I understand it, we would only need the snapshot_download API for downloading the models uploaded to the Hub.
As an example, here is how I downloaded the PaddleOCR-VL model:
from huggingface_hub import snapshot_download
path_to_model = snapshot_download(
    repo_id="PaddlePaddle/PaddleOCR-VL",
)
!ls $path_to_model
# added_tokens.json PP-DocLayoutV2
# chat_template.jinja preprocessor_config.json
# config.json processing_paddleocr_vl.py
# configuration_paddleocr_vl.py processor_config.json
# generation_config.json README.md
# image_processing.py special_tokens_map.json
# inference.yml tokenizer_config.json
# LICENSE tokenizer.json
# modeling_paddleocr_vl.py tokenizer.model
# model.safetensors
Second, do you know if there are any limitations with uploading models with various licenses?
You can add the licenses to individual models right inside the model card.
As we are not the model creators, but simply a unifying interface, I want to be sure we can safely upload most/all models
If the models are publicly accessible right now, it should not be a problem to host them on the Hugging Face Hub, provided we specify each model's license in its model card so that end users are aware of the usage terms.
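For concreteness, here is a minimal sketch of setting a license in a model card's YAML metadata with the huggingface_hub library; the repo id and license value below are placeholders and should match whatever the original authors chose:

from huggingface_hub import ModelCard

# Placeholder license and repo id: use the license declared by the original authors.
content = """---
license: apache-2.0
---

# Example model weights

Mirrored from the original release; see the upstream LICENSE for the exact terms.
"""

card = ModelCard(content)
card.push_to_hub("your-org/your-model")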
Hope the answers made sense; please do not hesitate to ask more questions. I would also like to know which model is currently the most used. I would like to host that model on the Hub and try to build an MVP using the Hub APIs. This would help us gauge how easy it would be to carry out this project. What do you think?
Hi @ariG23498, thank you for the clear reply; the example makes things clearer. We'd be happy to move forward, and having the models on the Hub would also help us track which models get downloaded the most. Probably the most used model currently is SuperPoint + LightGlue, at least based on their GitHub stars. So if you could get that one done, and perhaps also add some instructions to CONTRIBUTING.md, then we could slowly move towards the Hub.
Hi @alexstoken @gmberton
I noticed that for SuperPoint + LightGlue, all the models are hosted in a GitHub Release. Here is a Colab notebook that I used to first download the models and then upload them to the Hugging Face Hub.
All the models are now inside: https://huggingface.co/ariG23498/LightGlue
I wanted to make minimal changes to the code, so I did not convert the .pth files into safetensors, but that is easy to do and highly recommended (if you are interested, I could create a notebook for that as well).
We can now take the absolute file URLs from the Hub and use them with the torch.hub.load_state_dict_from_url API, as is done inside the LightGlue GitHub repository.
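As a quick illustration (a minimal sketch, assuming the file keeps its original superpoint_v1.pth name in the repo above), the fixed resolve URL of a file on the Hub can be passed directly to torch.hub:

import torch

# Fixed "resolve" URL of a file inside the Hub repo linked above.
url = "https://huggingface.co/ariG23498/LightGlue/resolve/main/superpoint_v1.pth"

# Downloads the file into the local torch hub cache (once) and loads the state dict.
state_dict = torch.hub.load_state_dict_from_url(url, map_location="cpu")
print(list(state_dict.keys())[:5])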
- I have added a PR to track downloads of the lightglue models.
- Also added a PR to lightglue to add HF paths.
Thank you @ariG23498 !
How do you suggest that we proceed? Should I create an image-matching-models org through this link?
Then create an issue with labels good first issue and help wanted asking for help transferring the weights of each model to the new org?
Thanks @ariG23498! This looks great so far. Really like the torch.hub.load_state_dict_from_url feature as well.
I am also interested in the safetensors conversion, if you don't mind making a demo notebook for that.
Will track those PRs you mentioned and figure out the next best steps with @gmberton.
Hi @gmberton, an image-matching-models org would be really nice (the link seems to be correct).
Then create an issue with labels good first issue and help wanted asking for help transferring the weights of each model to the new org?
My personal opinion would be to wait for some time to see how the authors of LightGlue respond to the PR. This could become a bottleneck for our work, so let me ask: if the PR is not merged, would you be willing to fork their repo, add the HF links there, and use that fork as the git submodule inside the current image-matching-models repo?
I am also interested in the safetensors conversion, if you dont mind making a demo nb for that.
Here is a minimal script showcasing how to convert the .pth files into .safetensors and then load them back.
from huggingface_hub import snapshot_download
from safetensors.torch import save_file
from safetensors import safe_open
import torch
path = snapshot_download("ariG23498/LightGlue")
superpoint_v1_path = f"{path}/superpoint_v1.pth"
# open the pth file to get the weight dictionary
weight_dict = torch.load(superpoint_v1_path)
# save the weight dictionary as safetensors
save_file(weight_dict, "superpoint_v1.safetensors")
# open the safetensors as weight dictionary
tensors = {}
with safe_open("superpoint_v1.safetensors", framework="pt") as f:
    for k in f.keys():
        tensors[k] = f.get_tensor(k)
tensors.keys()
Hi @ariG23498 ,
Thanks for the safetensors script, that seems quite doable!
I'm not sure why we need to submit PRs to or fork each third-party repo. Can't we just add the models to the to-be-created image-matching-models org and then update the download_weights function (e.g. in ELoFTR) for each model?
In this case LightGlue might be a bit of an exception. For most other models we have the download_weights function and do that within this repo rather than within the submodule. In these cases, do you see a need for PRs?
@alexstoken that is a brilliant point. I might have extrapolated quite a bit using LightGlue as an example.
Here are the steps that I took for ELoFTR:
- Downloaded the model from Google Drive
!pip install -q pytorch_lightning # necessary for the download
from pathlib import Path
from safetensors.torch import save_file
from huggingface_hub import upload_file
import gdown
import torch
def download_weights(weights_src, model_path):
    gdown.download(
        weights_src,
        output=model_path,
        fuzzy=True,
    )
weights_src = "https://drive.google.com/file/d/1jFy2JbMKlIp82541TakhQPaoyB5qDeic/view"
model_path = "eloftr_outdoor.ckpt"
download_weights(weights_src=weights_src, model_path=model_path)
- Saved the state dict as safetensors
state_dict = torch.load(model_path, map_location=torch.device("cpu"), weights_only=False)["state_dict"]
save_file(state_dict, "eloftr_outdoors.safetensors")
- Uploaded the safetensors file to the Hub
from huggingface_hub import HfApi
api = HfApi()
api.upload_file(
    path_or_fileobj="eloftr_outdoors.safetensors",
    path_in_repo="eloftr_outdoors.safetensors",
    repo_id="ariG23498/eloftr",
)
You can access the weights here: https://huggingface.co/ariG23498/eloftr
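To connect this back to the download_weights point: a Hub-backed variant could look roughly like the sketch below (an illustration only, using hf_hub_download, not necessarily the exact code we would ship in a PR):

from huggingface_hub import hf_hub_download

def download_weights(repo_id: str, filename: str) -> str:
    # Download the file from the Hub (cached locally) and return its path.
    return hf_hub_download(repo_id=repo_id, filename=filename)

model_path = download_weights(
    repo_id="ariG23498/eloftr",
    filename="eloftr_outdoors.safetensors",
)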
I have also created a draft PR so that we can move holistically to the Hub using this example.
I can also add this workflow to the contributing guide so that we can start delegating this to the community. What do you think?
I have created the image-matching-models org!
I will start populating it with some models in the next few days. I'll see if I run into any major issues; if I don't, we can then start asking for help from the community.
And yes @ariG23498, adding some instructions on how to do this on the CONTRIBUTING.md would be great, thanks! It might just be good to make the upload to HF an optional step (some authors might prefer to not upload the weights on HF but store them themselves).
This looks great! Thanks both for getting this running. PR looks nice - should be simple to replicate for the rest of the models.
Agree with adding it to CONTRIBUTING.md.
We can leave this issue open until @gmberton and I upload a few models to test out the workflow, though I imagine there won't be any issues.
We should be sure the next PR adds huggingface_hub to requirements.txt.
@gmberton thanks for creating the org. I have sent a request to join it as well.
Here is the PR for the contributing guide.
A question to @alexstoken and @gmberton:
Would you like an image-matching-models tag to be added to the models under the org for better visibility?
PR merged and added you as admin to the org @ariG23498!
And yes, I see only pros of having the image-matching-models tag, I think we should do that.
Thanks @gmberton 🤟
I have started with the eloftr model, and have added it to the organisation.
I am also trying to add the image-matching-models tag to our backend for better searchability and model download counts. After this is merged I will update the contribution guide to add this tag.
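In the meantime, the tag can already be added to the metadata of each uploaded model. Here is a minimal sketch with huggingface_hub (the repo id below is an example, assuming the eloftr repo lives under the new org):

from huggingface_hub import ModelCard

# Example repo id; the same pattern applies to every model mirrored in the org.
repo_id = "image-matching-models/eloftr"

card = ModelCard.load(repo_id)
# Append the shared tag to the card's YAML metadata and push the updated card.
card.data.tags = (card.data.tags or []) + ["image-matching-models"]
card.push_to_hub(repo_id)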
Also, I think we should communicate about this new org and our collaboration on Twitter and LinkedIn. This should reach a wider audience. But we can decide on this later, once we have more models in the org. What do you both think?
Yes, that sounds good. Better visibility might also help us find more collaborators to add new models
Cool! Let me know when you folks want to communicate on the socials. I would love to amplify from my end.
That sounds great, we appreciate the amplification and support!
For me, a reasonable target for having most models prepared is the first week of December. I should be able to add many models in that time. How does that sound to you both? @ariG23498 @gmberton
Works for me! 🚀
Let me know in case you folks face any issues.