How to package Hugging Face into Nvidia Triton Inference Server for deployment
I was recently deploying Hugging Face models on the Triton Inference Server, which helped me increase my GPU utilization and serve multiple models on a single GPU.
I was not able to find good resources during the process.
@sayakpaul
@nickaggarwal thank you so much!
In order for us to better understand this, could you provide an outline of the things that you want to cover in the tutorial?
Also cc: @osanseviero @philschmid
Hi @sayakpaul
The tutorial would cover how to take models from Hugging Face, a machine learning library, and package them into Nvidia Triton, an open-source inference serving software. It would be a detailed 4-step tutorial:
- Getting Started with Hugging Face
- Deploying a Hugging Face model on Nvidia Triton
- Deploying Triton Inference containers in Kubernetes
- Efficient utilization of GPUs
Within the tutorial I will also cover how to package and push files to the Triton model repository, how to deploy the model using a Hugging Face pipeline inside Triton's Python-backend template, etc.
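Roughly, the Python-backend `model.py` (placed under `model_repository/<model_name>/1/` next to a `config.pbtxt`) could look something like the sketch below; the model id and tensor names (`TEXT`, `LABEL`, `SCORE`) are illustrative assumptions, not the tutorial's final code:

```python
# model.py -- minimal sketch of a Triton Python-backend model wrapping a
# Hugging Face pipeline. The model id and tensor names are illustrative only.
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import pipeline


class TritonPythonModel:
    def initialize(self, args):
        # Load the Hugging Face pipeline once when Triton loads the model.
        self.pipe = pipeline(
            "sentiment-analysis",
            model="cardiffnlp/twitter-roberta-base-sentiment-latest",
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            # The input arrives as a BYTES tensor; decode it to a Python string.
            text = pb_utils.get_input_tensor_by_name(request, "TEXT")
            text = text.as_numpy()[0].decode("utf-8")

            result = self.pipe(text)[0]  # e.g. {"label": "positive", "score": 0.98}

            label = pb_utils.Tensor(
                "LABEL",
                np.array([result["label"].encode("utf-8")], dtype=np.object_),
            )
            score = pb_utils.Tensor(
                "SCORE", np.array([result["score"]], dtype=np.float32)
            )
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[label, score])
            )
        return responses

    def finalize(self):
        self.pipe = None
```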
Sounds good to me, thanks! I will let @osanseviero and @philschmid chime in as well.
Thanks, @sayakpaul
@philschmid @osanseviero Do let me know your thoughts
I would be interested too in the tutorial. Nice idea @nickaggarwal
Glad to know @dverdu-freepik
Team, should I submit the tutorial blog here?
cc @sayakpaul
@osanseviero a gentle ping.
That would indeed be awesome! Just stumbled upon your blog post "StackLLaMA: A hands-on guide to train LLaMA with RLHF" and the demo on that page. How did you do the deployment in that particular post? It's incredibly fast... Any information would be very welcome :)
+1 the docs for this should ideally be much clearer
Thanks, folks! @osanseviero - a gentle reminder! I would love to contribute this tutorial.
@nickaggarwal Can you also cover streaming the text output in your tutorial?
Hi there! Thanks a lot for the proposal!
We discussed this with the team, and we're not sure the blog will be the best place for this. This is more like a production guide for very specific hardware/use cases. We've had some blog posts like this in the past, but we realized that they didn't have good visibility for the amount of effort behind them. There are likely better venues to expose this kind of content to the community and we're always happy to amplify it!
I think this is Triton specific information, and would be covered best by the Triton team. Does this Tutorial https://github.com/triton-inference-server/tutorials/tree/main/HuggingFace cover what you need to know?
That's useful.
Maybe we should discuss finding a simple way to convert transformers models to TensorRT.
Hi @nickaggarwal Do you have any resources on how to package and deploy Hugging Face into Nvidia Triton Inference? Many thanks!
Yes, some ;)
Thanks, folks, for showing interest in the tutorial. We ended up publishing it on our blog. You can access it here: https://www.inferless.com/learn/nvidia-triton-inference-inferless
I used conda-pack to package the dependencies for the Triton server.
conda create -k -y -n hf-sentiment python=3.10
conda activate hf-sentiment
pip install numpy conda-pack
pip install torch==1.13.1
pip install transformers==4.21.3
conda install -c conda-forge gcc=12.1.0  # optional: fixes the "version `GLIBCXX_3.4.30' not found" error on the Triton server
conda pack -o hf-sentiment.tar.gz
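The packed tarball then gets referenced from the model's config.pbtxt via the Python backend's EXECUTION_ENV_PATH parameter. A sketch of such a config is below; the model name, tensors, and shapes are assumptions matching a simple sentiment setup, not taken from the linked blog post:

```
# config.pbtxt -- sketch only; name, tensors, and shapes are assumptions.
name: "hf-sentiment"
backend: "python"
max_batch_size: 0

input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "LABEL"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "SCORE"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]

# Point the Python backend at the conda-pack tarball produced above.
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: { string_value: "$$TRITON_MODEL_DIRECTORY/hf-sentiment.tar.gz" }
}
```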
Here is a complete example of running a Hugging Face sentiment model (cardiffnlp/twitter-roberta-base-sentiment-latest). Code: https://github.com/satendrakumar/huggingface-triton-server Blog: https://satendrakumar.in/2023/08/07/deploying-hugging-face-model-on-nvidia-triton-inference-server/
Hi, I followed the same tutorial to deploy an ASR model with a language-model processor. It's a Telugu model and it runs fine everywhere. Inside the Docker container, if I open a Python REPL, the processor loads without any error, but when I try to load it on the Triton server I get this error: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128). I tried setting the encoding to utf-8 in the Python file as well, but it doesn't work. I followed the python_vit tutorial, where the model and processor are simply used in the Python file without exporting to ONNX. Can you please provide some guidance on what changes I should make?
Hi @Jank14
Seems like an issue with the input/output. Make sure the text you are sending to the endpoint is base64-encoded. If you can share a sample input, happy to help.
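If it helps, here is a hypothetical client sketch along those lines, using tritonclient to ship base64-encoded audio as a BYTES tensor; the model and tensor names are assumptions, not from your setup:

```python
# Hypothetical client sketch: send base64-encoded audio to a Triton endpoint
# served by a Python-backend model. The model name ("asr") and tensor names
# ("AUDIO_B64", "TRANSCRIPT") are assumptions, not from this thread.
import base64

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Read raw audio bytes and base64-encode them so they travel safely as a
# BYTES tensor, independent of locale/encoding settings on either side.
with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read())

inp = httpclient.InferInput("AUDIO_B64", [1], "BYTES")
inp.set_data_from_numpy(np.array([audio_b64], dtype=np.object_))

result = client.infer(model_name="asr", inputs=[inp])

# The server-side model.py would base64-decode the tensor, run the ASR
# pipeline, and return the transcript as a BYTES output tensor.
print(result.as_numpy("TRANSCRIPT")[0].decode("utf-8"))
```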
@nickaggarwal would you be interested in authoring a guest post like this https://huggingface.co/blog/mlabonne/merge-models?
Just checking with @osanseviero -- it should be okay no?
@sayakpaul Sure, we would love to do this for Nvidia Triton, with ensemble models.
> Hi @Jank14
> Seems like an issue with the input/output. Make sure the text you are sending to the endpoint is base64-encoded. If you can share a sample input, happy to help.
Hey, actually it was an issue with the Triton server not recognising the Telugu tokens. Got it resolved by running export PYTHONIOENCODING="utf-8" and apt-get install locales && locale-gen en_US.UTF-8. Thanks!
@Jank14 hey, do you mind sharing your ASR Triton inference code?
I am pretty much stuck on understanding how you got the ASR model to work with the config file.