How to package Hugging Face into Nvidia Triton Inference Server for deployment

Open · nickaggarwal opened this issue on Mar 28, 2023 · 25 comments

I was recently deploying Hugging Face models on the Triton Inference Server, which helped me increase my GPU utilization and serve multiple models from a single GPU.

I was not able to find good resources during the process.

@sayakpaul

nickaggarwal avatar Mar 28 '23 03:03 nickaggarwal

@nickaggarwal thank you so much!

In order for us to better understand this, could you provide an outline of the things that you want to cover in the tutorial?

Also cc: @osanseviero @philschmid

sayakpaul avatar Mar 28 '23 03:03 sayakpaul

Hi @sayakpaul

The tutorial would entail "how to take models from Hugging Face, a machine learning library, and package them into Nvidia Triton, an open-source inference serving software." It would be a detailed four-step tutorial:

  1. Getting Started with Hugging Face
  2. Deploying a Hugging Face model on Nvidia Triton
  3. Deploying Triton Inference containers in Kubernetes
  4. Efficient utilization of GPUs

Within the tutorial I will also cover how to package and push files to the Triton model repository, how to use a Hugging Face pipeline with the template method to deploy the model, etc.
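
For context while the tutorial is in progress: Triton discovers models from a repository directory containing one folder per model, with a numbered version subdirectory and a config.pbtxt. A minimal sketch for a Python-backend Hugging Face model follows; all names here are illustrative, not taken from the final tutorial:

model_repository/
└── hf_sentiment/
    ├── config.pbtxt
    └── 1/
        └── model.py

name: "hf_sentiment"
backend: "python"
max_batch_size: 8
input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "LABEL"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
instance_group [ { kind: KIND_GPU } ]

Triton is then pointed at the repository with tritonserver --model-repository=/path/to/model_repository.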

nickaggarwal avatar Mar 28 '23 04:03 nickaggarwal

Sounds good to me, thanks! I will let @osanseviero and @philschmid chime in as well.

sayakpaul avatar Mar 28 '23 06:03 sayakpaul

Thanks, @sayakpaul

@philschmid @osanseviero Do let me know your thoughts

nickaggarwal avatar Mar 29 '23 03:03 nickaggarwal

I would be interested in the tutorial too. Nice idea, @nickaggarwal!

ghost avatar Mar 30 '23 10:03 ghost

Glad to hear it, @dverdu-freepik.

Team, should I submit the tutorial blog here?

cc @sayakpaul

nickaggarwal avatar Apr 03 '23 04:04 nickaggarwal

@osanseviero a gentle ping.

sayakpaul avatar Apr 05 '23 05:04 sayakpaul

That would indeed be awesome! I just stumbled upon your blog post "StackLLaMA: A hands-on guide to train LLaMA with RLHF" and the demo on that page. How did you do the deployment in that particular post? It's incredibly fast... Any information would be very welcome :)

alexanderfrey avatar Apr 07 '23 08:04 alexanderfrey

+1, the docs for this should ideally be much clearer.

ankit-db avatar Apr 08 '23 19:04 ankit-db

Thanks, folks! @osanseviero, a gentle reminder! I would love to contribute this tutorial.

nickaggarwal avatar Apr 10 '23 04:04 nickaggarwal

@nickaggarwal Can you also add text streaming for the text output to your tutorial?

MohamedAliRashad avatar Apr 11 '23 02:04 MohamedAliRashad

Hi there! Thanks a lot for the proposal!

We discussed this with the team, and we're not sure the blog will be the best place for this. This is more like a production guide for very specific hardware/use cases. We've had some blog posts like this in the past, but we realized that they didn't have good visibility for the amount of effort behind them. There are likely better venues to expose this kind of content to the community and we're always happy to amplify it!

osanseviero avatar Apr 11 '23 15:04 osanseviero

I think this is Triton-specific information and would be best covered by the Triton team. Does this tutorial https://github.com/triton-inference-server/tutorials/tree/main/HuggingFace cover what you need to know?

timeleft-- avatar Apr 13 '23 19:04 timeleft--

I think this is Triton-specific information and would be best covered by the Triton team. Does this tutorial https://github.com/triton-inference-server/tutorials/tree/main/HuggingFace cover what you need to know?

That's useful.

alexw994 avatar Jul 06 '23 07:07 alexw994

Maybe we should also discuss finding a simple way to convert transformers models to TensorRT.
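
One common route (though not the only one) is to export the model to ONNX first and then compile the ONNX graph with TensorRT. A minimal sketch of the export step, using distilbert-base-uncased-finetuned-sst-2-english purely as a stand-in model:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

# Trace the model with a dummy input and export it to ONNX
# with dynamic batch and sequence axes.
dummy = tokenizer("hello world", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=14,
)

The resulting model.onnx can then be compiled into a TensorRT engine, for example with trtexec --onnx=model.onnx --saveEngine=model.plan, and served by Triton's tensorrt backend, or served directly from ONNX via the onnxruntime backend.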

alexw994 avatar Jul 06 '23 08:07 alexw994

Hi @nickaggarwal Do you have any resources on how to package and deploy Hugging Face models into the Nvidia Triton Inference Server? Many thanks!

n-imas avatar Jul 18 '23 10:07 n-imas

Yes, some ;)

vzip avatar Jul 19 '23 02:07 vzip

Thanks, folks, for showing interest in the tutorial. We ended up publishing it on our blog. You can access it here: https://www.inferless.com/learn/nvidia-triton-inference-inferless

nickaggarwal avatar Jul 19 '23 23:07 nickaggarwal

I used conda-pack to package the dependencies for the Triton server.

conda create -k -y -n hf-sentiment python=3.10
conda activate hf-sentiment
pip install numpy conda-pack
pip install torch==1.13.1
pip install transformers==4.21.3
# Optional: only needed if Triton fails with "version `GLIBCXX_3.4.30' not found"
conda install -c conda-forge gcc=12.1.0
conda pack -o hf-sentiment.tar.gz

Here is a complete example of a running Hugging Face sentiment model (cardiffnlp/twitter-roberta-base-sentiment-latest). Code: https://github.com/satendrakumar/huggingface-triton-server Blog: https://satendrakumar.in/2023/08/07/deploying-hugging-face-model-on-nvidia-triton-inference-server/
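
For readers who want the shape of it without clicking through: the core of such a deployment is a model.py for Triton's Python backend. A minimal sketch along those lines (treat it as illustrative rather than the exact code from the repo; tensor names must match your config.pbtxt):

import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import pipeline


class TritonPythonModel:
    def initialize(self, args):
        # Load the Hugging Face pipeline once, when Triton starts the model instance.
        self.pipe = pipeline(
            "sentiment-analysis",
            model="cardiffnlp/twitter-roberta-base-sentiment-latest",
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            # "TEXT" is the input name declared in config.pbtxt; strings arrive as bytes.
            texts = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            texts = [t.decode("utf-8") for t in texts.reshape(-1)]
            labels = [r["label"].encode("utf-8") for r in self.pipe(texts)]
            out = pb_utils.Tensor("LABEL", np.array(labels, dtype=object).reshape(-1, 1))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses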

satendrakumar avatar Aug 08 '23 05:08 satendrakumar

Hi, I followed the same tutorial to deploy an ASR model with a language-model processor. It's a Telugu model. It runs fine everywhere else, and if I open a Python REPL inside the Docker container the processor loads without any error, but when I try launching the Triton server I get this Unicode error: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128). I tried setting the encoding to UTF-8 in the Python file as well, but it doesn't work. I followed the python_vit tutorial, where the model and processor are simply used in the Python file without exporting to ONNX. Can you please provide some guidance on what changes I should make?

Jank14 avatar Feb 25 '24 06:02 Jank14

Hi @Jank14

Seems like an issue with the input/output encoding. Make sure the input you are sending to the endpoint is base64-encoded. If you can share a sample input, I'm happy to help.
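
For illustration, a client along those lines might look like the sketch below; the model name and tensor names are hypothetical, and it assumes the server-side model.py base64-decodes the payload again:

import base64

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Read the raw audio and base64-encode it so it survives the BYTES tensor round-trip.
with open("sample_telugu.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read())

inp = httpclient.InferInput("AUDIO", [1, 1], "BYTES")
inp.set_data_from_numpy(np.array([[audio_b64]], dtype=object))

result = client.infer("asr_model", inputs=[inp])
print(result.as_numpy("TRANSCRIPT"))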

nickaggarwal avatar Feb 25 '24 22:02 nickaggarwal

@nickaggarwal would you be interested in authoring a guest post like this https://huggingface.co/blog/mlabonne/merge-models?

Just checking with @osanseviero -- it should be okay no?

sayakpaul avatar Feb 26 '24 02:02 sayakpaul

@nickaggarwal would you be interested in authoring a guest post like this https://huggingface.co/blog/mlabonne/merge-models?

Just checking with @osanseviero -- it should be okay no?

@sayakpaul Sure, we would love to do this for Nvidia Triton, with ensemble models.
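
For readers unfamiliar with them: a Triton ensemble chains several models inside the server through an ensemble config, so that, for example, tokenization and classification run as separate steps without a network round-trip in between. A minimal sketch with hypothetical model and tensor names:

name: "hf_ensemble"
platform: "ensemble"
max_batch_size: 8
input [ { name: "TEXT", data_type: TYPE_STRING, dims: [ 1 ] } ]
output [ { name: "LABEL", data_type: TYPE_STRING, dims: [ 1 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      model_version: -1
      input_map { key: "TEXT" value: "TEXT" }
      output_map { key: "IDS" value: "token_ids" }
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map { key: "IDS" value: "token_ids" }
      output_map { key: "LABEL" value: "LABEL" }
    }
  ]
}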

nickaggarwal avatar Feb 26 '24 04:02 nickaggarwal

Hi @Jank14

Seems like an issue with the input/output encoding. Make sure the input you are sending to the endpoint is base64-encoded. If you can share a sample input, I'm happy to help.

Hey, it was actually an issue with the Triton server not recognising the Telugu tokens. I got it resolved by running export PYTHONIOENCODING="utf-8" and apt-get install locales && locale-gen en_US.UTF-8. Thanks!
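
If anyone else hits this: to bake that fix into the serving image instead of setting it by hand, a Dockerfile along these lines should work (the base-image tag is just an example):

FROM nvcr.io/nvidia/tritonserver:23.08-py3
# Generate a UTF-8 locale so non-ASCII tokenizer/vocabulary files load correctly.
RUN apt-get update && apt-get install -y locales && locale-gen en_US.UTF-8
ENV LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 PYTHONIOENCODING=utf-8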

Jank14 avatar Feb 27 '24 05:02 Jank14

@Jank14 hey, do you mind sharing your ASR Triton inference code?

I am pretty much stuck on understanding how you converted the ASR model to work with the config file.

StephennFernandes avatar Mar 25 '24 18:03 StephennFernandes