How to package Hugging Face into Nvidia Triton Inference Server for deployment

Open · nickaggarwal opened this issue on Mar 28, 2023 · 25 comments

I was recently deploying Hugging Face models on the Triton Inference Server, which helped me increase my GPU utilization and serve multiple models from a single GPU.

I was not able to find good resources during the process.

@sayakpaul

nickaggarwal avatar Mar 28 '23 03:03 nickaggarwal

@nickaggarwal thank you so much!

In order for us to better understand this, could you provide an outline of the things that you want to cover in the tutorial?

Also cc: @osanseviero @philschmid

sayakpaul avatar Mar 28 '23 03:03 sayakpaul

Hi @sayakpaul

The tutorial would entail "how to take models from Hugging Face, a machine learning library, and package them into Nvidia Triton, an open-source inference serving software." It would be a detailed four-step tutorial:

  1. Getting Started with Hugging Face
  2. Deploying a Hugging Face model on Nvidia Triton
  3. Deploying Triton Inference containers in Kubernetes
  4. Efficient utilization of GPUs

Within the tutorial I will also cover how to package and push files to the Triton model repository, how to use a Hugging Face pipeline with the template method to deploy the model, etc.
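
For context while the tutorial is in progress: Triton discovers models from a repository directory containing one folder per model, with a numbered version subdirectory and a config.pbtxt. A minimal sketch for a Python-backend Hugging Face model follows; all names here are illustrative, not taken from the final tutorial:

model_repository/
└── hf_sentiment/
    ├── config.pbtxt
    └── 1/
        └── model.py

name: "hf_sentiment"
backend: "python"
max_batch_size: 8
input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "LABEL"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
instance_group [ { kind: KIND_GPU } ]

Triton is then pointed at the repository with tritonserver --model-repository=/path/to/model_repository.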

nickaggarwal avatar Mar 28 '23 04:03 nickaggarwal

Sounds good to me, thanks! I will let @osanseviero and @philschmid chime in as well.

sayakpaul avatar Mar 28 '23 06:03 sayakpaul

Thanks, @sayakpaul

@philschmid @osanseviero Do let me know your thoughts

nickaggarwal avatar Mar 29 '23 03:03 nickaggarwal

I would be interested in the tutorial too. Nice idea, @nickaggarwal!

ghost avatar Mar 30 '23 10:03 ghost

Glad to hear it, @dverdu-freepik.

Team, should I submit the tutorial blog here?

cc @sayakpaul

nickaggarwal avatar Apr 03 '23 04:04 nickaggarwal

@osanseviero a gentle ping.

sayakpaul avatar Apr 05 '23 05:04 sayakpaul

That would indeed be awesome! I just stumbled upon your blog post "StackLLaMA: A hands-on guide to train LLaMA with RLHF" and the demo on that page. How did you do the deployment in that particular post? It's incredibly fast... Any information would be very welcome :)

alexanderfrey avatar Apr 07 '23 08:04 alexanderfrey

+1, the docs for this should ideally be much clearer.

ankit-db avatar Apr 08 '23 19:04 ankit-db

Thanks, folks! @osanseviero, a gentle reminder! I would love to contribute this tutorial.

nickaggarwal avatar Apr 10 '23 04:04 nickaggarwal

@nickaggarwal Can you also add text streaming for the text output to your tutorial?

MohamedAliRashad avatar Apr 11 '23 02:04 MohamedAliRashad

Hi there! Thanks a lot for the proposal!

We discussed this with the team, and we're not sure the blog will be the best place for this. This is more like a production guide for very specific hardware/use cases. We've had some blog posts like this in the past, but we realized that they didn't have good visibility for the amount of effort behind them. There are likely better venues to expose this kind of content to the community and we're always happy to amplify it!

osanseviero avatar Apr 11 '23 15:04 osanseviero

I think this is Triton-specific information and would be best covered by the Triton team. Does this tutorial https://github.com/triton-inference-server/tutorials/tree/main/HuggingFace cover what you need to know?

timeleft-- avatar Apr 13 '23 19:04 timeleft--

I think this is Triton-specific information and would be best covered by the Triton team. Does this tutorial https://github.com/triton-inference-server/tutorials/tree/main/HuggingFace cover what you need to know?

That's useful.

alexw994 avatar Jul 06 '23 07:07 alexw994

Maybe we should also discuss finding a simple way to convert transformers models to TensorRT.
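
One common route (though not the only one) is to export the model to ONNX first and then compile the ONNX graph with TensorRT. A minimal sketch of the export step, using distilbert-base-uncased-finetuned-sst-2-english purely as a stand-in model:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

# Trace the model with a dummy input and export it to ONNX
# with dynamic batch and sequence axes.
dummy = tokenizer("hello world", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=14,
)

The resulting model.onnx can then be compiled into a TensorRT engine, for example with trtexec --onnx=model.onnx --saveEngine=model.plan, and served by Triton's tensorrt backend, or served directly from ONNX via the onnxruntime backend.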

alexw994 avatar Jul 06 '23 08:07 alexw994

Hi @nickaggarwal Do you have any resources on how to package and deploy Hugging Face models into the Nvidia Triton Inference Server? Many thanks!

n-imas avatar Jul 18 '23 10:07 n-imas

Yes, some ;)

vzip avatar Jul 19 '23 02:07 vzip

Thanks, folks, for showing interest in the tutorial. We ended up publishing it on our blog. You can access it here: https://www.inferless.com/learn/nvidia-triton-inference-inferless

nickaggarwal avatar Jul 19 '23 23:07 nickaggarwal

I used conda-pack to package the dependencies for the Triton server.

conda create -k -y -n hf-sentiment python=3.10
conda activate hf-sentiment
pip install numpy conda-pack
pip install torch==1.13.1
pip install transformers==4.21.3
# Optional: only needed if Triton fails with "version `GLIBCXX_3.4.30' not found"
conda install -c conda-forge gcc=12.1.0
conda pack -o hf-sentiment.tar.gz

Here is a complete example of a running Hugging Face sentiment model (cardiffnlp/twitter-roberta-base-sentiment-latest). Code: https://github.com/satendrakumar/huggingface-triton-server Blog: https://satendrakumar.in/2023/08/07/deploying-hugging-face-model-on-nvidia-triton-inference-server/
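
For readers who want the shape of it without clicking through: the core of such a deployment is a model.py for Triton's Python backend. A minimal sketch along those lines (treat it as illustrative rather than the exact code from the repo; tensor names must match your config.pbtxt):

import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import pipeline


class TritonPythonModel:
    def initialize(self, args):
        # Load the Hugging Face pipeline once, when Triton starts the model instance.
        self.pipe = pipeline(
            "sentiment-analysis",
            model="cardiffnlp/twitter-roberta-base-sentiment-latest",
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            # "TEXT" is the input name declared in config.pbtxt; strings arrive as bytes.
            texts = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            texts = [t.decode("utf-8") for t in texts.reshape(-1)]
            labels = [r["label"].encode("utf-8") for r in self.pipe(texts)]
            out = pb_utils.Tensor("LABEL", np.array(labels, dtype=object).reshape(-1, 1))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses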

satendrakumar avatar Aug 08 '23 05:08 satendrakumar

Hi, I followed the same tutorial to deploy an ASR model with a language-model processor. It's a Telugu model. It runs fine everywhere else, and if I open a Python REPL inside the Docker container the processor loads without any error, but when I try launching the Triton server I get this Unicode error: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128). I tried setting the encoding to UTF-8 in the Python file as well, but it doesn't work. I followed the python_vit tutorial, where the model and processor are simply used in the Python file without exporting to ONNX. Can you please provide some guidance on what changes I should make?

Jank14 avatar Feb 25 '24 06:02 Jank14

Hi @Jank14

Seems like an issue with the input/output encoding. Make sure the input you are sending to the endpoint is base64-encoded. If you can share a sample input, I'm happy to help.
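
For illustration, a client along those lines might look like the sketch below; the model name and tensor names are hypothetical, and it assumes the server-side model.py base64-decodes the payload again:

import base64

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Read the raw audio and base64-encode it so it survives the BYTES tensor round-trip.
with open("sample_telugu.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read())

inp = httpclient.InferInput("AUDIO", [1, 1], "BYTES")
inp.set_data_from_numpy(np.array([[audio_b64]], dtype=object))

result = client.infer("asr_model", inputs=[inp])
print(result.as_numpy("TRANSCRIPT"))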

nickaggarwal avatar Feb 25 '24 22:02 nickaggarwal

@nickaggarwal would you be interested in authoring a guest post like this https://huggingface.co/blog/mlabonne/merge-models?

Just checking with @osanseviero -- it should be okay no?

sayakpaul avatar Feb 26 '24 02:02 sayakpaul

@nickaggarwal would you be interested in authoring a guest post like this https://huggingface.co/blog/mlabonne/merge-models?

Just checking with @osanseviero -- it should be okay no?

@sayakpaul Sure, we would love to do this for Nvidia Triton, with ensemble models.
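
For readers unfamiliar with them: a Triton ensemble chains several models inside the server through an ensemble config, so that, for example, tokenization and classification run as separate steps without a network round-trip in between. A minimal sketch with hypothetical model and tensor names:

name: "hf_ensemble"
platform: "ensemble"
max_batch_size: 8
input [ { name: "TEXT", data_type: TYPE_STRING, dims: [ 1 ] } ]
output [ { name: "LABEL", data_type: TYPE_STRING, dims: [ 1 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      model_version: -1
      input_map { key: "TEXT" value: "TEXT" }
      output_map { key: "IDS" value: "token_ids" }
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map { key: "IDS" value: "token_ids" }
      output_map { key: "LABEL" value: "LABEL" }
    }
  ]
}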

nickaggarwal avatar Feb 26 '24 04:02 nickaggarwal

Hi @Jank14

Seems like an issue with the input/output encoding. Make sure the input you are sending to the endpoint is base64-encoded. If you can share a sample input, I'm happy to help.

Hey, it was actually an issue with the Triton server not recognising the Telugu tokens. I got it resolved by running export PYTHONIOENCODING="utf-8" and apt-get install locales && locale-gen en_US.UTF-8. Thanks!
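
If anyone else hits this: to bake that fix into the serving image instead of setting it by hand, a Dockerfile along these lines should work (the base-image tag is just an example):

FROM nvcr.io/nvidia/tritonserver:23.08-py3
# Generate a UTF-8 locale so non-ASCII tokenizer/vocabulary files load correctly.
RUN apt-get update && apt-get install -y locales && locale-gen en_US.UTF-8
ENV LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 PYTHONIOENCODING=utf-8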

Jank14 avatar Feb 27 '24 05:02 Jank14

@Jank14 hey, do you mind sharing your ASR Triton inference code?

I am pretty much stuck on understanding how you converted the ASR model to work with the config file.

StephennFernandes avatar Mar 25 '24 18:03 StephennFernandes