epic: Jan supports TensorRT-LLM, Triton Server and Nvidia Professional/Datacenter-grade GPUs
Spec
https://www.notion.so/jan-ai/Triton-Server-and-TensorRT-LLM-Support-32b5101c595a49489cfe37680e769dd2?pvs=4
Issues
References:
- The PR for the extension that supports the NVIDIA Triton and TensorRT-LLM backend: https://github.com/janhq/jan/pull/888
- The steps to set up the NVIDIA Triton and TensorRT-LLM backend: https://github.com/hamelsmu/llama-inference/blob/master/triton-tensorRT-quantized-awq/README.md
- The steps to set up the NVIDIA Triton and vLLM backend: https://github.com/hamelsmu/llama-inference/blob/master/triton-vllm/README.md
P/S:
- Running Triton Inference Server with backends is the recommended way; however, we are open to trying anything. A minimal client sketch for querying such a server follows this list.
- The `.engine` file is compiled from a model in HF transformers format for a specific NVIDIA GPU and cannot be shared for use on another GPU (i.e. the compile-time and runtime GPU must be the same). A sketch of a guard for this constraint also follows the list.
- INT8 K-V cache can only be used on the H100 for a speedup. On other GPUs it makes model performance worse.
- For the setup script, Hiro has been able to run the steps manually one by one, but converting them into a bash script that installs `tensorrt_llm` automatically keeps erroring out.
- Please pay close attention to the `triton` and `tensorrt_llm` backends: they are fragile, have many breaking changes, and mostly fail with either a. errors installing `tensorrt_llm`, or b. errors converting the model to `.engine` files.
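To make the "Triton Inference Server with backends" recommendation concrete, here is a minimal client sketch using `tritonclient` over HTTP. The server address, the model name `ensemble`, and the tensor names `text_input` / `max_tokens` / `text_output` are assumptions based on the default tensorrtllm_backend example configs, not something defined in this epic; adjust them to match the deployed model repository.

```python
# Minimal sketch of querying a Triton server hosting a TensorRT-LLM model.
# Assumptions: server at localhost:8000, model name "ensemble", and tensor names
# "text_input" / "max_tokens" / "text_output"; these depend on the deployed config.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

prompt = np.array([["Why is the sky blue?"]], dtype=object)  # string tensor -> BYTES
max_tokens = np.array([[128]], dtype=np.int32)                # generation length

inputs = [
    httpclient.InferInput("text_input", list(prompt.shape), "BYTES"),
    httpclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
]
inputs[0].set_data_from_numpy(prompt)
inputs[1].set_data_from_numpy(max_tokens)

outputs = [httpclient.InferRequestedOutput("text_output")]

result = client.infer(model_name="ensemble", inputs=inputs, outputs=outputs)
print(result.as_numpy("text_output"))
```

Whatever wraps this in the Jan extension, the request shape stays the same: a string prompt tensor plus generation parameters in, a string output tensor back.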
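Because the `.engine` is tied to the GPU it was built on, it helps to fail fast when an engine is loaded on the wrong hardware. This is a minimal sketch, not part of PR #888; the metadata file and the helper names are hypothetical, and it assumes a CUDA-enabled PyTorch install just to read the device name.

```python
# Sketch (assumption, not from the Jan extension): record the GPU name when the
# .engine is built, then refuse to load the engine on a different GPU at runtime.
import json
import torch  # assumes a CUDA-enabled PyTorch install


def write_build_metadata(meta_path: str) -> None:
    """Store the name of the GPU the engine was compiled on (hypothetical helper)."""
    meta = {"build_gpu": torch.cuda.get_device_name(0)}
    with open(meta_path, "w") as f:
        json.dump(meta, f)


def check_runtime_gpu(meta_path: str) -> None:
    """Fail fast if the current GPU differs from the GPU the engine was built on."""
    with open(meta_path) as f:
        meta = json.load(f)
    current = torch.cuda.get_device_name(0)
    if current != meta["build_gpu"]:
        raise RuntimeError(
            f".engine was built on {meta['build_gpu']} but this machine has {current}; "
            "rebuild the engine for this GPU."
        )
```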
Status: Triton support is done; the TensorRT-LLM item is a duplicate.