epic: Jan supports TensorRT-LLM, Triton Server and Nvidia Professional/Datacenter-grade GPUs
Spec
https://www.notion.so/jan-ai/Triton-Server-and-TensorRT-LLM-Support-32b5101c595a49489cfe37680e769dd2?pvs=4
Issues
References:
- The PR for the extension that supports the NVIDIA Triton and TensorRT-LLM backend: https://github.com/janhq/jan/pull/888
- The steps to set up the NVIDIA Triton and TensorRT-LLM backend: https://github.com/hamelsmu/llama-inference/blob/master/triton-tensorRT-quantized-awq/README.md
- The steps to set up the NVIDIA Triton and vLLM backend: https://github.com/hamelsmu/llama-inference/blob/master/triton-vllm/README.md
P/S:
- Running Triton Inference Server with backends is the recommended way; however, we are open to trying anything. A minimal client sketch for querying such a server follows this list.
- The `.engine` file is compiled from a model in HF transformers format for a specific NVIDIA GPU and cannot be shared for use on another GPU (i.e. the compile-time and runtime GPU must be the same). A sketch of a guard for this constraint also follows the list.
- INT8 K-V cache can only be used on the H100 for a speedup. On other GPUs it makes model performance worse.
- For the setup script, Hiro has been able to run the steps manually one by one, but converting them into a bash script that installs `tensorrt_llm` automatically keeps erroring out.
- Please pay close attention to the `triton` and `tensorrt_llm` backends: they are fragile, have many breaking changes, and mostly fail with either a. errors installing `tensorrt_llm`, or b. errors converting the model to `.engine` files.
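To make the "Triton Inference Server with backends" recommendation concrete, here is a minimal client sketch using `tritonclient` over HTTP. The server address, the model name `ensemble`, and the tensor names `text_input` / `max_tokens` / `text_output` are assumptions based on the default tensorrtllm_backend example configs, not something defined in this epic; adjust them to match the deployed model repository.

```python
# Minimal sketch of querying a Triton server hosting a TensorRT-LLM model.
# Assumptions: server at localhost:8000, model name "ensemble", and tensor names
# "text_input" / "max_tokens" / "text_output"; these depend on the deployed config.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

prompt = np.array([["Why is the sky blue?"]], dtype=object)  # string tensor -> BYTES
max_tokens = np.array([[128]], dtype=np.int32)                # generation length

inputs = [
    httpclient.InferInput("text_input", list(prompt.shape), "BYTES"),
    httpclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
]
inputs[0].set_data_from_numpy(prompt)
inputs[1].set_data_from_numpy(max_tokens)

outputs = [httpclient.InferRequestedOutput("text_output")]

result = client.infer(model_name="ensemble", inputs=inputs, outputs=outputs)
print(result.as_numpy("text_output"))
```

Whatever wraps this in the Jan extension, the request shape stays the same: a string prompt tensor plus generation parameters in, a string output tensor back.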
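Because the `.engine` is tied to the GPU it was built on, it helps to fail fast when an engine is loaded on the wrong hardware. This is a minimal sketch, not part of PR #888; the metadata file and the helper names are hypothetical, and it assumes a CUDA-enabled PyTorch install just to read the device name.

```python
# Sketch (assumption, not from the Jan extension): record the GPU name when the
# .engine is built, then refuse to load the engine on a different GPU at runtime.
import json
import torch  # assumes a CUDA-enabled PyTorch install


def write_build_metadata(meta_path: str) -> None:
    """Store the name of the GPU the engine was compiled on (hypothetical helper)."""
    meta = {"build_gpu": torch.cuda.get_device_name(0)}
    with open(meta_path, "w") as f:
        json.dump(meta, f)


def check_runtime_gpu(meta_path: str) -> None:
    """Fail fast if the current GPU differs from the GPU the engine was built on."""
    with open(meta_path) as f:
        meta = json.load(f)
    current = torch.cuda.get_device_name(0)
    if current != meta["build_gpu"]:
        raise RuntimeError(
            f".engine was built on {meta['build_gpu']} but this machine has {current}; "
            "rebuild the engine for this GPU."
        )
```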
Status: Triton support is done; the TensorRT-LLM item is a duplicate.