tensorrtllm_backend
Added documentation on using warmups to initialize LoRA weights
This PR provides documentation for converting LoRA adapters from a Hugging Face checkpoint into a warmup request that can be used with the triton-inference-server TensorRT-LLM backend.
With this approach, clients of the triton-inference-server backend never need to supply the LoRA weights, and the weights do not have to be loaded by or passed through any of the Python backend models (e.g. preprocessing). This avoids the numpy datatype conversion in those models, which does not support bfloat16.
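As a minimal sketch of working around the missing numpy bfloat16 dtype, the adapter weights can be truncated from float32 to raw bfloat16 bytes by hand (bfloat16 is the top 16 bits of the float32 representation) and written to a raw binary file that a Triton warmup `input_data_file` can reference. The tensor shape, file name, and weight layout below are illustrative assumptions, not the exact TRT-LLM LoRA format:

```python
import numpy as np

def float32_to_bfloat16_bytes(arr: np.ndarray) -> bytes:
    """Truncate float32 values to raw bfloat16 bytes (numpy has no bfloat16 dtype)."""
    u32 = np.ascontiguousarray(arr, dtype=np.float32).view(np.uint32)
    # bfloat16 keeps the top 16 bits of each float32 value (round-toward-zero)
    return (u32 >> 16).astype(np.uint16).tobytes()

# Hypothetical flattened LoRA weight tensor; the real layout is defined by
# the TensorRT-LLM LoRA format, not by this sketch.
lora_weights = np.random.rand(8, 16).astype(np.float32)
raw = float32_to_bfloat16_bytes(lora_weights)

# Triton reads warmup tensors as raw bytes from files placed in the model's
# warmup directory and referenced via input_data_file in config.pbtxt.
with open("lora_weights_warmup.bin", "wb") as f:
    f.write(raw)
```

The resulting file would then be referenced from a `model_warmup` entry in the model's `config.pbtxt`, so the server initializes the LoRA cache at startup instead of requiring clients to send the weights.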