
Added documentation on using warmups to initialize LoRA weights

Open TheCodeWrangler opened this issue 7 months ago • 2 comments

This PR provides documentation for converting LoRA adapters from a Hugging Face checkpoint into a warmup that can be used with the Triton Inference Server TensorRT-LLM backend.
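For context, Triton model warmup is declared in the model's `config.pbtxt` via the `model_warmup` field, with raw tensor bytes supplied through `input_data_file` entries that Triton reads from the model's `warmup/` directory. A hedged sketch of what such a stanza might look like for a LoRA warmup request (the tensor names, dims, and file name here are illustrative assumptions, not taken from this PR):

```proto
# Hypothetical warmup stanza for a LoRA adapter; dims and tensor
# names are placeholders -- consult the backend's docs for the
# exact lora_weights / lora_config shapes your model expects.
model_warmup [
  {
    name: "lora_adapter_warmup"
    batch_size: 1
    inputs: {
      key: "lora_weights"
      value: {
        data_type: TYPE_BF16
        dims: [ 1, 32, 1048576 ]
        input_data_file: "lora_weights"
      }
    }
  }
]
```

The `input_data_file` value is resolved relative to `<model_repository>/<model>/warmup/`, so the serialized adapter bytes are shipped alongside the model configuration rather than sent by any client.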

With this approach, clients of the Triton Inference Server backend never need to supply the LoRA weights, and the weights do not have to be loaded into or passed between any of the Python backend models (e.g. preprocessing). This avoids the numpy datatype conversion, which does not support bfloat16.
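Since numpy has no native bfloat16 dtype, one way to produce the raw bytes for a warmup `input_data_file` is to truncate float32 weights to bfloat16 by keeping the upper 16 bits of each float32 word. The sketch below is illustrative, not the PR's actual script: the adapter matrices, shapes, and output file name are stand-in assumptions.

```python
import numpy as np
from pathlib import Path

def float32_to_bfloat16_bytes(arr: np.ndarray) -> bytes:
    """Truncate float32 values to bfloat16 and return the raw bytes.

    numpy lacks a bfloat16 dtype, so we view each float32 as uint32
    and keep its upper 16 bits (round-toward-zero truncation).
    """
    u32 = np.ascontiguousarray(arr, dtype=np.float32).view(np.uint32)
    return (u32 >> 16).astype(np.uint16).tobytes()

# Stand-ins for LoRA A/B matrices that would be loaded from a
# Hugging Face checkpoint (e.g. with safetensors); shapes are
# illustrative only.
rank, hidden = 8, 64
lora_A = np.random.default_rng(0).standard_normal((rank, hidden), dtype=np.float32)
lora_B = np.random.default_rng(1).standard_normal((hidden, rank), dtype=np.float32)

# Flatten the in/out weights into a single vector, since the warmup
# file is just a raw byte blob matching the declared tensor dims.
flat = np.concatenate([lora_A.ravel(), lora_B.ravel()])

# Write the bytes where Triton's warmup `input_data_file` looks:
# <model_repository>/<model>/warmup/<file>.
warmup_dir = Path("warmup")
warmup_dir.mkdir(exist_ok=True)
(warmup_dir / "lora_weights").write_bytes(float32_to_bfloat16_bytes(flat))
```

Because the bytes are written once at model-preparation time, the bfloat16 conversion never has to happen inside a Python backend model on the request path.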

TheCodeWrangler · Jun 27 '24 22:06