FEAT: Support TensorRT-LLM backend
Support TensorRT-LLM backend.
- [ ] Implement `TRTModel` with a `generate` method.
- [ ] Expose `launch_trt_model` to the client.
- [ ] Docs and an example.
A Python API for in-flight batching is needed for this PR; the TensorRT-LLM team says it will be implemented in future versions.
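A minimal sketch of what the proposed interface might look like. Only the names `TRTModel`, `generate`, and `launch_trt_model` come from the checklist above; the signatures, the stub implementation, and the launcher logic are illustrative assumptions, not the actual design (a real backend would call into the TensorRT-LLM runtime):

```python
from abc import ABC, abstractmethod


class TRTModel(ABC):
    """Proposed wrapper around a TensorRT-LLM engine (interface sketch only)."""

    @abstractmethod
    def generate(self, prompt: str, max_new_tokens: int = 128) -> str:
        """Run text generation on the underlying engine."""


class EchoTRTModel(TRTModel):
    # Placeholder used here only to make the sketch runnable; a real
    # implementation would invoke the TensorRT-LLM runtime instead.
    def generate(self, prompt: str, max_new_tokens: int = 128) -> str:
        return prompt + " [generated]"


def launch_trt_model(model_cls=EchoTRTModel) -> TRTModel:
    # Hypothetical client-side entry point named after the checklist item;
    # the real function would handle model download, engine build, etc.
    return model_cls()


model = launch_trt_model()
print(model.generate("Hello"))  # prints "Hello [generated]"
```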