TensorRT-LLM
[feature request] logits processor performance issue
In the current TensorRT-LLM architecture, the logits processor (for both the executor and the batch_manager) is invoked independently for each request.
This per-request design is bad for performance.
For example, at a throughput of 5000 tokens/s without custom logits processors, the system generates a token roughly every 200 µs.
Adding a custom logits processor to each request increases per-token generation latency by about 15-20 µs (depending on the complexity of the processor).
This leads to an overall system throughput drop of roughly 10%.
This overhead could be avoided by allowing logits processors to operate on the whole batch of requests at once.
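To illustrate the idea (not TensorRT-LLM's actual API), here is a minimal NumPy sketch contrasting a per-request processor, called once per request row, with a batched processor that applies the same masking to all rows in one vectorized call. The function names and the banned-token masking logic are hypothetical examples chosen for illustration:

```python
import numpy as np

def per_request_processor(logits_row, banned_ids):
    # Current style: invoked independently for each request,
    # so N requests mean N separate Python-level calls per step.
    out = logits_row.copy()
    out[banned_ids] = -np.inf
    return out

def batched_processor(logits, banned_ids):
    # Proposed style: invoked once per generation step for the
    # whole batch; a single vectorized write replaces N calls.
    out = logits.copy()
    out[:, banned_ids] = -np.inf
    return out

batch_size, vocab_size = 4, 8
logits = np.random.rand(batch_size, vocab_size).astype(np.float32)
banned = [2, 5]

# Both paths produce identical logits; only the call overhead differs.
per_req = np.stack([per_request_processor(row, banned) for row in logits])
batched = batched_processor(logits, banned)
assert np.array_equal(per_req, batched)
```

The results are identical; the batched form simply amortizes the fixed per-call overhead (Python dispatch, kernel launches) across the whole batch instead of paying it once per request.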