TensorRT-LLM

[feature request] logits processor performance issue

Open akhoroshev opened this issue 9 months ago • 0 comments

The current architecture of TensorRT-LLM means that the logits processor (for both executor and batch_manager) is called independently for each request.

This is a poor approach in terms of performance.

For example, if I get a throughput of 5000 tokens/s without custom logits processors, that is equivalent to the system generating one token every 200 µs.

But if I add a custom logits processor to each request, the time per generated token increases by about 15-20 µs (depending on the complexity of the logits processor).

This leads to a drop in overall system performance of about 10%.
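The arithmetic behind these numbers can be checked with a small sketch. It assumes the extra 15-20 µs is simply added to the 200 µs per-token time; the function name `throughput_with_overhead` is made up for illustration and is not part of TensorRT-LLM.

```python
def throughput_with_overhead(base_tokens_per_s: float, overhead_us: float) -> float:
    """Return tokens/s after adding a fixed per-token overhead in microseconds."""
    base_us_per_token = 1e6 / base_tokens_per_s   # 5000 tokens/s -> 200 µs per token
    new_us_per_token = base_us_per_token + overhead_us
    return 1e6 / new_us_per_token


BASE_TOKENS_PER_S = 5000.0

for overhead_us in (15.0, 20.0):
    new_tps = throughput_with_overhead(BASE_TOKENS_PER_S, overhead_us)
    drop = 1.0 - new_tps / BASE_TOKENS_PER_S
    print(f"+{overhead_us:.0f} µs per token: {new_tps:.0f} tokens/s ({drop:.1%} lower)")
```

Under that assumption, per-token time grows by 7.5-10% and throughput falls by roughly 7-9%, which is in line with the estimate above.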

This can be avoided by allowing the logits of all requests in a batch to be processed in a single call.
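To illustrate the difference in shape, here is a minimal pure-Python sketch of the two callback styles. All names in it (`step_per_request`, `step_batched`, `ban_token`, `ban_token_batched`) are hypothetical and are not the TensorRT-LLM API; plain lists stand in for logits tensors.

```python
import math
from typing import Callable, Dict, List

Logits = List[float]

# Hypothetical callback types, for illustration only.
PerRequestProcessor = Callable[[int, Logits], None]
BatchedProcessor = Callable[[List[int], List[Logits]], None]

BANNED_TOKEN = 2  # illustrative token id to suppress


def ban_token(request_id: int, logits: Logits) -> None:
    """Per-request style: sees the logits of one request per call."""
    logits[BANNED_TOKEN] = -math.inf


def ban_token_batched(request_ids: List[int], batch_logits: List[Logits]) -> None:
    """Batched style: sees the logits of every active request in one call."""
    for logits in batch_logits:
        logits[BANNED_TOKEN] = -math.inf


def step_per_request(active: Dict[int, Logits], processor: PerRequestProcessor) -> int:
    """Invoke the callback once per request; return the number of invocations."""
    calls = 0
    for request_id, logits in active.items():
        processor(request_id, logits)
        calls += 1
    return calls


def step_batched(active: Dict[int, Logits], processor: BatchedProcessor) -> int:
    """Invoke the callback once for the whole batch; return the number of invocations."""
    processor(list(active.keys()), list(active.values()))
    return 1


def make_batch() -> Dict[int, Logits]:
    return {7: [0.1, 0.2, 0.3, 0.4], 8: [1.0, 2.0, 3.0, 4.0], 9: [0.0, 0.0, 5.0, 0.0]}


per_request_batch = make_batch()
batched_batch = make_batch()
per_request_calls = step_per_request(per_request_batch, ban_token)
batched_calls = step_batched(batched_batch, ban_token_batched)
print(per_request_calls, batched_calls, per_request_batch == batched_batch)
```

Both variants produce the same logits, but the batched one pays the fixed per-call cost once per decoding step rather than once per request. In a real implementation the loop inside the batched callback would presumably be replaced by a single tensor operation over the whole batch.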

akhoroshev, May 27 '24 07:05