TensorRT-LLM
[Feature Request] Gather sparse logprobs
Hello team,
We typically use `gather_all_token_logits` to collect the logit tensors for post-processing. For large vocabulary sizes (e.g., 128,000) this can require a lot of GPU memory: when running inference with input and output lengths of 1024 each and a batch size of 32, the collected logit tensor requires roughly 32 GB of memory (fp32).
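For reference, the arithmetic behind that figure (plain Python, nothing TensorRT-LLM-specific; the variable names are ours):

```python
# Back-of-the-envelope footprint of gather_all_token_logits
# for the example above (illustrative arithmetic only).
batch_size = 32
seq_len = 1024 + 1024        # input + output tokens
vocab_size = 128_000
bytes_per_logit = 4          # fp32

total_bytes = batch_size * seq_len * vocab_size * bytes_per_logit
print(f"{total_bytes / 2**30:.2f} GiB")  # 31.25 GiB, i.e. the ~32 GB quoted above
```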
In vLLM it is possible to collect only the top-k logprobs (see here). This is much more memory efficient and would be sufficient for our purposes. Is there currently a way to do this in TensorRT-LLM as well? If not, we would really appreciate this feature in both `ModelRunner` and `ModelRunnerCpp`.
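To make the request concrete, here is a minimal sketch of the reduction we have in mind, written as PyTorch post-processing (the function name and signature are ours, not an existing TensorRT-LLM API); ideally the runner would perform this on-device before copying results back:

```python
import torch

def gather_topk_logprobs(logits: torch.Tensor, k: int = 5):
    """Reduce [batch, seq_len, vocab] logits to top-k logprobs.

    Hypothetical helper: this is the reduction we currently apply
    on the host after gather_all_token_logits.
    """
    # Normalize to log-probabilities over the vocabulary axis.
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    # Keep only the k largest entries per position.
    values, indices = torch.topk(logprobs, k, dim=-1)
    # values and indices both have shape [batch, seq_len, k].
    return values, indices
```

For the example above, the reduced output is 32 × 2048 × 5 entries of (fp32 value + int64 index), about 4 MB instead of ~32 GB.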
This issue is related to https://github.com/NVIDIA/TensorRT-LLM/issues/1040: if it were possible to collect arbitrary model outputs, we could solve this on our side.
Thank you