TensorRT-LLM
[Feature Request] Gather sparse logprobs
Hello team,
We typically use `gather_all_token_logits` to collect the logit tensors for post-processing. For large vocabulary sizes (e.g., 128,000) this can require a lot of GPU memory: when running inference with input and output lengths of 1024 each and a batch size of 32, the collected logit tensor requires roughly 32 GB of memory (fp32).
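For reference, the arithmetic behind that figure (plain Python, nothing TensorRT-LLM-specific; the variable names are ours):

```python
# Back-of-the-envelope footprint of gather_all_token_logits
# for the example above (illustrative arithmetic only).
batch_size = 32
seq_len = 1024 + 1024        # input + output tokens
vocab_size = 128_000
bytes_per_logit = 4          # fp32

total_bytes = batch_size * seq_len * vocab_size * bytes_per_logit
print(f"{total_bytes / 2**30:.2f} GiB")  # 31.25 GiB, i.e. the ~32 GB quoted above
```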
In vLLM it is possible to collect only the top-k logprobs (see here). This is much more memory efficient and would be sufficient for our purposes. Is there currently a way to do this in TensorRT-LLM as well? If not, we would really appreciate this feature in both `ModelRunner` and `ModelRunnerCpp`.
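To make the request concrete, here is a minimal sketch of the reduction we have in mind, written as PyTorch post-processing (the function name and signature are ours, not an existing TensorRT-LLM API); ideally the runner would perform this on-device before copying results back:

```python
import torch

def gather_topk_logprobs(logits: torch.Tensor, k: int = 5):
    """Reduce [batch, seq_len, vocab] logits to top-k logprobs.

    Hypothetical helper: this is the reduction we currently apply
    on the host after gather_all_token_logits.
    """
    # Normalize to log-probabilities over the vocabulary axis.
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    # Keep only the k largest entries per position.
    values, indices = torch.topk(logprobs, k, dim=-1)
    # values and indices both have shape [batch, seq_len, k].
    return values, indices
```

For the example above, the reduced output is 32 × 2048 × 5 entries of (fp32 value + int64 index), about 4 MB instead of ~32 GB.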
This issue is related to https://github.com/NVIDIA/TensorRT-LLM/issues/1040: if it were possible to collect arbitrary model outputs, we could solve this on our side.
Thank you