[FeatureRequest] Collecting additional model outputs in `GenerationSession`
Hello team,
We are training our GPT-style models to produce additional outputs besides the logits tensor. One use case for us is token-wise classification; another is attention scores from a particular head (that one is much more complex, I think). During inference we would like to collect this token-wise output, preferably with the `GenerationSession`.
We had a look at the how-to-debug tutorial and managed to access our additional model output in `GenerationSession.debug_buffer`. But of course this is a debug feature and probably nothing that can be used in production: for one, the debug buffer is written to the file system, and the debug output is not collected during generation.
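For context, this is roughly how we read our extra output today. This is a minimal sketch for illustration only: the tensor name `token_class_logits` is an assumption for our model, and a plain dict stands in for `GenerationSession.debug_buffer` (which, with debug mode enabled, maps tensor names to buffers).

```python
# Stand-in for GenerationSession.debug_buffer with debug mode enabled.
# The tensor name "token_class_logits" is hypothetical, specific to our model.
debug_buffer = {
    "logits": [[2.3, -1.1]],
    "token_class_logits": [[0.1, 0.9]],  # our extra per-token output
}

# The extra output can be read back by its tensor name, but only for the
# last forward pass -- nothing accumulates it across generation steps.
token_class_logits = debug_buffer.get("token_class_logits")
print(token_class_logits)
```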
So what we are actually looking for is a feature that collects additional model outputs in the `GenerationSession` and that works similarly to `gather_all_token_logits`. In other words: some kind of generalization to arbitrary/multiple outputs.
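To make the request concrete, here is a rough sketch of what such a generalization could look like. None of these names exist in TensorRT-LLM; `OutputCollector`, `on_step`, and the output names are all hypothetical, and plain lists stand in for tensors. The idea mirrors `gather_all_token_logits`: the session accumulates a named set of per-step outputs and hands them back after generation.

```python
# Hypothetical API sketch -- NOT part of TensorRT-LLM. Illustrates
# collecting arbitrary named per-step outputs during generation.
class OutputCollector:
    """Accumulates named model outputs, one entry per generation step."""

    def __init__(self, output_names):
        # e.g. ["token_class_logits", "attention_scores_head_3"]
        self.collected = {name: [] for name in output_names}

    def on_step(self, step_outputs):
        # step_outputs: dict of output name -> tensor for the current step;
        # the session would call this once per generated token.
        for name, buffers in self.collected.items():
            if name in step_outputs:
                buffers.append(step_outputs[name])

    def gather(self, name):
        # All steps' values for one output, in generation order.
        return self.collected[name]


# Usage: two simulated generation steps for a token-wise classifier head.
collector = OutputCollector(["token_class_logits"])
collector.on_step({"token_class_logits": [0.1, 0.9]})
collector.on_step({"token_class_logits": [0.7, 0.3]})
print(collector.gather("token_class_logits"))
```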
I hope I was able to describe this feature clearly. Please ask if you have further questions.
Is something like this on your roadmap? We would really appreciate it if you could consider this feature request.
This feature would also be very useful for `GptManager`. Related issues:
- https://github.com/NVIDIA/TensorRT-LLM/issues/595
- https://github.com/NVIDIA/TensorRT-LLM/issues/868