Pernekhan Utemuratov
Could you share a rough timeline for FP8 quantization support for the Mixtral (MoE) model? cc: @Tracin
trtllm crashes when I send long-context requests that are within the `max-input-length` limit. I believe this happens when the pending requests' total token count reaches the `max-num-tokens` limit. But why isn't it queuing requests...
**Is the feature request related to a problem?** Currently, there is no benchmarking for multi-turn conversations. Sometimes the assistant needs to ask for more information before calling the functions. For example:...
TensorRT-LLM has more stats for the KV cache, but the backend doesn't expose them. Can we add the missing ones in next week's commits?
```cpp
struct KvCacheStats {
    SizeType32 maxNumBlocks;
    SizeType32...
```