施庆章 comments

Results: 25 comments of 施庆章


Input tensor 'host_sink_token_length' not found when launching llama2-7b.

Same error here, +1. Tensorrtllm_backend: v0.8.0, model: llama7b.

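[Feature Request] Could you add test scripts that call these CUDA kernels from Python?

> When you call CUDA functions from Python, you may also be limited by Python's own performance and be unable to take full advantage of the high-performance operators. This is especially noticeable at batch=1: running a 7b model with the same kernels, system latency is about 17-18 ms when Python drives the CUDA calls, whereas PPL can get it under 12 ms.

Thanks for the reply. Indeed, Python spends a lot of time in the kernel launch step. Using cudagraph can speed this up, but the input has to be fixed.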

When will qwen1.5 be supported?

https://github.com/Tlntin/Qwen-TensorRT-LLM It seems to implement qwen2.

There was a strange computation error between standard attention and flash-attention2

> leizhao1234 hello. I had the same problem with bf16. Do you have a solution?

There was a strange computation error between standard attention and flash-attention2

> Are you referring to the bf16 computation error issue?

Yes. Once I multiplied the f_attn_mask by 100, I got the same computation error.
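
For readers following the bf16 discussion, here is a minimal sketch of how coarse bf16 rounding is and how the absolute rounding error grows when a value is scaled by 100. It is only an illustration of the number format, not a diagnosis of the flash-attention2 discrepancy reported in this thread. The helper names `to_bf16` and `bf16_abs_error` are hypothetical, and bf16 is emulated with the standard library (round-to-nearest-even on the top 16 bits of a float32) rather than taken from any framework.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (emulated).

    bfloat16 keeps the top 16 bits of an IEEE float32: 1 sign bit,
    8 exponent bits, 7 mantissa bits. NaN/inf handling is omitted.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, before dropping the low 16 bits.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_abs_error(x: float) -> float:
    """Absolute error introduced by storing x in (emulated) bfloat16."""
    return abs(to_bf16(x) - x)


if __name__ == "__main__":
    small = 1.2345          # hypothetical mask-sized value
    scaled = small * 100    # the same value multiplied by 100
    print(to_bf16(small), bf16_abs_error(small))
    print(to_bf16(scaled), bf16_abs_error(scaled))
```

With only 7 mantissa bits, the spacing between neighbouring bf16 values is 2^-7 in [1, 2) but 0.5 in [64, 128), so scaling an input up does not make the rounding error go away in absolute terms.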