enozhu


When we use the FP8 data type, we found that the FFN GEMM and attention projection support real FP8 compute (this is supported on H20 and L20), but Q * transpose(Key) and softmax * Value in attention don't...

Labels: question, stale
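
To make the split concrete, here is a minimal PyTorch sketch (my own illustration, not TensorRT-LLM code; `emulated_fp8_gemm` and `fp8_layer` are hypothetical names). It emulates FP8 e4m3 numerics only in the projection GEMMs and keeps the two attention batched matmuls (Q @ K^T and softmax @ V) in the original precision, which is the behavior the issue describes:

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def emulated_fp8_gemm(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Emulate a 'real FP8' GEMM: round both operands to e4m3, then matmul.
    A true FP8 kernel (Hopper/Ada tensor cores) multiplies in FP8 directly;
    this sketch only emulates the numerics to mark where quantization happens."""
    sx = x.abs().amax().clamp(min=1e-12) / E4M3_MAX  # per-tensor scales
    sw = w.abs().amax().clamp(min=1e-12) / E4M3_MAX
    xq = (x / sx).to(torch.float8_e4m3fn).to(x.dtype) * sx
    wq = (w / sw).to(torch.float8_e4m3fn).to(w.dtype) * sw
    return xq @ wq

def fp8_layer(x: torch.Tensor, w_qkv: torch.Tensor, w_out: torch.Tensor):
    # Projection GEMMs: the "FFN GEMM / attention projection" case from
    # the issue, where real FP8 compute is available.
    qkv = emulated_fp8_gemm(x, w_qkv)
    q, k, v = qkv.chunk(3, dim=-1)
    # BMM1 (Q @ K^T) and BMM2 (softmax @ V): kept in higher precision,
    # matching the observation that these matmuls lack FP8 support.
    scores = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    ctx = scores @ v
    return emulated_fp8_gemm(ctx, w_out)

# Usage: x = torch.randn(4, 16, 64, dtype=torch.float16)
#        y = fp8_layer(x, torch.randn(64, 192, dtype=torch.float16),
#                         torch.randn(64, 64, dtype=torch.float16))
```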

reference: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/quantization-in-TRT-LLM.md#performance

![image](https://github.com/user-attachments/assets/1bb20225-3eb2-4641-b5ba-f027e8bbddf2)

| Model | Batch Size | Speedup (FP8 vs. FP16) | Speedup (INT8 SQ vs. FP16) |
|---|---|---|---|
| GPT-J | 1 | 1.40x | 1.40x |
| GPT-J | 8 | 1.44x | 1.30x |
| LLaMA-v2-7B | 1 | 1.51x | 1.47x |
| LLaMA-v2-7B | 8 | 1.40x | ... |