cccpr

67 comments of cccpr

@Azure-Tang Is bf16 support done? Have you made a PR elsewhere?

@Tmn07 Does this PR also support per-channel w8 / per-token a8 (the most common quantization setting, as in SmoothQuant, etc.)?
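
For reference, a minimal sketch of what that setting means, assuming symmetric int8 quantization (helper names and shapes here are illustrative, not from the PR):

```python
import torch

def quantize_per_channel_w8(w: torch.Tensor):
    # Per-output-channel symmetric int8 quantization of a weight matrix
    # of shape (out_features, in_features): one scale per row.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    w_q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return w_q, scale

def quantize_per_token_a8(x: torch.Tensor):
    # Per-token symmetric int8 quantization of activations of shape
    # (num_tokens, in_features): one scale per token (row).
    scale = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    x_q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return x_q, scale

# The int8 GEMM accumulates in int32; the result is rescaled by the
# outer product of the token scales and the channel scales.
x, w = torch.randn(4, 64), torch.randn(128, 64)
x_q, sx = quantize_per_token_a8(x)
w_q, sw = quantize_per_channel_w8(w)
y = (x_q.to(torch.int32) @ w_q.to(torch.int32).T).float() * sx * sw.T
print((y - x @ w.T).abs().max())  # quantization-error sanity check
```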

I tried to use `pdb.set_trace()` (following the tutorial [here](https://triton-lang.org/main/programming-guide/chapter-3/debugging.html)) to debug the Triton kernel in lightllm, but got the following error: `AssertionError: Function "set_trace" is being called from a Triton function...
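
For anyone hitting the same assertion: as far as I can tell, `pdb` breakpoints only work when the kernel runs under Triton's CPU interpreter mode, which that tutorial enables via an environment variable. A minimal sketch (the kernel is a toy example, not lightllm's):

```python
import os
os.environ["TRITON_INTERPRET"] = "1"  # must be set before importing triton

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    # In interpreter mode you can now drop in:
    #   import pdb; pdb.set_trace()
    # For compiled (on-GPU) kernels, tl.device_print is the alternative:
    #   tl.device_print("x", x)
    tl.store(out_ptr + offs, x + y, mask=mask)

# Interpreter mode executes on the CPU, so plain CPU tensors work.
x, y, out = torch.randn(1024), torch.randn(1024), torch.empty(1024)
add_kernel[(4,)](x, y, out, 1024, BLOCK=256)
```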

@hiworldwzj May I ask whether lightllm's w8a8 Triton kernel has been benchmarked on llama for its speedup over fp16?
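
Lacking an official number, here is a rough sketch of how one could measure it with `triton.testing.do_bench`, using `torch._int_mm` as a stand-in for the actual w8a8 kernel (shapes are hypothetical, chosen to satisfy cuBLAS's int8 GEMM size constraints):

```python
import torch
import triton

# Hypothetical llama-like projection shapes; cuBLAS int8 GEMM needs
# M > 16 and K, N divisible by 8.
M, K, N = 32, 4096, 4096
a16 = torch.randn(M, K, device="cuda", dtype=torch.float16)
w16 = torch.randn(K, N, device="cuda", dtype=torch.float16)
a8 = torch.randint(-128, 128, (M, K), device="cuda", dtype=torch.int8)
w8 = torch.randint(-128, 128, (K, N), device="cuda", dtype=torch.int8)

ms_fp16 = triton.testing.do_bench(lambda: a16 @ w16)
ms_int8 = triton.testing.do_bench(lambda: torch._int_mm(a8, w8))
print(f"fp16: {ms_fp16:.3f} ms  int8: {ms_int8:.3f} ms")
```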

@gordicaleksa I only have an NVIDIA 4050 GPU (no A100 or V100); can the code run on my GPU?
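
For what it's worth, one quick sanity check is what the card itself reports (capability numbers in the comments are from NVIDIA's public specs):

```python
import torch

major, minor = torch.cuda.get_device_capability()
print(f"compute capability: {major}.{minor}")
vram_gb = torch.cuda.get_device_properties(0).total_memory / 2**30
print(f"VRAM: {vram_gb:.1f} GiB")
# V100 is sm70, A100 is sm80, and an RTX 4050 (Ada) reports 8.9, so a
# check like `capability >= (8, 0)` passes; the much smaller VRAM
# (6 GB on a laptop 4050 vs 40/80 GB on an A100) is usually the real limit.
```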

@HandH1998 I have tried the QQQ-w4a8-no-group version on InternVL-20B on my own task. The embarrassing thing is that, compared to w8a8, w4a8 is indeed faster at decoding as expected,...

@HandH1998 What do TTFT (ms) and TPOT (ms) actually mean in your chart?

@HandH1998 For `sq-w8a8` in your chart, which specific kernel are you referring to? In my experiments, I used the official w8a8 kernel from vLLM (CUTLASS backend).

> TPOT: Time Per decoding Output Token

Does **TPOT** already include the time for the first decoded token, or have you excluded the first-token time?
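
In other words, assuming the common convention where TPOT excludes the first token (the numbers below are made up purely for illustration):

```python
# TTFT: time from request arrival to the first generated token.
# TPOT (excluding the first token) would then be:
total_latency_ms = 2000.0    # hypothetical end-to-end latency
ttft_ms = 200.0              # hypothetical first-token latency
num_output_tokens = 100
tpot_ms = (total_latency_ms - ttft_ms) / (num_output_tokens - 1)
print(f"TPOT = {tpot_ms:.2f} ms/token")  # -> 18.18 ms/token
```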

> @brisker It is normal that w4a8 first-token is slower than w8a8, since the additional dequant operation (on slower CUDA cores) of w4a8 slows down the main loop, even though...
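
To make that dequant overhead concrete, here is a minimal PyTorch rendering of the extra unpacking step a w4a8 main loop performs (an illustration of the idea, not the actual QQQ kernel):

```python
import torch

def unpack_int4_to_int8(packed: torch.Tensor) -> torch.Tensor:
    # Two signed 4-bit weights are packed per int8 byte; splitting and
    # sign-extending them runs on regular CUDA cores inside the GEMM
    # main loop, which is the extra work w4a8 pays relative to w8a8.
    lo = (packed << 4) >> 4   # sign-extend the low nibble to [-8, 7]
    hi = packed >> 4          # arithmetic shift sign-extends the high nibble
    return torch.stack([lo, hi], dim=-1).flatten(-2)

packed = torch.randint(-128, 128, (4096, 2048), dtype=torch.int8)
w4 = unpack_int4_to_int8(packed)  # (4096, 4096) int8 values in [-8, 7]
```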