cccpr

67 comments of cccpr

@Azure-Tang Is bf16 support done? Have you made a PR elsewhere?

@Tmn07 Does this PR also support per-channel w8 / per-token a8 (the most common quantization setting, as in SmoothQuant, etc.)?
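
For reference, a minimal sketch of what that setting means, assuming symmetric int8 quantization (helper names and shapes here are illustrative, not from the PR):

```python
import torch

def quantize_per_channel_w8(w: torch.Tensor):
    # Per-output-channel symmetric int8 quantization of a weight matrix
    # of shape (out_features, in_features): one scale per row.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    w_q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return w_q, scale

def quantize_per_token_a8(x: torch.Tensor):
    # Per-token symmetric int8 quantization of activations of shape
    # (num_tokens, in_features): one scale per token (row).
    scale = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    x_q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return x_q, scale

# The int8 GEMM accumulates in int32; the result is rescaled by the
# outer product of the token scales and the channel scales.
x, w = torch.randn(4, 64), torch.randn(128, 64)
x_q, sx = quantize_per_token_a8(x)
w_q, sw = quantize_per_channel_w8(w)
y = (x_q.to(torch.int32) @ w_q.to(torch.int32).T).float() * sx * sw.T
print((y - x @ w.T).abs().max())  # quantization-error sanity check
```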

I tried to use `pdb.set_trace()` (following the tutorial [here](https://triton-lang.org/main/programming-guide/chapter-3/debugging.html)) to debug the Triton kernel in lightllm, but got the following error: `AssertionError: Function "set_trace" is being called from a Triton function...
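
For anyone hitting the same assertion: as far as I can tell, `pdb` breakpoints only work when the kernel runs under Triton's CPU interpreter mode, which that tutorial enables via an environment variable. A minimal sketch (the kernel is a toy example, not lightllm's):

```python
import os
os.environ["TRITON_INTERPRET"] = "1"  # must be set before importing triton

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    # In interpreter mode you can now drop in:
    #   import pdb; pdb.set_trace()
    # For compiled (on-GPU) kernels, tl.device_print is the alternative:
    #   tl.device_print("x", x)
    tl.store(out_ptr + offs, x + y, mask=mask)

# Interpreter mode executes on the CPU, so plain CPU tensors work.
x, y, out = torch.randn(1024), torch.randn(1024), torch.empty(1024)
add_kernel[(4,)](x, y, out, 1024, BLOCK=256)
```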

@hiworldwzj May I ask whether lightllm's w8a8 Triton kernel has been benchmarked on llama for its speedup over fp16?
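
Lacking an official number, here is a rough sketch of how one could measure it with `triton.testing.do_bench`, using `torch._int_mm` as a stand-in for the actual w8a8 kernel (shapes are hypothetical, chosen to satisfy cuBLAS's int8 GEMM size constraints):

```python
import torch
import triton

# Hypothetical llama-like projection shapes; cuBLAS int8 GEMM needs
# M > 16 and K, N divisible by 8.
M, K, N = 32, 4096, 4096
a16 = torch.randn(M, K, device="cuda", dtype=torch.float16)
w16 = torch.randn(K, N, device="cuda", dtype=torch.float16)
a8 = torch.randint(-128, 128, (M, K), device="cuda", dtype=torch.int8)
w8 = torch.randint(-128, 128, (K, N), device="cuda", dtype=torch.int8)

ms_fp16 = triton.testing.do_bench(lambda: a16 @ w16)
ms_int8 = triton.testing.do_bench(lambda: torch._int_mm(a8, w8))
print(f"fp16: {ms_fp16:.3f} ms  int8: {ms_int8:.3f} ms")
```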

@gordicaleksa I only have an NVIDIA 4050 GPU (no A100 or V100); can the code run on my GPU?
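
For what it's worth, one quick sanity check is what the card itself reports (capability numbers in the comments are from NVIDIA's public specs):

```python
import torch

major, minor = torch.cuda.get_device_capability()
print(f"compute capability: {major}.{minor}")
vram_gb = torch.cuda.get_device_properties(0).total_memory / 2**30
print(f"VRAM: {vram_gb:.1f} GiB")
# V100 is sm70, A100 is sm80, and an RTX 4050 (Ada) reports 8.9, so a
# check like `capability >= (8, 0)` passes; the much smaller VRAM
# (6 GB on a laptop 4050 vs 40/80 GB on an A100) is usually the real limit.
```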

@HandH1998 I have tried the QQQ-w4a8-no-group version on InternVL-20B on my own task. The embarrassing thing is that, compared to w8a8, w4a8 is indeed faster at decoding as expected,...

@HandH1998 What do TTFT (ms) and TPOT (ms) actually mean in your chart?

@HandH1998 For `sq-w8a8` in your chart, which specific kernel are you referring to? In my experiments, I used the official w8a8 kernel from vLLM (CUTLASS backend).

> TPOT: Time Per decoding Output Token

Does **TPOT** already include the time for the first decoded token, or have you excluded the first-token time?
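
In other words, assuming the common convention where TPOT excludes the first token (the numbers below are made up purely for illustration):

```python
# TTFT: time from request arrival to the first generated token.
# TPOT (excluding the first token) would then be:
total_latency_ms = 2000.0    # hypothetical end-to-end latency
ttft_ms = 200.0              # hypothetical first-token latency
num_output_tokens = 100
tpot_ms = (total_latency_ms - ttft_ms) / (num_output_tokens - 1)
print(f"TPOT = {tpot_ms:.2f} ms/token")  # -> 18.18 ms/token
```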

> @brisker It is normal that w4a8 first-token is slower than w8a8, since the additional dequant operation (on slower CUDA cores) of w4a8 slows down the main loop, even though...
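
To make that dequant overhead concrete, here is a minimal PyTorch rendering of the extra unpacking step a w4a8 main loop performs (an illustration of the idea, not the actual QQQ kernel):

```python
import torch

def unpack_int4_to_int8(packed: torch.Tensor) -> torch.Tensor:
    # Two signed 4-bit weights are packed per int8 byte; splitting and
    # sign-extending them runs on regular CUDA cores inside the GEMM
    # main loop, which is the extra work w4a8 pays relative to w8a8.
    lo = (packed << 4) >> 4   # sign-extend the low nibble to [-8, 7]
    hi = packed >> 4          # arithmetic shift sign-extends the high nibble
    return torch.stack([lo, hi], dim=-1).flatten(-2)

packed = torch.randint(-128, 128, (4096, 2048), dtype=torch.int8)
w4 = unpack_int4_to_int8(packed)  # (4096, 4096) int8 values in [-8, 7]
```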