HongyuChen

Results: 12 comments by HongyuChen

Thank you for your reply. By "single-stream computational pipeline", do you mean that the time spent loading model weights from HBM to the cache will be counted in the...

If my understanding is correct, does the example `offline_inference_distributed.py` in the documentation use data parallelism across nodes and tensor parallelism within each node?

> Hi lequn, I think I found the bug in cutlass_shrink.
>
> Please first see [cutlass example 24 group gemm](https://github.com/NVIDIA/cutlass/blob/a75b4ac483166189a45290783cb0a18af5ff0ea5/examples/24_gemm_grouped/gemm_grouped.cu#L1529). The second parameter for `LinearCombination` should be `128 /...

@yzh119 Thanks for the reply, bro. I tried a smaller tile size as you suggested, and the performance did improve (by around 20%). But this still doesn't perform well...

I defined a model myself and called bgmv in it for some LoRA computations, so `indices=-1` resulted in a CUDA error. > I don't think LoRA should be captured in...

Yes, indeed I don't want LoRA to be captured. I think my error was caused by my misuse of the bgmv kernel.
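To make the misuse concrete, here is a minimal Python sketch (my own illustration, not vLLM's actual kernel) of bgmv-style semantics, assuming the convention that an index of `-1` marks a row that should skip the LoRA computation; a kernel that does not guard against `-1` would instead use it as an array index and read out of bounds, which would surface as a CUDA error:

```python
# Reference semantics of a bgmv-style LoRA gather-matmul (pure-Python sketch).
# Assumption: indices[i] == -1 means "no LoRA adapter for this row",
# so the output row is left untouched.

def bgmv_reference(y, x, weights, indices):
    """Compute y[i] += x[i] @ weights[indices[i]] for rows with indices[i] >= 0."""
    for i, idx in enumerate(indices):
        if idx < 0:          # sentinel: skip LoRA for this row
            continue
        w = weights[idx]     # adapter matrix, shape [in_dim][out_dim]
        for j in range(len(w[0])):
            y[i][j] += sum(x[i][k] * w[k][j] for k in range(len(x[i])))

# Example: two tokens, one adapter; token 1 opts out with index -1.
x = [[1.0, 2.0], [3.0, 4.0]]
y = [[0.0], [0.0]]
weights = [[[1.0], [1.0]]]   # one adapter mapping dim 2 -> dim 1
bgmv_reference(y, x, weights, indices=[0, -1])
# y[0] gets the LoRA contribution; y[1] is left unchanged
```

Under this convention, passing `-1` is only safe if the kernel checks for it before indexing the weight table.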

I'm currently using version 0.7.2; I think I'll try cudagraph for LoRA in version 0.8 in the future. Also, I'd like to ask a question: enabling cudagraph for LoRA doesn't...

> The 0.7.2 version should still be the V0 version of LoRA. For V0, vllm only captures cudagraph during the decode stage, and lora supports cudagraph, which you can confirm...

UPDATE: the run completes without reporting an error (the computation finishes), but `cutlass::reference::host::TensorEquals` fails.

> UPDATE: the running result: will not report error (computation is finished), but `cutlass::reference::host::TensorEquals` failed Maybe this is a numerical-accuracy issue rather than a correctness bug?
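One reason an exact element-wise comparison like `TensorEquals` can fail even when the kernel is correct is that a tiled GPU GEMM accumulates in a different order than a naive host reference, and floating-point addition is not associative. A small Python illustration (not CUTLASS code) of exact equality versus a relative-tolerance check:

```python
import math

# The same three values summed in two different orders round differently,
# which mirrors a tiled device GEMM versus a sequential host reference.
a = (0.1 + 0.2) + 0.3   # one accumulation order
b = 0.1 + (0.2 + 0.3)   # another order

print(a == b)                            # exact compare, like TensorEquals
print(math.isclose(a, b, rel_tol=1e-9))  # tolerance-based compare
```

If the mismatch is only at this magnitude, a relative/absolute-error check is usually the right verification, not bitwise equality.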