lukec

Results: 21 comments by lukec

> @sleepcoo Awesome! Thanks for your contribution! Before I get into review, could you double-check the new kernel produces correct outputs? When I tested it out, it didn't match our...

> Hi @sleepcoo, Is the bug fixed now? We will add the code format checker later. 🙏 Could you wrap up this PR first?

I fixed it, you can review...


> Hi @sleepcoo, thanks for submitting the PR and sorry for the delay in my review. I left some comments on the code style.
>
> BTW, could you update...

Recently, while learning vLLM, I came across `typename Vec<scalar_t, pack_size>::Type`. I found that the previously implemented half kernel is not very elegant. This submission only involves...
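For anyone reading along: the `typename Vec<scalar_t, pack_size>::Type` idiom in vLLM's CUDA kernels is a small trait that maps a scalar type and a pack size to a wider vector type, so a kernel can load and store several elements per instruction. A rough, self-contained sketch of the pattern (the names `Vec`, `Half2_`, `Float4_`, and `copy_packed` here are illustrative, not vLLM's actual definitions):

```cpp
#include <cstdint>

// Illustrative stand-ins for the CUDA vector types (half2, float4, ...).
struct Half2_  { uint32_t storage; };
struct Float4_ { float x, y, z, w; };

// Primary template left undefined: only supported (scalar, pack) pairs compile.
template <typename scalar_t, int pack_size>
struct Vec;

template <> struct Vec<float, 1>    { using Type = float;   };
template <> struct Vec<float, 4>    { using Type = Float4_; };
template <> struct Vec<uint16_t, 2> { using Type = Half2_;  };  // fp16 pair

// Generic code can then stay pack-size-agnostic:
template <typename scalar_t, int pack_size>
void copy_packed(const void* src, void* dst, int n_packs) {
  using vec_t = typename Vec<scalar_t, pack_size>::Type;
  const vec_t* s = static_cast<const vec_t*>(src);
  vec_t* d = static_cast<vec_t*>(dst);
  for (int i = 0; i < n_packs; ++i) d[i] = s[i];  // one wide move per pack
}
```

The benefit is that the same kernel body works for fp32 and fp16 at different pack widths, instead of hand-writing a separate half kernel.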

This is really important to me. I use the Google style locally, but there are many formatting conflicts with the code you submitted.

I have implemented a simple version of the prefix cache function, which shows significant performance improvement in specific scenarios. Do you require this feature? If so, I can prepare a...


> @sleepcoo Any way I could be helpful here? I am interested in working on this too.

You can try the implementation at https://github.com/vllm-project/vllm/pull/1669; it's quite comprehensive. I've given up...
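For context on the feature discussed in this thread: prefix caching reuses the KV cache computed for an already-seen prompt prefix, so only the new suffix has to be prefilled. The sketch below is a deliberately simplified illustration of the bookkeeping (the `PrefixCache` type and its methods are hypothetical, not the code from the linked PR or from the simple version mentioned above):

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Map a token-id prefix to the id of a cached KV block. Real implementations
// (e.g. radix-tree based ones) share partial prefixes and handle eviction;
// this sketch only shows the longest-prefix lookup idea.
struct PrefixCache {
  std::map<std::vector<int32_t>, int> kv_block_of_prefix;  // prefix -> KV block id

  // Return {matched prefix length, KV block id} for the longest cached prefix
  // of `tokens`, or {0, -1} if nothing matches.
  std::pair<size_t, int> longest_match(const std::vector<int32_t>& tokens) const {
    for (size_t len = tokens.size(); len > 0; --len) {
      std::vector<int32_t> prefix(tokens.begin(), tokens.begin() + len);
      auto it = kv_block_of_prefix.find(prefix);
      if (it != kv_block_of_prefix.end()) return {len, it->second};
    }
    return {0, -1};
  }

  // After prefill, remember the full prompt so later requests can reuse it.
  void insert(const std::vector<int32_t>& tokens, int kv_block_id) {
    kv_block_of_prefix[tokens] = kv_block_id;
  }
};
```

A request whose prompt begins with a cached prefix then only needs prefill on the remaining tokens, which is why the gains show up mainly in scenarios with shared prefixes such as common system prompts or few-shot templates.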

> For fully offloaded GPU inference, the main part that needs to be modified is the FFN computation graph: we need to provide a fast path that removes the parts related to CPU-GPU hybrid computation; for example, the GPU index and the GPU bucket are not needed in that case. This code is in the `llama.cpp/llm_build_ffn_sparse` function. We will start on this work soon. In addition, when CPU-GPU hybrid computation does not need to be considered, a similar fast path can also be provided in the low-level GPU operators, in `ggml-cuda.cu`.
>
> The sparsity of the attention layers varies considerably from model to model and is not pronounced in the models we currently support, so we do not plan to support it in this open-source code. See the discussion in #111.
>
> The primary modifications needed for complete GPU offload are in the FFN's computation graph. We need to provide...
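To make the quoted plan concrete, here is a purely hypothetical sketch of what such a "fast path" branch could look like in a graph builder; all names below (`build_ffn_hybrid`, `build_ffn_gpu_only`, the `gpu_index`/`gpu_bucket` fields) are placeholders and not the actual PowerInfer/llama.cpp API:

```cpp
// Hypothetical sketch only: the real code lives in llm_build_ffn_sparse and
// ggml-cuda.cu; these names are stand-ins for illustration.
struct ggml_tensor;  // opaque stand-in for the real graph node type

struct ffn_ctx {
  bool fully_offloaded = false;  // true when all FFN weights are GPU-resident
  void* gpu_index = nullptr;     // CPU/GPU split bookkeeping, unused on the fast path
  void* gpu_bucket = nullptr;
};

// Existing mixed CPU/GPU path: builds the graph using gpu_index / gpu_bucket.
static ggml_tensor* build_ffn_hybrid(ffn_ctx& /*ctx*/, ggml_tensor* cur) { return cur; }

// Fast path: no hybrid-computation plumbing at all.
static ggml_tensor* build_ffn_gpu_only(ffn_ctx& /*ctx*/, ggml_tensor* cur) { return cur; }

static ggml_tensor* build_ffn(ffn_ctx& ctx, ggml_tensor* cur) {
  if (ctx.fully_offloaded) {
    // Skip everything related to CPU-GPU hybrid computation: gpu_index and
    // gpu_bucket never enter the graph, and the corresponding GPU operators
    // can take a similarly simplified path.
    return build_ffn_gpu_only(ctx, cur);
  }
  return build_ffn_hybrid(ctx, cur);
}
```

The point of the branch is that when every FFN weight is resident on the GPU, none of the CPU/GPU routing metadata needs to enter the computation graph, and the matching GPU operators can likewise skip the hybrid-specific logic.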