Jiarui Fang(方佳瑞)
He is right: SP and TP cannot work together :(
@mauriceweber could you review this PR?
I believe that beam search is a must-have feature for the LLM serving framework.
> @feifeibear OK, we will support this feature soon. Thanks! Let me know if you need help.
Hello @ctlllll, thanks for providing such a wonderful project. I am interested in the fine-grained KV cache management part. Could you offer me more guidance on this? I...
I installed it via requirements.txt. The versions are listed as follows: gym 0.26.2, gym-notices 0.0.8
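> Combining the analysis above with the benchmark data, how should I understand that Ulysses alone performs worse than a mix of Ulysses and Ring? Is it because Ulysses alone, after the all-to-all, splits h into fairly small pieces, which hurts the compute density of the GEMM? You have misread the figure. It shows that on a single node with eight GPUs connected by NVLink, Ring-Attention performs worse than Ulysses. Ulysses is much faster than Ring overall, because Ring splits the full attention computation into pieces, which lengthens the total compute time. Ulysses adds an extra all2all, but it accounts for only a small share of the time. Comparing the two, Ring is at a disadvantage.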
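> > > Combining the analysis above with the benchmark data, how should I understand that Ulysses alone performs worse than a mix of Ulysses and Ring? Is it because Ulysses alone, after the all-to-all, splits h into fairly small pieces, which hurts the compute density of the GEMM? > > > > > > You have misread the figure. It shows that on a single node with eight GPUs connected by NVLink, Ring-Attention performs worse than Ulysses. Ulysses is much faster than Ring overall, because Ring splits the full attention computation into pieces, which lengthens the total compute time. Ulysses adds an extra all2all, but it accounts for only a small share of the time. Comparing the two, Ring is at a disadvantage. > > Got it, thank you for your work and your reply. I have no questions about the logic the figure conveys;...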
> The idea is actually feasible. However, we have not yet tested whether our approach will cause the GPU to become compute-bound too quickly, thereby affecting the overall throughput...