SnippetZero
I ran alpaca-lora finetuning with both load_in_8bit=True and load_in_8bit=False. With load_in_8bit=True, training was more than twice as slow as with False. What could be the reason for this?
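For context, a minimal sketch of the two configurations being compared (assumptions not in the original post: a placeholder LLaMA checkpoint name, `device_map="auto"`, and a transformers version that still accepts `load_in_8bit` directly rather than via `BitsAndBytesConfig`):

```python
import torch
from transformers import AutoModelForCausalLM

BASE_MODEL = "decapoda-research/llama-7b-hf"  # placeholder; substitute your own checkpoint

# 8-bit path: weights are quantized with bitsandbytes LLM.int8(). Each linear
# layer then pays for int8<->fp16 conversion and outlier handling inside its
# matmul, which is a commonly cited reason the forward/backward pass is slower
# than plain fp16 even though it uses much less memory.
model_int8 = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

# fp16 path: larger memory footprint, but no per-layer quantization overhead.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    load_in_8bit=False,
    torch_dtype=torch.float16,
    device_map="auto",
)
```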
@zheyishine Hi, has there been any progress?
> In fact, we have already implemented the Medusa TreeMask version in LMDeploy. When batch=1, the acceleration ratio and RPS improvement relative to the main branch are consistent with those...
> I will split the internal implementation of the TreeMask version into multiple PRs and then submit them.

Thank you, could you share the methods to solve the performance degradation...
> EAGLE has a higher computational load than Medusa, but it has a higher acceptance rate. It performs better in large batches compared to Medusa. However, this is just a...
https://github.com/sgl-project/sglang/pull/6151
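A back-of-envelope way to read the tradeoff quoted above (a sketch only; all numbers below are hypothetical and are not LMDeploy, sglang, Medusa, or EAGLE measurements):

```python
def effective_speedup(mean_accepted_tokens: float, step_cost_ratio: float) -> float:
    """Tokens committed per unit of baseline decode time.

    mean_accepted_tokens: average tokens accepted per speculative step
                          (1.0 would match plain autoregressive decoding).
    step_cost_ratio: cost of one speculative step (draft + tree verification)
                     relative to one plain decode step.
    """
    return mean_accepted_tokens / step_cost_ratio

# Hypothetical small-batch case: verification is largely memory-bound, so a
# heavier drafter's extra compute is mostly hidden and both methods look good.
print(effective_speedup(mean_accepted_tokens=2.5, step_cost_ratio=1.2))  # ~2.08x
print(effective_speedup(mean_accepted_tokens=3.2, step_cost_ratio=1.4))  # ~2.29x

# Hypothetical large-batch case: the same steps become compute-bound and thus
# relatively more expensive, so the higher acceptance rate matters more than
# the lower per-step cost.
print(effective_speedup(mean_accepted_tokens=2.5, step_cost_ratio=2.0))  # ~1.25x
print(effective_speedup(mean_accepted_tokens=3.2, step_cost_ratio=2.2))  # ~1.45x
```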
@lzhangzz Hi, is there a plan to implement FA3 in the turbomind engine? Thanks!