jeffye-dev

Results 8 comments of jeffye-dev

AR is OFF in my environment. Is this caused by multi-QP? I tried an earlier version and found that it works.

I run the low-latency tests on 6*H100 with IB network cards, and the issue is 100% reproducible. This stops me from using the latest version of DeepEP, so I have to use an older DeepEP....

Thanks, with round_scale=False the issue is gone. Closing this issue then.

Thanks for the explanation. So I have to wait until the MRs are merged and then use the correct configuration? BTW, enable_attention_dp=false might cause GPU hangs in my case.

When will these MRs be merged? I'd like to try them as soon as they land. It would also help to have documentation on how to reproduce the performance. @juney-nvidia @Kefeng-Duan

Thanks @Kefeng-Duan for the assistance. I made the changes accordingly and ran some tests using trtllm-bench. Now I get a very close result: 207 tok/sec/user with batch=1. If I set batch=10, the...

So the point is MTP=3? It would be nice to have the MTP feature in production, unless MTP decreases accuracy. I cannot see the acceptance rate at runtime, so it's hard to judge how...