jeffye-dev comments

Results: 8 comments of jeffye-dev

[test_internode.py] failed on multi-QP: dispatch timeout on ROCE network with testing 2*H20 nodes

AR is off in my environment. Is this caused by multi-QP? I tried an earlier version and found that it works.

UE8M0(PR206) features cause severe a regression issue and cause low-latency stuck

Any progress?

UE8M0(PR206) features cause severe a regression issue and cause low-latency stuck

I ran the low-latency tests on 6*H100 with IB network cards, and the issue is 100% reproducible. This stops me from using the latest version of DeepEP, so I have to use an older DeepEP....

UE8M0(PR206) features cause severe a regression issue and cause low-latency stuck

Thanks, with round_scale=False the issue is gone. This issue can be closed then.

How to achieve 253 tok/sec with DeepSeek-R1-FP4 on 8xB200

Thanks for the explanations. So I have to wait until the MRs are merged and then use the correct configuration? BTW, enable_attention_dp=false might cause GPU hangs in my case.

How to achieve 253 tok/sec with DeepSeek-R1-FP4 on 8xB200

When will these MRs be merged? I'd like to try them as soon as they land. It would also help to have a document on how to reproduce the performance. @juney-nvidia @Kefeng-Duan

How to achieve 253 tok/sec with DeepSeek-R1-FP4 on 8xB200

Thanks @Kefeng-Duan for the assistance. I made the changes accordingly and ran some tests using trtllm-bench. Now I get a very close result: 207 tok/sec/user with batch=1. If I set batch=10, the...

How to achieve 253 tok/sec with DeepSeek-R1-FP4 on 8xB200

So the point is MTP=3? It would be nice to have the MTP feature in production, unless MTP decreases accuracy. I cannot see the acceptance rate at runtime, so it's hard to judge how...