jeffye-dev
I want to reproduce the DeepSeek-R1-FP4 deployment on B200 to match the results in the blog: https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance. However, I only get 40 output tokens per second per user, compared with the...
[test_internode.py] failed on multi-QP: dispatch timeout on RoCE network when testing 2*H20 nodes
When I run the cross-node test with `MASTER_ADDR= MASTER_PORT=30001 WORLD_SIZE=2 RANK=0 python test_internode.py` on 2*H20 nodes, I get the following timeout log: ``` DeepEP dispatch NVL receiver timeout, channel: 7,...
In recent SGLang PD disaggregation integration tests, we found that it gets stuck 100% of the time in the DeepEP dispatch-combine call. When it is stuck, the stack of the low_latency.py unit test looks like this: > __torch_function__...