xutizhou

Results 3 issues of xutizhou

Hi team, thank you for your excellent work. I wonder whether this repo could support the Ada Lovelace architecture, such as the L20 GPU. Thanks.

Dispatch bandwidth is around 20 GB/s, while combine bandwidth peaks near 50 GB/s.

I have tested normal-mode DeepEP on node2/node4/node4 and always encounter a DeepEP timeout check failure when num_tokens=128. Here is my test code.

```python
def test_loop(local_rank: int, num_local_ranks: int, args: argparse.Namespace):
    num_nodes...
```