Jiemin Lin
I hit the same bug, and no environment variable fixes it. It is caused by `/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp`, line 132, which sets `kProcessGroupNCCLDefaultTimeout` to `10 * 60 * 1000` (milliseconds, i.e. 10 minutes). However, in my...
I also see that many users hit the same bug. I wonder whether there is any way to fix this. cc @burling @Orangels @CallmeZhangChenchen @aooxin in [issue 3368](https://github.com/sgl-project/sglang/issues/3368) cc...
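
For anyone who wants to experiment: PyTorch's `torch.distributed.init_process_group` accepts a `timeout` argument, so the 10-minute default can in principle be overridden where the process group is created. Below is a minimal, untested sketch of that idea. The helper names `nccl_init_kwargs` and `init_with_timeout` are hypothetical, and I have not checked where sglang makes this call or whether it exposes the value, so treat this as an assumption rather than a confirmed fix.

```python
from datetime import timedelta


def nccl_init_kwargs(timeout_minutes: int) -> dict:
    """Build keyword arguments for torch.distributed.init_process_group.

    Hypothetical helper: only the `backend` and `timeout` keywords are
    real PyTorch parameters; everything else here is illustrative.
    """
    if timeout_minutes <= 0:
        raise ValueError("timeout_minutes must be positive")
    return {
        "backend": "nccl",
        # Overrides the 10-minute default (10 * 60 * 1000 ms) quoted above.
        "timeout": timedelta(minutes=timeout_minutes),
    }


def init_with_timeout(timeout_minutes: int = 60) -> None:
    """Initialize the default process group with a longer timeout.

    torch is imported lazily so the helper above works without it.
    Rank, world size, and master address are assumed to come from the
    usual environment variables (env:// initialization).
    """
    import torch.distributed as dist

    dist.init_process_group(**nccl_init_kwargs(timeout_minutes))
```

Calling `init_with_timeout(60)` would request a one-hour timeout instead of ten minutes. Whether that actually helps depends on the hang being slow rather than permanent.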
> Does it work to set `--watchdog-timeout 3600`? This allows for 1 hour timeout.

No, it doesn't work. The default value of `watchdog-timeout` is `300` (in `sglang\srt\server_args.py`, line 81), so...
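
To make the numbers concrete, here is a small sketch comparing the timeouts mentioned in this thread. It assumes the two timers are independent and that whichever one is shortest fires first; that is my reading of the behavior, not something confirmed from the sglang or PyTorch source. The function name `first_to_expire` is made up for illustration.

```python
from datetime import timedelta

# Values quoted in this thread.
NCCL_DEFAULT = timedelta(milliseconds=10 * 60 * 1000)  # kProcessGroupNCCLDefaultTimeout
WATCHDOG_DEFAULT = timedelta(seconds=300)              # default --watchdog-timeout
WATCHDOG_RAISED = timedelta(seconds=3600)              # --watchdog-timeout 3600


def first_to_expire(timeouts):
    """Return the name of the shortest timeout.

    Assumption: the timers run independently, so the shortest one
    is the first to trigger.
    """
    return min(timeouts, key=timeouts.get)


default_case = first_to_expire({"nccl": NCCL_DEFAULT, "watchdog": WATCHDOG_DEFAULT})
raised_case = first_to_expire({"nccl": NCCL_DEFAULT, "watchdog": WATCHDOG_RAISED})
print(default_case, raised_case)
```

Under that assumption, raising only `--watchdog-timeout` leaves the 10-minute NCCL limit as the shortest timer, which would be consistent with the flag not helping.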
> I suspect that this is not due to slow weight loading, there might be some communication issues cross-node. I will look into it.

I appreciate your help. I...
I will try it later. Thanks!