DeepEP
DeepEP copied to clipboard
When deploying with multiple machines, the following error was encountered
When deploying PD-separated Deepseek v3 and running multi-machine Decode with two machines, it works normally. However, when using four machines, the following errors occur. I hope to get your help. Figure 1 shows the error on the master machine, and Figure 2 shows the error on the sub machine. I'm not sure whether this error is related to DeepEP or not. Can anyone provide some help?
It seems that your program encountered an issue while establishing TCP socket connections. Please check your TCP network environment.