meiyp

Search results: 3 issues by meiyp

**Environment:** a container created from the nvcr.io/nvidia/tensorflow:21.12-tf1-py3 image
**Code:** FastNN/resnet/resnet_split.py
**Commands:**
Server 1: TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":0}}' bash scripts/train_split.sh
Server 2: TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":1}}' bash scripts/train_split.sh
Output on server 1: ![image](https://github.com/alibaba/EasyParallelLibrary/assets/55943192/5bdffa20-3c77-4109-88c6-d1f2fc6d7586)
Output on server 2: ![image](https://github.com/alibaba/EasyParallelLibrary/assets/55943192/1e5d021e-9b53-4ac1-937e-9de9c2f6bc7f)
As shown, server 1 prints "still waiting" only twice and then stops printing, which indicates it has received server 2's reply, yet execution does not proceed any further.
**Additional note:** In the same environment, BERT runs in distributed mode without issue, so the servers can connect to each other and run distributed training normally.
Is this a problem with how I am running it, or does the code need to be modified?

**Environment:** a container created from the nvcr.io/nvidia/tensorflow:21.12-tf1-py3 image
**Code:** FastNN/resnet/resnet_split.py
**Commands:**
Server 1: TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":0}}' bash scripts/train_split.sh
Server 2: TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":1}}' bash scripts/train_split.sh
Output on server 1: ![image](https://github.com/alibaba/FastNN/assets/55943192/58d97a8f-fa61-4239-a70b-fd8d1c4ba58b)
Output on server 2: ![image](https://github.com/alibaba/FastNN/assets/55943192/f8b7b791-98f6-4fb3-ac4f-3f979f64ee7f)
As shown, server 1 prints "still waiting" only twice and then stops printing, which indicates it has received server 2's reply, yet execution does not proceed any further.
**Additional note:** In the same environment, BERT runs in distributed mode without issue, so the servers can connect to each other and run distributed training normally.
Is this a problem with how I am running it, or does the code need to be modified?
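As a quick sanity check on the cluster definition used in the commands above, the TF_CONFIG JSON can be parsed in plain Python. This is a minimal sketch for inspecting the value before launch (TensorFlow's distributed runtime reads the same environment variable internally); the hard-coded string below simply mirrors server 1's setting from the commands.

```python
import json
import os

# TF_CONFIG as set for server 1 in the commands above
# (hard-coded here for illustration; normally set by the launcher).
os.environ["TF_CONFIG"] = (
    '{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},'
    '"task":{"type":"worker","index":0}}'
)

tf_config = json.loads(os.environ["TF_CONFIG"])
workers = tf_config["cluster"]["worker"]   # list of worker host:port addresses
task = tf_config["task"]                   # this process's role in the cluster

print(f"This process is {task['type']} {task['index']}: {workers[task['index']]}")
print(f"Cluster has {len(workers)} workers: {workers}")
```

If either server prints an unexpected address or index here, the hang is likely a configuration mismatch rather than a bug in resnet_split.py.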

I am using a vLLM server to deploy a MoE model. However, this model has a large number of experts while the number of activated experts is very small. So...

enhancement