benchmark with cross barrier error

Open panpanli521 opened this issue 3 years ago • 0 comments

I benchmarked the performance of BytePS with cross barrier using the script in /example/pytorch/benchmark_cross_barrier_byteps.py.

The complete commands are as follows:

  • scheduler:

    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=scheduler
    export DMLC_NUM_SERVER=2
    export DMLC_PS_ROOT_URI=ip1
    export DMLC_PS_ROOT_PORT=1234
    export DMLC_INTERFACE=xgbe1
    export DMLC_NODE_HOST=ip1
    bpslaunch

  • server1:

    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=server
    export DMLC_NUM_SERVER=2
    export DMLC_PS_ROOT_URI=ip1
    export DMLC_PS_ROOT_PORT=1234
    export DMLC_INTERFACE=xgbe1
    export DMLC_NODE_HOST=ip1
    bpslaunch

  • server2:

    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=server
    export DMLC_NUM_SERVER=2
    export DMLC_PS_ROOT_URI=ip1
    export DMLC_PS_ROOT_PORT=1234
    export DMLC_INTERFACE=xgbe1
    export DMLC_NODE_HOST=ip2
    bpslaunch

  • worker1:

    export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    export DMLC_WORKER_ID=0
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=worker
    export DMLC_NUM_SERVER=2
    export DMLC_PS_ROOT_URI=ip1
    export DMLC_PS_ROOT_PORT=1234 # the scheduler port
    export DMLC_INTERFACE=xgbe1
    export DMLC_NODE_HOST=ip3
    bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_cross_barrier_byteps.py --model resnet50 --batch-size 64 --num-iters 500

  • worker2:

    export NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    export DMLC_WORKER_ID=1
    export DMLC_NUM_WORKER=2
    export DMLC_ROLE=worker
    export DMLC_NUM_SERVER=2
    export DMLC_PS_ROOT_URI=ip1
    export DMLC_PS_ROOT_PORT=1234
    export DMLC_INTERFACE=xgbe1
    export DMLC_NODE_HOST=ip4
    bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_cross_barrier_byteps.py --model resnet50 --batch-size 64 --num-iters 500
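
The five environments above share most settings and differ only in role, node host, and (for workers) worker ID and visible GPUs. To make that easier to check, here is a minimal sketch that rebuilds them from the shared values. The names `COMMON`, `make_env`, `ENVS`, and `as_exports` are hypothetical helpers of my own, not part of BytePS; only the variable names and values are taken from the commands above.

```python
# Hypothetical helper, NOT part of BytePS: rebuilds the environment for each
# node from the values listed in the commands above.

# Settings that are identical on every node.
COMMON = {
    "DMLC_NUM_WORKER": "2",
    "DMLC_NUM_SERVER": "2",
    "DMLC_PS_ROOT_URI": "ip1",
    "DMLC_PS_ROOT_PORT": "1234",  # the scheduler port
    "DMLC_INTERFACE": "xgbe1",
}


def make_env(role, node_host, worker_id=None):
    """Return the environment dict for one node."""
    env = dict(COMMON, DMLC_ROLE=role, DMLC_NODE_HOST=node_host)
    if role == "worker":
        if worker_id is None:
            raise ValueError("workers need a worker_id")
        # Only the worker commands set these two variables.
        env["DMLC_WORKER_ID"] = str(worker_id)
        env["NVIDIA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
    return env


ENVS = {
    "scheduler": make_env("scheduler", "ip1"),
    "server1": make_env("server", "ip1"),
    "server2": make_env("server", "ip2"),
    "worker1": make_env("worker", "ip3", worker_id=0),
    "worker2": make_env("worker", "ip4", worker_id=1),
}


def as_exports(env):
    """Render an environment dict as shell export lines."""
    return "\n".join(f"export {key}={value}" for key, value in env.items())


if __name__ == "__main__":
    for name, env in ENVS.items():
        print(f"# {name}\n{as_exports(env)}\n")
```

Running the script prints one block of `export` lines per node, which can be compared against what was actually set on each machine.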

After running these commands, worker1 prints its throughput, but worker2 hangs: image

Finished: image

panpanli521 · Feb 08 '22 13:02