
Performance differs across repeated runs

Open wuyujiji opened this issue 5 years ago • 3 comments

Phenomenon

Thanks for your excellent work! I recently ran the example pytorch/benchmark_byteps.py with RDMA distributed training, following https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md, and found a strange phenomenon: the training speed differs each time I re-run the same training script in the same environment. For example, I ran the local mode with 2 workers + 2 servers (two machines, each hosting one worker and one server):

The result of the first run:

image

I then stopped the run with Ctrl+C, killed all python3 processes with the command ps -ef|grep python3|grep -v grep|awk '{print$2}'|xargs kill -9, and re-ran the same training script. The performance was different:

The result of the second run:

image

Again, I stopped the run with Ctrl+C, killed all python3 processes with the command ps -ef|grep python3|grep -v grep|awk '{print$2}'|xargs kill -9, and re-ran the same training script. Surprisingly, the performance returned to normal:

The result of the third run:

image

All three experiments were run on the same two machines.
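One way to put numbers on the run-to-run difference is to compare per-iteration throughput across saved logs. The following is a minimal sketch, not part of the original report. It assumes each run's console output was saved to a text file and that the benchmark prints lines of the form "Iter #N: X img/sec per GPU" (an assumption about the log format, not confirmed in this thread). The function names are hypothetical.

```python
import re
import statistics
import sys

# Assumed log line format (not confirmed in this thread):
#   "Iter #3: 251.7 img/sec per GPU"
ITER_RE = re.compile(r"Iter #\d+:\s*([0-9]+(?:\.[0-9]+)?) img/sec")


def throughputs(log_text):
    """Extract the per-iteration img/sec values from one run's log text."""
    return [float(m.group(1)) for m in ITER_RE.finditer(log_text)]


def summarize(log_text):
    """Return count, mean, sample stdev, and coefficient of variation for one run."""
    values = throughputs(log_text)
    if not values:
        raise ValueError("no throughput lines found in log")
    mean = statistics.mean(values)
    stdev = statistics.stdev(values) if len(values) > 1 else 0.0
    cv = stdev / mean if mean else 0.0
    return {"n": len(values), "mean": mean, "stdev": stdev, "cv": cv}


def relative_gap(log_a, log_b):
    """Relative difference between the mean throughput of two runs (0.0 to 1.0)."""
    mean_a = summarize(log_a)["mean"]
    mean_b = summarize(log_b)["mean"]
    return abs(mean_a - mean_b) / max(mean_a, mean_b)


if __name__ == "__main__":
    # Usage (hypothetical file names): python compare_runs.py run1.log run2.log run3.log
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            print(path, summarize(f.read()))
```

A small coefficient of variation within each run combined with a large gap between runs would suggest that the slowdown is decided at startup (for example, by how connections are established) rather than by noise during training.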

Environment (please complete the following information):

  • OS: CentOS 7.4
  • byteps version: 0.2.5
  • CUDA and NCCL version: CUDA 11.0 and NCCL 2.4.7
  • Framework (TF, PyTorch, MXNet): PyTorch 1.4.0
  • Model: VGG 19
  • batch size: 128

The launch script and Docker image are the same as in https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md

wuyujiji avatar Dec 08 '20 11:12 wuyujiji

Can you verify your RDMA performance using this benchmark? https://github.com/bytedance/ps-lite#2-benchmark-with-ipc-support

ymjiang avatar Dec 08 '20 11:12 ymjiang

I tested with it and the result is normal; otherwise I could not have obtained the stable results of the first and third runs.

wuyujiji avatar Dec 08 '20 11:12 wuyujiji

This is the output of test_ipc_benchmark: image

wuyujiji avatar Dec 09 '20 05:12 wuyujiji