Performance differs across repeated runs of the same training script
Phenomenon
Thanks for your excellent work! I recently ran the example pytorch/benchmark_byteps.py with RDMA distributed training, following https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md, and observed a strange phenomenon: repeated executions of the same training shell script in the same environment yield different speeds. For example, I ran a 2-worker + 2-server setup (two machines, each hosting one worker and one server).
The result of the first run:

Then I stopped the run with Ctrl+C, killed all python3 processes with ps -ef|grep python3|grep -v grep|awk '{print$2}'|xargs kill -9, and re-executed the same training script (the full restart sequence is sketched after the three results); this time the performance was different.
The result of the second run:

Again, I stopped the run with Ctrl+C, killed all python3 processes with the same kill command, and re-executed the same training script; surprisingly, the performance returned to normal.
The result of the third run:

All three experiments were run on the same two machines.
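For reference, the cleanup-and-restart sequence between runs was essentially the following (the script name on the last line is a placeholder for the launch script from the tutorial):

```bash
# Stop the current run (Ctrl+C), then make sure no stale BytePS/ps-lite
# python3 processes survive before relaunching:
ps -ef | grep python3 | grep -v grep | awk '{print $2}' | xargs kill -9

# Re-execute the same training script (placeholder name; the actual script
# follows the step-by-step tutorial):
bash run_benchmark.sh
```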
Environment (please complete the following information):
- OS: CentOS 7.4
- byteps version: 0.2.5
- CUDA and NCCL version: CUDA 11.0 and NCCL 2.4.7
- Framework (TF, PyTorch, MXNet): PyTorch 1.4.0
- Model: VGG 19
- batch size: 128
The launch shell script and Docker setup are the same as in https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md.
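For context, the per-machine launch following that tutorial looks roughly like the sketch below (worker role shown); the IP, port, network interface, GPU list, script path, and benchmark flags are placeholders, and the exact environment variables should be taken from the linked tutorial rather than from this sketch:

```bash
# Rough sketch of a per-machine BytePS RDMA launch (worker role shown).
# All concrete values below are placeholders.
export DMLC_ENABLE_RDMA=1                 # use RDMA transport instead of TCP
export DMLC_INTERFACE=eth5                # RDMA-capable NIC (placeholder)
export DMLC_NUM_WORKER=2
export DMLC_NUM_SERVER=2
export DMLC_PS_ROOT_URI=10.0.0.1          # scheduler IP (placeholder)
export DMLC_PS_ROOT_PORT=9000             # scheduler port (placeholder)

export DMLC_ROLE=worker                   # each machine also runs a separate
export DMLC_WORKER_ID=0                   # server process with DMLC_ROLE=server
export NVIDIA_VISIBLE_DEVICES=0,1,2,3     # GPUs used by this worker

bpslaunch python3 byteps/example/pytorch/benchmark_byteps.py \
    --model vgg19 --batch-size 128 --num-iters 300
```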
Can you verify your RDMA performance using this benchmark? https://github.com/bytedance/ps-lite#2-benchmark-with-ipc-support
I tested with it and the result is normal; otherwise I would not have been able to get the stable results seen in the first and third runs.
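For reference, that benchmark is launched with one process per role, roughly as sketched below; the IPs, ports, and any size/iteration arguments are placeholders, and the exact invocation follows the linked ps-lite README:

```bash
# Rough per-role launch sketch for the ps-lite RDMA/IPC benchmark.
# Concrete values are placeholders; see the ps-lite README for exact usage.
export DMLC_ENABLE_RDMA=1
export DMLC_NUM_WORKER=1
export DMLC_NUM_SERVER=1
export DMLC_PS_ROOT_URI=10.0.0.1   # scheduler IP (placeholder)
export DMLC_PS_ROOT_PORT=8000      # scheduler port (placeholder)

# One process per role (scheduler/server on one machine, worker on the other):
DMLC_ROLE=scheduler ./tests/test_ipc_benchmark
DMLC_ROLE=server    ./tests/test_ipc_benchmark
DMLC_ROLE=worker    ./tests/test_ipc_benchmark
```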
This is the output of test_ipc_benchmark:
