Running Intel Caffe multi-node with MLSL on AMD CPUs stops at Iteration 0

Open Tron-x opened this issue 7 years ago • 3 comments

When I run Intel Caffe on multiple nodes (four nodes) with MLSL on AMD CPUs, something goes wrong: training stops at Iteration 0. Running on a single node works fine.

[image]

This is what htop shows on every node:

[image]

My run command is:

./scripts/run_intelcaffe.sh --hostfile /opt/caffe/mpd.hosts --network tcp --netmask enp3s0f0 --caffe_bin /opt/caffe/build/tools/caffe --solver /opt/caffe/models/intel_optimized_models/multinode/alexnet_4nodes/solver.prototxt

I think something is wrong with MLSL, because when I run with my own OpenMPI it works. My MLSL version is:

[image]

Tron-x avatar Sep 06 '18 05:09 Tron-x

Hi @Tron-x, could you please specify how you launch IntelCaffe over OpenMPI? As far as I know, IntelCaffe uses MLSL only for multi-node communication. MLSL uses Intel MPI under the hood but can be rebuilt with OpenMPI support by specifying MPIRT = openmpi in the MLSL Makefile.
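The Makefile change mentioned above could be scripted roughly as follows. This is only a sketch: the helper name `set_mpirt_openmpi` is hypothetical, it assumes the Makefile contains a plain `MPIRT = ...` assignment, and the demo runs on a stand-in file whose initial value is a placeholder rather than MLSL's actual default.

```shell
# Hypothetical helper: rewrite the MPIRT assignment in a Makefile to select OpenMPI.
# Assumes a plain "MPIRT = <value>" line; a "?=" or ":=" assignment would not match.
set_mpirt_openmpi() {
    makefile="$1"
    # Keep a .bak copy of the original file next to it.
    sed -i.bak 's/^MPIRT[[:space:]]*=.*/MPIRT = openmpi/' "$makefile"
}

# Demo on a stand-in file (the value "placeholder" is not MLSL's real default).
demo=$(mktemp)
printf 'MPIRT = placeholder\n' > "$demo"
set_mpirt_openmpi "$demo"
cat "$demo"
```

On a real checkout, the helper would be pointed at the MLSL Makefile and MLSL rebuilt afterwards; the rebuild steps themselves are not covered in this thread.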

mshiryaev avatar Sep 06 '18 14:09 mshiryaev

Hi @mshiryaev, when I use OpenMPI, I launch Intel Caffe with a command such as:

[image]

I use five nodes. Every node launches 8 processes, with OpenMP threads set to 8. Each node has 64 cores.

Tron-x avatar Sep 06 '18 15:09 Tron-x

Hi @Tron-x, besides @mshiryaev's suggestion to build MLSL with OpenMPI, could you please also try setting the environment variable "I_MPI_HYDRA_TOPOLIB=hwloc" to check whether this helps with the out-of-the-box MLSL/Intel MPI?
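A minimal sketch of that check, assuming the variable is exported in the shell that starts the launcher and that the launcher passes the environment on to the MPI processes (not confirmed in this thread). The launch command is the one from the original report, left commented out here:

```shell
# Set the topology library for Intel MPI's Hydra process manager, as suggested above.
export I_MPI_HYDRA_TOPOLIB=hwloc

# Confirm the variable is visible in the launching shell.
echo "I_MPI_HYDRA_TOPOLIB=${I_MPI_HYDRA_TOPOLIB}"

# Then re-run the same multi-node command from the original report:
# ./scripts/run_intelcaffe.sh --hostfile /opt/caffe/mpd.hosts --network tcp --netmask enp3s0f0 \
#     --caffe_bin /opt/caffe/build/tools/caffe \
#     --solver /opt/caffe/models/intel_optimized_models/multinode/alexnet_4nodes/solver.prototxt
```

If training still stops at Iteration 0 with the variable set, that would suggest the hang is not related to topology detection.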

SmorkalovME avatar Sep 07 '18 09:09 SmorkalovME