Intel Caffe multi-node run with MLSL on AMD CPUs stops at Iteration 0
When I run Intel Caffe on multiple nodes (four nodes) with MLSL on AMD CPUs, training stops at Iteration 0. On a single node it runs fine.
When I run htop on every node:
My run command is: ./scripts/run_intelcaffe.sh --hostfile /opt/caffe/mpd.hosts --network tcp --netmask enp3s0f0 --caffe_bin /opt/caffe/build/tools/caffe --solver /opt/caffe/models/intel_optimized_models/multinode/alexnet_4nodes/solver.prototxt
I think something is wrong with MLSL. My MLSL version is:
I suspect MLSL because the run works fine with my own OpenMPI.
Hi @Tron-x, could you please specify how you launch IntelCaffe over OpenMPI? As far as I know, IntelCaffe uses MLSL only for multi-node communication. MLSL uses Intel MPI under the hood but can be rebuilt with OpenMPI support: specify MPIRT = openmpi in the MLSL Makefile.
Hi @mshiryaev, when I use OpenMPI, I launch IntelCaffe with a command such as:
I use five nodes. Every node launches 8 processes with the OpenMP thread count set to 8, and each node has 64 cores.
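As a quick sanity check of that layout, here is a small sketch of the arithmetic. It assumes the OpenMP thread count of 8 applies per process; the variable names are only illustrative and are not part of any Caffe or MPI tooling.

```shell
# Values taken from the description above.
nodes=5
procs_per_node=8
omp_threads=8        # assumption: this is the thread count per MPI process
cores_per_node=64

# Threads competing for cores on one node, and total MPI processes in the job.
threads_per_node=$((procs_per_node * omp_threads))
total_procs=$((nodes * procs_per_node))

echo "threads per node: $threads_per_node (cores per node: $cores_per_node)"
echo "total MPI processes: $total_procs"

# Warn if a node would run more threads than it has cores.
if [ "$threads_per_node" -gt "$cores_per_node" ]; then
  echo "warning: node is oversubscribed" >&2
fi
```

With these numbers, 8 processes times 8 threads fills the 64 cores of a node exactly, so the node is not oversubscribed.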
Hi @Tron-x, besides @mshiryaev's suggestion to build MLSL with OpenMPI, could you please also try setting the environment variable "I_MPI_HYDRA_TOPOLIB=hwloc" to check whether this helps the out-of-the-box MLSL/IntelMPI setup?
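A minimal sketch of that check, reusing the launch command from the original report. It assumes that run_intelcaffe.sh inherits variables exported in the calling shell and passes them on to the MPI launcher, which has not been verified here.

```shell
# Assumption: the launcher script picks up variables exported in this shell.
export I_MPI_HYDRA_TOPOLIB=hwloc

# Same command as in the original report; run it only if the script is present
# relative to the current directory.
if [ -x ./scripts/run_intelcaffe.sh ]; then
  ./scripts/run_intelcaffe.sh --hostfile /opt/caffe/mpd.hosts --network tcp --netmask enp3s0f0 \
    --caffe_bin /opt/caffe/build/tools/caffe \
    --solver /opt/caffe/models/intel_optimized_models/multinode/alexnet_4nodes/solver.prototxt
else
  echo "./scripts/run_intelcaffe.sh not found in the current directory" >&2
fi
```

If training then proceeds past Iteration 0, that would point at topology detection in the default setup rather than at MLSL itself.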