improved-diffusion
Training error related to MPI
System: Ubuntu 18.04.6. I followed these instructions to install Open MPI:
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.1.tar.gz
tar -zxvf openmpi-5.0.1.tar.gz
cd openmpi-5.0.1
./configure --prefix=$HOME/openmpi CC=gcc CXX=g++ --disable-mpi-fortran --disable-mca-dso
make
make install
Then I added two lines at the bottom of ~/.bashrc to update the environment variables:
export PATH=$HOME/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$HOME/openmpi/lib:$LD_LIBRARY_PATH
Then I installed mpi4py:
conda install mpi4py
When I run the train.py script, it fails with this error:
ImportError: libmpi.so.12: cannot open shared object file: No such file or directory
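One way to narrow this down is to list which libmpi versions actually exist in the directories involved and compare them with the libmpi.so.12 the loader asks for. A minimal sketch, assuming the $HOME/openmpi prefix used above and an active conda environment exposed through CONDA_PREFIX; the helper name find_libmpi is hypothetical:

```python
import glob
import os


def find_libmpi(lib_dirs):
    """Return every libmpi.so* file found in the given directories."""
    found = []
    for lib_dir in lib_dirs:
        # A missing directory simply yields no matches.
        found.extend(sorted(glob.glob(os.path.join(lib_dir, "libmpi.so*"))))
    return found


if __name__ == "__main__":
    # Assumed locations: the source build prefix from the configure step
    # and the lib directory of the active conda environment, if any.
    candidates = [os.path.expanduser("~/openmpi/lib")]
    conda_prefix = os.environ.get("CONDA_PREFIX")
    if conda_prefix:
        candidates.append(os.path.join(conda_prefix, "lib"))
    for path in find_libmpi(candidates):
        print(path)
```

If none of the printed files is named libmpi.so.12, the installed mpi4py was presumably built against a different MPI library than the ones present.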
Then I followed a suggestion I found online and installed Open MPI through conda:
conda install openmpi
Now the error is:
--------------------------------------------------------------------------
The value of the MCA parameter "plm_rsh_agent" was set to a path
that could not be found:
plm_rsh_agent: ssh : rsh
Please either unset the parameter, or check that the path is correct
--------------------------------------------------------------------------
[718c7e141fd5:75498] [[INVALID],INVALID] FORCE-TERMINATE AT Not found:-13 - error plm_rsh_component.c(327)
[718c7e141fd5:75498] *** Process received signal ***
[718c7e141fd5:75498] Signal: Segmentation fault (11)
[718c7e141fd5:75498] Signal code: Address not mapped (1)
[718c7e141fd5:75498] Failing at address: (nil)
[718c7e141fd5:75498] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f82eb125980]
[718c7e141fd5:75498] *** End of error message ***
[718c7e141fd5:75416] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 716
[718c7e141fd5:75416] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 172
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_init failed
--> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[718c7e141fd5:75416] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
I have tried many methods but still could not solve the problem. Could anybody help me? Thanks a lot!
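The first message in the log says that the launcher agent plm_rsh_agent (ssh : rsh) could not be found, so a quick check is whether either program is on PATH in this environment. A minimal sketch; the helper names are hypothetical, and clearing the parameter through the OMPI_MCA_plm_rsh_agent environment variable is a commonly suggested workaround for this message, not a confirmed fix for this setup:

```python
import os
import shutil


def missing_launch_agents(agents=("ssh", "rsh")):
    """Return the launcher agents that cannot be found on PATH."""
    return [name for name in agents if shutil.which(name) is None]


def clear_rsh_agent(env):
    """Set the MCA parameter to an empty value in the given environment.

    Assumption: Open MPI reads MCA parameters from OMPI_MCA_<name>
    variables, and an empty value stops it from looking for ssh/rsh.
    """
    env["OMPI_MCA_plm_rsh_agent"] = ""
    return env


if __name__ == "__main__":
    missing = missing_launch_agents()
    print("not found on PATH:", missing)
    if len(missing) == 2:
        # Neither agent exists, which matches the error in the log.
        # This only helps if it runs before mpi4py is imported.
        clear_rsh_agent(os.environ)
```

For this to have any effect during training, the variable would have to be set before train.py imports mpi4py, for example by exporting it in the shell first. Installing an ssh client is the other option the error message points to.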