MPICH with NVIDIA Compilers
Hi Mpich Team,
I have built MPICH with the NVIDIA compilers (nvc, nvc++, nvfortran) on the TACC Vista machine. srun works, but the mpiexec job launcher fails with the following errors. Any suggestions?
i615-001gg$ mpiexec -np 16 -ppn 2 ./namd3_mpi_smp_fftw3 +ppn 71 +pemap 1-71,73-143 +commap 0,72 stmv.namd
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_QmhOmh
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_kYI4Ja
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_7fPRik
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_bjz7BQ
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_LGXVSr
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_4GtuuA
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_ud3CVC
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_uKHjRx
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
Abort(878831119) on node 2: Fatal error in internal_Init_thread: Other MPI error, error stack:
internal_Init_thread(49255)...: MPI_Init_thread(argc=0xfffff342b99c, argv=0xfffff342b990, required=1, provided=0xfffff342b988) failed
MPII_Init_thread(265).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(800).........:
MPIR_Comm_commit_internal(585):
MPID_Comm_commit_pre_hook(151):
MPIDI_world_pre_init(640).....:
MPIDI_UCX_init_world(263).....:
initial_address_exchange(79)..:
MPIDU_bc_table_create(153)....:
MPIR_pmi_allgather_shm(690)...:
get_ex_segs(431)..............: (unknown)(): Other MPI error
Which version of MPICH is this? Could you try the latest release?
It's the latest version, 4.2.3.
Could you add the -v and -l options to mpiexec and upload the console log?
Here is the log file, run.log
The main error is:
[[email protected]] Launch arguments: /usr/bin/srun -N 8 -n 8 --input none --external-launcher /scratch/projects/compilers/nvidia24/mpich/4.2.3_cpu/bin/hydra_pmi_proxy --control-port i615-001.vista.tacc.utexas.edu:45341 --debug --rmk slurm --launcher slurm --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id -1
[proxy:[email protected]] HYDU_create_process (lib/utils/launch.c:73): execvp error on file 1 (No such file or directory)
Could you try the following?
mpiexec -v -np 16 -ppn 2 ./namd3_mpi_smp_fftw3 +ppn 71 +pemap 1-71,73-143 +commap 0,72 stmv.namd
Any update on this ticket?
Sorry for the neglect. Could you try the newest MPICH release, 4.3.0rc1 (https://www.mpich.org/downloads/), and if it still fails, upload the run log?
Hi @hzhou we have seen a similar issue on Vista when working on MVAPICH.
~~From my testing it appears to be related to the --enable-fast=ndebug configure flag. Manually setting --enable-fast=02,alwaysinline instead of --enable-fast=all resolves the issue. However, it comes at the cost of significantly reduced small message intra-node performance so it is not a usable workaround for us.~~
Looks to me like the NVIDIA compiler performs some kind of unwanted optimization that is leading to this issue for us when NDEBUG is defined. Any thoughts on where to look?
Edit: Looks like I had a typo in my configure line and set 02 (zero-two) instead of O2. ndebug was not at fault; the O2 optimizations were. That explains why performance was impacted as well, and it makes much more sense.
Conclusion: the NVC compiler does not compile the uthash macros in hydra correctly, so a key that has been inserted cannot be found afterward. This is likely a compiler issue.
Workaround: build hydra using gcc. NVC works fine for building libmpi.so.
Since we are unlikely to do anything about it, I am closing this issue. If the suggested workaround does not work for you, please re-open the issue.