
MPICH with NVIDIA Compilers

Open aruhela opened this issue 1 year ago • 9 comments

Hi MPICH Team,

I have built MPICH with the NVIDIA compilers (nvc, nvc++, nvfortran) on the TACC Vista machine. srun works, but the mpiexec job launcher fails with the following errors. Any suggestions?

i615-001gg$ mpiexec -np 16 -ppn 2 ./namd3_mpi_smp_fftw3 +ppn 71 +pemap 1-71,73-143 +commap 0,72 stmv.namd
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_QmhOmh
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_kYI4Ja
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_7fPRik
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_bjz7BQ
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_LGXVSr
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_4GtuuA
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_ud3CVC
[proxy:[email protected]] created hwloc xml file /tmp/hydra_hwloc_xmlfile_uKHjRx
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:[email protected]] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
Abort(878831119) on node 2: Fatal error in internal_Init_thread: Other MPI error, error stack:
internal_Init_thread(49255)...: MPI_Init_thread(argc=0xfffff342b99c, argv=0xfffff342b990, required=1, provided=0xfffff342b988) failed
MPII_Init_thread(265).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(800).........:
MPIR_Comm_commit_internal(585):
MPID_Comm_commit_pre_hook(151):
MPIDI_world_pre_init(640).....:
MPIDI_UCX_init_world(263).....:
initial_address_exchange(79)..:
MPIDU_bc_table_create(153)....:
MPIR_pmi_allgather_shm(690)...:
get_ex_segs(431)..............: (unknown)(): Other MPI error

aruhela avatar Oct 18 '24 20:10 aruhela

Which version of MPICH is this? Could you try the latest release?

hzhou avatar Oct 18 '24 23:10 hzhou

It's the latest version, 4.2.3.

aruhela avatar Oct 18 '24 23:10 aruhela

Could you add the -v and -l options to mpiexec and upload the console log?

hzhou avatar Oct 18 '24 23:10 hzhou

Here is the log file: run.log

The main error is:

[[email protected]] Launch arguments: /usr/bin/srun -N 8 -n 8 --input none --external-launcher /scratch/projects/compilers/nvidia24/mpich/4.2.3_cpu/bin/hydra_pmi_proxy --control-port i615-001.vista.tacc.utexas.edu:45341 --debug --rmk slurm --launcher slurm --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id -1
[proxy:[email protected]] HYDU_create_process (lib/utils/launch.c:73): execvp error on file 1 (No such file or directory)

aruhela avatar Oct 19 '24 00:10 aruhela

Could you try this?

mpiexec -v -np 16 -ppn 2 ./namd3_mpi_smp_fftw3 +ppn 71 +pemap 1-71,73-143 +commap 0,72 stmv.namd

hzhou avatar Oct 19 '24 00:10 hzhou

Hui, here is the log.

run2.log

aruhela avatar Oct 19 '24 01:10 aruhela

Any update on this ticket?

aruhela avatar Dec 21 '24 04:12 aruhela

Sorry for the neglect. Could you try the newest MPICH release, 4.3.0rc1 (https://www.mpich.org/downloads/), and if it still fails, upload the run log?

hzhou avatar Dec 21 '24 16:12 hzhou

Hi @hzhou, we have seen a similar issue on Vista when working on MVAPICH.

~~From my testing it appears to be related to the --enable-fast=ndebug configure flag. Manually setting --enable-fast=02,alwaysinline instead of --enable-fast=all resolves the issue. However, it comes at the cost of significantly reduced small message intra-node performance so it is not a usable workaround for us.~~

Looks to me like the NVIDIA compiler performs some kind of unwanted optimization that is leading to this issue for us when NDEBUG is defined. Any thoughts on where to look?

Edit: Looks like I had a typo in my configure and set 02 and not O2. ndebug was not at fault, it was the O2 optimizations. That explains why performance was impacted as well and makes much more sense.

natshineman avatar Jan 17 '25 15:01 natshineman

Hui, here is the error log.

new 14.txt

aruhela avatar Sep 30 '25 12:09 aruhela

Conclusion: the NVC compiler does not compile the uthash macros in hydra correctly, so a key that has just been inserted cannot be found afterward. This is likely a compiler issue.

Workaround: build hydra using gcc. NVC works fine for building libmpi.so.

Since we are unlikely to do anything about it, I am closing this issue. If the suggested workaround does not work for you, please reopen the issue.

hzhou avatar Oct 08 '25 18:10 hzhou
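
For readers landing here, the workaround from the closing comment could look roughly like the sketch below. It was not posted or verified in the thread. It assumes the MPICH source and a standalone Hydra source tree of the same version are unpacked side by side; the directory names and `PREFIX` are hypothetical, and any site-specific configure options (device, Slurm integration, and so on) still need to be added.

```shell
# Hypothetical install prefix; adjust to your site.
PREFIX="$HOME/opt/mpich-nvc"

# 1. Build and install MPICH (including libmpi.so) with the NVIDIA compilers.
cd mpich-4.2.3                      # assumed source directory name
./configure CC=nvc CXX=nvc++ FC=nvfortran --prefix="$PREFIX"
make -j 8
make install

# 2. Build only the Hydra launcher with gcc and install it into the same
#    prefix, so that mpiexec and hydra_pmi_proxy are the gcc-built binaries.
cd ../hydra-4.2.3                   # assumed standalone Hydra source directory
./configure CC=gcc --prefix="$PREFIX"
make -j 8
make install
```

Installing Hydra second is deliberate in this sketch: the later `make install` overwrites the launcher binaries from step 1, while the NVC-built libmpi.so is left in place.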