"double free or corruption" observed running MPICH with Intel Compiler and IntelPython in PATH
Building MPICH with the Intel compiler environment loaded while IntelPython is also activated results in a "double free or corruption" when running some examples - observed with both C and Fortran programs.
To set up the Intel 2023.2.0 environment:
source <path_to_intel_install>/compiler/2023.2.0/env/vars.sh
source <path_to_intel_install>/intelpython/python3.9/env/vars.sh
I am specifically not using <path_to_intel_install>/setvars.sh because it would also load Intel MPI.
The IntelPython environment puts its bundled libfabric on the library path, and MPICH detects it when building. I noticed no change whether or not --with-libfabric=embedded was enabled.
In both cases, some applications will report a "double free or corruption" at exit.
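To help confirm which libfabric copy the loader would pick up, here is a small sketch (the helper name and the path layout are my own, not from any tool) that walks LD_LIBRARY_PATH in search order and prints every libfabric shared object it finds - the first hit is the one the dynamic linker will use:

```shell
# Print, in search order, every libfabric.so* found in the directories of a
# colon-separated path list; the first line printed is the copy the loader
# resolves first.
find_libfabric() {
    echo "$1" | tr ':' '\n' | while read -r dir; do
        if [ -n "$dir" ]; then
            # Glob may not match in a given directory; ignore that case.
            ls "$dir"/libfabric.so* 2>/dev/null || true
        fi
    done
}

find_libfabric "${LD_LIBRARY_PATH:-}"
```

Running this after sourcing the IntelPython vars.sh should show the Intel copy ahead of any other libfabric.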
Reproducer:
$ cat simple.c
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int my_rank;
int processes;

void func2()
{
    int rank = my_rank;
    int total = processes;
    char message[100];
    sprintf(message, "Greetings from process %d!", rank);
    memset(message, 0, 100);
}

void func1()
{
    MPI_Barrier(MPI_COMM_WORLD); // make staggered stop more likely
    sleep(my_rank);              // force staggered stop
    func2();
}

int main(int argc, char** argv)
{
    int numbers[1000];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &processes);
    if (my_rank == 0)
    {
        fprintf(stdout, "Hello World!\n");
        fprintf(stderr, "Upps, something is wrong!\n");
        fprintf(stdout, "Number of arguments %d and first is %s\n", argc, argv[0]);
    }
    int i;
    for (i = 0; i < 1000; i++)
    {
        numbers[i] = i;
    }
    sleep(5);
    func1();
    MPI_Finalize();
    return 0;
}
MPICH Compilation:
$ mpirun --version
HYDRA build details:
Version: 4.0.3
Release Date: Tue Nov 8 09:51:06 CST 2022
CC: icc -m64 -m64
Configure options: '--disable-option-checking' '--prefix=/home/louspe01/.conan/data/mpich/4.0.3/louise/test/package/8ad367ce318146e7032d51425103be5a0064d2ca' '--enable-debug' '--enable-debuginfo' '--enable-shared' 'F90=' '--bindir=${prefix}/bin' '--sbindir=${prefix}/bin' '--libexecdir=${prefix}/bin' '--libdir=${prefix}/lib' '--includedir=${prefix}/include' '--oldincludedir=${prefix}/include' '--datarootdir=${prefix}/share' 'CC=icc' 'CFLAGS=-m64 ' 'LDFLAGS=-m64' 'LIBS=' 'CPPFLAGS= ' 'CXX=icpc' 'CXXFLAGS=-m64 ' 'FC=ifort' 'F77=ifort' '--cache-file=/dev/null' '--srcdir=/home/louspe01/.conan/data/mpich/4.0.3/louise/test/source/mpich-4.0.3/src/pm/hydra'
Process Manager: pmi
Launchers available: ssh rsh fork slurm ll lsf sge manual persist
Topology libraries available: hwloc
Resource management kernels available: user slurm ll lsf sge pbs cobalt
Demux engines available: poll select
Compiling the reproducer:
mpicc -g simple.c -o simple
Running the reproducer:
$ ./simple
Hello World!
Upps, something is wrong!
Number of arguments 1 and first is ./simple
double free or corruption (!prev)
Aborted (core dumped)
Backtrace for the error:
$ gdb ./simple
...
(gdb) r
Starting program: /home/louspe01/repo/forge/test/ddtscripts/base/ddt/offline/simple
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff05ff640 (LWP 1662544)]
[New Thread 0x7fffefdfe640 (LWP 1662545)]
Hello World!
Upps, something is wrong!
Number of arguments 1 and first is /home/louspe01/repo/forge/test/ddtscripts/base/ddt/offline/simple
[Thread 0x7ffff05ff640 (LWP 1662544) exited]
[Thread 0x7fffefdfe640 (LWP 1662545) exited]
double free or corruption (!prev)
Thread 1 "simple" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352271680)
at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352271680)
at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140737352271680) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140737352271680, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007ffff3642476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007ffff36287f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x00007ffff36896f6 in __libc_message
(action=action@entry=do_abort, fmt=fmt@entry=0x7ffff37dbb8c "%s\n")
at ../sysdeps/posix/libc_fatal.c:155
#6 0x00007ffff36a0d7c in malloc_printerr
(str=str@entry=0x7ffff37de7d0 "double free or corruption (!prev)") at ./malloc/malloc.c:5664
#7 0x00007ffff36a2efc in _int_free
(av=0x7ffff3819c80 <main_arena>, p=0x47f850, have_lock=<optimized out>)
at ./malloc/malloc.c:4591
#8 0x00007ffff36a54d3 in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3391
#9 0x00007ffff0e09571 in ofi_cleanup_prov ()
at /home/louspe01/.conan/data/intel_installation/2023.2.0/louise/test/package/4f459a94dbd4c62b669f92843b0daa45ca1e3751/mpi/2021.10.0//libfabric/lib/libfabric.so.1
#10 0x00007ffff0e08dcf in fi_fini ()
at /home/louspe01/.conan/data/intel_installation/2023.2.0/louise/test/package/4f459a94dbd4c62b669f92843b0daa45ca1e3751/mpi/2021.10.0//libfabric/lib/libfabric.so.1
#11 0x00007ffff7fc924e in _dl_fini () at ./elf/dl-fini.c:142
#12 0x00007ffff3645495 in __run_exit_handlers
(status=0, listp=0x7ffff3819838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
#13 0x00007ffff3645610 in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
#14 0x00007ffff3629d97 in __libc_start_call_main
(main=main@entry=0x40261e <main>, argc=argc@entry=1, argv=argv@entry=0x7fffffffa848)
at ../sysdeps/nptl/libc_start_call_main.h:74
#15 0x00007ffff3629e40 in __libc_start_main_impl
(main=0x40261e <main>, argc=1, argv=0x7fffffffa848, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffa838) at ../csu/libc-start.c:392
#16 0x00000000004024c5 in _start ()
We used to see this due to the psm3 provider in libfabric. I believe it has been fixed for a while now. Could you try building the latest version of libfabric, or try building the current MPICH from https://github.com/pmodels/mpich with the embedded libfabric?
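For reference, an embedded-libfabric build of current MPICH would look roughly like this (the install prefix and the exact autotools steps are placeholders I have not verified against the current tree; adjust to your setup):

```shell
# Build current MPICH with its bundled libfabric so the Intel copy on the
# library path cannot be selected at configure time.
git clone https://github.com/pmodels/mpich.git
cd mpich
git submodule update --init --recursive   # pulls embedded libfabric sources
./autogen.sh                              # requires autotools on the host
./configure --prefix="$HOME/mpich-install" \
    --with-libfabric=embedded \
    CC=icc CXX=icpc FC=ifort
make -j && make install
```

With the embedded copy statically chosen at build time, the IntelPython libfabric on LD_LIBRARY_PATH should no longer matter.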