Have valgrind suppression files installed
Hi,
We are compiling and running with an MPICH configured with the "--enable-g=dbg,meminit" option, and we have the valgrind-devel package installed.
Configuring MPICH version 4.2.2 with '--prefix=/opt/mpi/mpich-4.2.2-debug' '--enable-debuginfo' '--enable-g=dbg,meminit' 'CC=/usr/bin/gcc' 'CXX=/usr/bin/g++' 'FC=/usr/bin/gfortran' 'F77=/usr/bin/gfortran' '--with-device=ch3:sock' '--enable-romio'
We saw that there are some valgrind suppression files in the repo, but they are not "installed" when we run "make install".
find . -name "*valg*"
./modules/hwloc/contrib/hwloc-valgrind.supp
./modules/ucx/contrib/valgrind.supp
./src/pm/hydra/modules/hwloc/contrib/hwloc-valgrind.supp
Is this normal?
MPICH finds the valgrind headers:
checking valgrind/valgrind.h usability... yes
checking valgrind/valgrind.h presence... yes
checking for valgrind/valgrind.h... yes
checking valgrind/memcheck.h usability... yes
checking valgrind/memcheck.h presence... yes
checking for valgrind/memcheck.h... yes
checking helgrind.h usability... no
checking helgrind.h presence... no
checking for helgrind.h... no
checking valgrind/helgrind.h usability... yes
checking valgrind/helgrind.h presence... yes
checking for valgrind/helgrind.h... yes
checking drd.h usability... no
checking drd.h presence... no
checking for drd.h... no
checking valgrind/drd.h usability... yes
checking valgrind/drd.h presence... yes
checking for valgrind/drd.h... yes
checking whether the valgrind headers are broken or too old... no
I also found this in src/pm/hydra/modules/hwloc/Makefile.in:
# Only install the valgrind suppressions file if we're building in
# standalone mode
@HWLOC_BUILD_STANDALONE_TRUE@dist_pkgdata_DATA = contrib/hwloc-valgrind.supp
all: all-recursive
What is "standalone mode"?
Thanks,
Eric
(I asked for this in Oct. 2024: https://lists.mpich.org/pipermail/discuss/2024-October/006701.html).
Thanks for creating the issue. MPICH itself does not have a valgrind suppression file - we may have had one at some point, but I don't see any now. The ones you found come from hwloc and ucx respectively; both are third-party packages, but MPICH has the option to build them as embedded copies if they are not available from the "system". We don't install or expose anything when they are built as an internal dependency -- referred to as "embedded" mode. If you install hwloc or ucx separately, that is referred to as "standalone" mode. MPICH will skip the embedded modules if they are found on the system during configure.
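To make the two modes concrete, here is a sketch of how the choice is made at configure time (the exact `--with-hwloc`/`--with-ucx` syntax may vary between MPICH releases; check `./configure --help` for your version):

```shell
# Embedded mode: MPICH builds its bundled hwloc/ucx as internal
# dependencies; nothing from them (including the .supp files) is
# installed or exposed by "make install".
./configure --with-hwloc=embedded --with-ucx=embedded ...

# Standalone mode: hwloc/ucx are installed separately and MPICH links
# against them; their own "make install" is what ships the .supp files.
./configure --with-hwloc=/opt/hwloc --with-ucx=/opt/ucx ...
```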
Thanks @hzhou.
My motivation for this request was the presence of a valgrind suppression file maintained by the PETSc folks:
https://gitlab.com/petsc/petsc/-/commits/5d8720fa41fb4169420198de95a3fb9ffc339d07/share/petsc/suppressions/valgrind
but as of Feb 2025, all of the content related to MPICH has been removed... and it never made its way into the MPICH repo...
However, we still have to add some suppressions in our case, which may be due to our "bad" usage of MPICH/ch4:ofi + Valgrind.
Here are the suppressions we added:
{
<ucs_config_sscanf_string_strdup_memtrack_c_381>
Memcheck:Leak
match-leak-kinds: reachable
fun:malloc
fun:strdup
fun:ucs_strdup
fun:ucs_config_sscanf_string
fun:ucs_config_parser_parse_field
fun:ucs_config_parser_set_default_values
fun:ucs_config_parser_fill_opts
fun:ucs_global_opts_init
fun:ucs_init
}
{
<ucs_load_modules_ucs_module_loader_add_dl_dir_module_c_101>
Memcheck:Leak
match-leak-kinds: reachable
fun:malloc
fun:ucs_malloc
fun:ucs_module_loader_add_dl_dir
fun:ucs_module_loader_init_paths
fun:ucs_load_modules
fun:call_init
fun:call_init
}
{
<ucs_config_sscanf_array_parser_c_827>
Memcheck:Leak
match-leak-kinds: reachable
fun:calloc
fun:ucs_calloc
fun:ucs_config_sscanf_array
fun:ucs_config_parser_parse_field
fun:ucs_config_parser_set_default_values
fun:ucs_config_parser_fill_opts
fun:ucs_global_opts_init
fun:ucs_init
}
{
<ucs_config_sscanf_string_parser_c_81>
Memcheck:Leak
match-leak-kinds: reachable
fun:malloc
fun:strdup
fun:ucs_strdup
fun:ucs_config_sscanf_string
fun:ucs_config_sscanf_array
fun:ucs_config_parser_parse_field
fun:ucs_config_parser_set_default_values
fun:ucs_config_parser_fill_opts
fun:ucs_global_opts_init
fun:ucs_init
fun:call_init
fun:call_init
fun:_dl_init
}
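For the record, here is a sketch of how we feed these suppressions to Valgrind (the file name, rank count, and binary name are illustrative):

```shell
# Save the four suppression blocks above as ucx.supp, then run each
# rank under Valgrind; %p puts the PID in each log file name so the
# ranks don't overwrite each other's output.
mpiexec -n 2 valgrind \
    --leak-check=full \
    --show-leak-kinds=all \
    --suppressions=ucx.supp \
    --log-file=vg.%p.log \
    ./my_app
```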
So I am wondering what should be done: a) add a suppression file, or b) add documentation on which configure options to use (and which to avoid) when using valgrind/sanitizers...
or maybe both?
Please see my request on the mailing list (https://lists.mpich.org/pipermail/discuss/2025-October/006767.html), which I copy here for convenience:
Hi,
I have been building MPICH with the following configure options for a long time, mainly to keep my code “Valgrind-clean”:
./configure \
  --enable-g=dbg,meminit \
  --with-device=ch3:sock \
  --enable-romio

This setup worked reasonably well in the past, but recently I’ve been seeing occasional errors with AddressSanitizer or Valgrind (with 4.3.0 on a single node), such as:
Fatal error in internal_Allreduce_c: Unknown error class, error stack:
internal_Allreduce_c(347)...................: MPI_Allreduce_c(sendbuf=0x7ffdeb0b8e90, recvbuf=0x7ffdeb0b8e98, count=1, dtype=0x4c00083a, MPI_SUM, comm=0x84000003) failed
MPIR_Allreduce_impl(4826)...................:
MPIR_Allreduce_allcomm_auto(4732)...........:
MPIR_Allreduce_intra_recursive_doubling(115):
MPIC_Sendrecv(266)..........................:
MPIC_Wait(90)...............................:
MPIR_Wait(751)..............................:
MPIR_Wait_state(708)........................:
MPIDI_CH3i_Progress_wait(187)...............: an error occurred while handling an event returned by MPIDI_CH3I_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(385)..:
MPIDI_CH3I_Socki_handle_read(3647)..........: connection failure (set=0,sock=1,errno=104:Connection reset by peer)

Is CH3 considered legacy?
I would like to also ask:
What are the recommended configure options in 2025 for building MPICH in a way that works well with Valgrind?
Is it preferable now to move to CH4 (e.g. ch4:ofi or ch4:shm) when debugging with Valgrind?
Are there any other options (besides --enable-g=dbg,meminit) that you would suggest for catching memory errors while keeping Valgrind reports as clean as possible?
Is https://github.com/pmodels/mpich/blob/main/doc/wiki/design/Support_for_Debugging_Memory_Allocation.md up-to-date?
Any guidance on the “best practice” configuration for this use case would be greatly appreciated.
The PETSc folks have some debug-related options (https://gitlab.com/petsc/petsc/-/blob/main/config/BuildSystem/config/packages/MPICH.py#L94) but still use CH3 by default. However, Satish uses the configuration described above, at least for the valgrind CI.
Thanks a lot,
Eric
Thanks!
Eric
It depends on what noise you are getting with Valgrind. For example, with ucx, I think you can grab the suppression file from the ucx distribution. If you don't really care about which device to use, I suggest using --with-device=ch4:ofi. You can set FI_PROVIDER=sockets to force libfabric to the sockets provider, which I believe is valgrind-clean.
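Putting the suggestion above together, a minimal sketch (the binary name and rank count are illustrative; trailing configure options elided):

```shell
# Build with the recommended ch4:ofi device:
./configure --with-device=ch4:ofi --enable-g=dbg,meminit ...

# At run time, force libfabric onto the sockets provider via the
# FI_PROVIDER environment variable before launching under Valgrind:
FI_PROVIDER=sockets mpiexec -n 2 valgrind --leak-check=full ./my_app
```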
CH3 is considered legacy, but we still support it because if you run on some old system that lacks libfabric or ucx support, ch3 may be your only choice.
Yes, we recommend using ch4.
The wiki document you linked is mostly "up-to-date" (only because we haven't updated the memory debugging facility for a while). However, the environment variables listed there seem to be out of date. You can find a more accurate list directly in the code: https://github.com/pmodels/mpich/blob/4fea2de829450738c639c85608b60151113c8cfc/src/mpl/src/mem/mpl_trmem.c#L195-L217
(A note to myself: get that doc fixed).
Thank you for your answer @hzhou.
I have 2 additional questions. You wrote:

"You can set FI_PROVIDER=sockets to force libfabric to the sockets provider, which I believe is valgrind-clean."

- I have seen there is also FI_PROVIDER=shm; do you think it is also valgrind-clean? (...because in our case, the CI always runs on a single node...)
- In fact, what configurations does the MPICH CI use with valgrind and/or -fsanitize=address? (maybe it would be helpful to point to those in the wiki...)
Thanks a lot! :)
We have not tested with the shm provider extensively, so I don't know. The proof is in the pudding.
We currently use -fsanitize=address in our CI testing.