
Have valgrind suppression files installed

ericch1 opened this issue 5 months ago • 5 comments

Hi,

We are compiling and running with an MPICH configured with the "--enable-g=dbg,meminit" option, and we have installed the valgrind-devel package.

Configuring MPICH version 4.2.2 with  '--prefix=/opt/mpi/mpich-4.2.2-debug' '--enable-debuginfo' '--enable-g=dbg,meminit' 'CC=/usr/bin/gcc' 'CXX=/usr/bin/g++' 'FC=/usr/bin/gfortran' 'F77=/usr/bin/gfortran' '--with-device=ch3:sock' '--enable-romio'

We saw that there are some valgrind suppression files in the repo, but they are not installed when we run "make install".

find . -name "*valg*"
./modules/hwloc/contrib/hwloc-valgrind.supp
./modules/ucx/contrib/valgrind.supp
./src/pm/hydra/modules/hwloc/contrib/hwloc-valgrind.supp

Is this normal?

MPICH finds the valgrind headers:

checking valgrind/valgrind.h usability... yes
checking valgrind/valgrind.h presence... yes
checking for valgrind/valgrind.h... yes
checking valgrind/memcheck.h usability... yes
checking valgrind/memcheck.h presence... yes
checking for valgrind/memcheck.h... yes
checking helgrind.h usability... no
checking helgrind.h presence... no
checking for helgrind.h... no
checking valgrind/helgrind.h usability... yes
checking valgrind/helgrind.h presence... yes
checking for valgrind/helgrind.h... yes
checking drd.h usability... no
checking drd.h presence... no
checking for drd.h... no
checking valgrind/drd.h usability... yes
checking valgrind/drd.h presence... yes
checking for valgrind/drd.h... yes
checking whether the valgrind headers are broken or too old... no

I also found this in src/pm/hydra/modules/hwloc/Makefile.in:

# Only install the valgrind suppressions file if we're building in
# standalone mode
@HWLOC_BUILD_STANDALONE_TRUE@dist_pkgdata_DATA = contrib/hwloc-valgrind.supp
all: all-recursive

What is this "standalone mode"?

Thanks,

Eric

(I asked for this in Oct. 2024: https://lists.mpich.org/pipermail/discuss/2024-October/006701.html).

ericch1 avatar Oct 03 '25 19:10 ericch1

Thanks for creating the issue. MPICH itself does not have a valgrind suppression file -- we may have had one in the past, but I don't see any now. The ones you found are from hwloc and ucx respectively, both third-party projects. MPICH has the option to build them as embedded modules when they are not available from the "system". We don't install or expose anything when they are built as an internal dependency -- that is referred to as the "embedded" mode. If you install hwloc or ucx separately, they are used in what is referred to as the "standalone" mode. MPICH will skip the embedded modules if they are found in the system during configure.
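In other words, one way to get the hwloc suppression file installed is to build hwloc standalone and point MPICH at it. A rough sketch, assuming the usual autotools flow (the prefixes are illustrative, and the exact `--with-hwloc` spelling should be checked against `./configure --help` for your MPICH version):

```shell
# Build hwloc standalone; its own "make install" ships
# hwloc-valgrind.supp under <prefix>/share/hwloc.
cd hwloc-src && ./configure --prefix=/opt/hwloc && make && make install

# Then configure MPICH against that external hwloc instead of the
# embedded copy (option name assumed; verify with ./configure --help):
cd mpich-src && ./configure --prefix=/opt/mpi/mpich-debug \
    --with-hwloc=/opt/hwloc \
    --enable-g=dbg,meminit
```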

hzhou avatar Oct 09 '25 03:10 hzhou

Thanks @hzhou.

My motivation for this request was the presence of a valgrind suppression file maintained by the PETSc folks:

https://gitlab.com/petsc/petsc/-/commits/5d8720fa41fb4169420198de95a3fb9ffc339d07/share/petsc/suppressions/valgrind

but as of Feb 2025, all of the MPICH-related content has been removed from it... and it never made its way into the MPICH repo...

However, we still have to add some suppressions in our case, which may be due to our "bad" usage of MPICH/ch4:ofi with Valgrind.

Here are the suppressions we added:

{ 
   <ucs_config_sscanf_string_strdup_memtrack_c_381>
   Memcheck:Leak
   match-leak-kinds: reachable
   fun:malloc
   fun:strdup
   fun:ucs_strdup
   fun:ucs_config_sscanf_string
   fun:ucs_config_parser_parse_field
   fun:ucs_config_parser_set_default_values
   fun:ucs_config_parser_fill_opts
   fun:ucs_global_opts_init
   fun:ucs_init
} 
{ 
   <ucs_load_modules_ucs_module_loader_add_dl_dir_module_c_101>
   Memcheck:Leak
   match-leak-kinds: reachable
   fun:malloc
   fun:ucs_malloc
   fun:ucs_module_loader_add_dl_dir
   fun:ucs_module_loader_init_paths
   fun:ucs_load_modules
   fun:call_init
   fun:call_init
}  
{  
   <ucs_config_sscanf_array_parser_c_827>
   Memcheck:Leak
   match-leak-kinds: reachable
   fun:calloc
   fun:ucs_calloc
   fun:ucs_config_sscanf_array
   fun:ucs_config_parser_parse_field
   fun:ucs_config_parser_set_default_values
   fun:ucs_config_parser_fill_opts
   fun:ucs_global_opts_init
   fun:ucs_init
}  
{  
   <ucs_config_sscanf_string_parser_c_81>
   Memcheck:Leak
   match-leak-kinds: reachable
   fun:malloc
   fun:strdup
   fun:ucs_strdup
   fun:ucs_config_sscanf_string
   fun:ucs_config_sscanf_array
   fun:ucs_config_parser_parse_field
   fun:ucs_config_parser_set_default_values
   fun:ucs_config_parser_fill_opts
   fun:ucs_global_opts_init
   fun:ucs_init
   fun:call_init
   fun:call_init
   fun:_dl_init
}  

So I am wondering what should be done: a) add a suppression file, or b) add documentation on which configuration options to use (and not to use) when using valgrind/sanitizers...

or maybe both?

Please see my request on the mailing list (https://lists.mpich.org/pipermail/discuss/2025-October/006767.html), which I copy here for convenience:

Hi,

I have been building MPICH with the following configure options for a long time, mainly to keep my code “Valgrind-clean”:

./configure \
  --enable-g=dbg,meminit \
  --with-device=ch3:sock \
  --enable-romio

This setup worked reasonably well in the past, but recently I've been seeing occasional errors with AddressSanitizer or valgrind (with 4.3.0 on a single node), such as:


Fatal error in internal_Allreduce_c: Unknown error class, error stack:

internal_Allreduce_c(347)...................: MPI_Allreduce_c(sendbuf=0x7ffdeb0b8e90, recvbuf=0x7ffdeb0b8e98, count=1, dtype=0x4c00083a, MPI_SUM, comm=0x84000003) failed
MPIR_Allreduce_impl(4826)...................: 
MPIR_Allreduce_allcomm_auto(4732)...........: 
MPIR_Allreduce_intra_recursive_doubling(115): 
MPIC_Sendrecv(266)..........................: 
MPIC_Wait(90)...............................: 
MPIR_Wait(751)..............................: 
MPIR_Wait_state(708)........................: 
MPIDI_CH3i_Progress_wait(187)...............: an error occurred while handling an event returned by MPIDI_CH3I_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(385)..: 
MPIDI_CH3I_Socki_handle_read(3647)..........: connection failure (set=0,sock=1,errno=104:Connection reset by peer)

Is CH3 considered legacy?

I would like to also ask:

  1. What are the recommended configure options in 2025 for building MPICH in a way that works well with Valgrind?

  2. Is it preferable now to move to CH4 (e.g. ch4:ofi or ch4:shm) when debugging with Valgrind?

  3. Are there any other options (besides --enable-g=dbg,meminit) that you would suggest for catching memory errors while keeping Valgrind reports as clean as possible?

  4. Is https://github.com/pmodels/mpich/blob/main/doc/wiki/design/Support_for_Debugging_Memory_Allocation.md up-to-date?

Any guidance on the “best practice” configuration for this use case would be greatly appreciated.

The PETSc folks have some debug-related options (https://gitlab.com/petsc/petsc/-/blob/main/config/BuildSystem/config/packages/MPICH.py#L94) but still use CH3 by default. However, Satish uses the configuration described above, at least for the valgrind CI.

Thanks a lot,

Eric

Thanks!

Eric

ericch1 avatar Oct 09 '25 16:10 ericch1

It depends on what noise you are getting with Valgrind. For ucx, for example, I think you can grab the suppression file from the ucx distribution. If you don't really care about which device to use, I suggest --with-device=ch4:ofi. You can set FI_PROVIDER=sockets to force libfabric to use the sockets provider, which I believe is valgrind-clean.
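That is, with a ch4:ofi build you would select the provider at run time via the environment; a sketch (rank count and application name are illustrative):

```shell
# Force libfabric to the sockets provider for this run
# of an MPICH build configured with --with-device=ch4:ofi.
FI_PROVIDER=sockets mpiexec -n 4 ./my_mpi_app
```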

CH3 is considered legacy, but we still support it because, if you run on an old system that lacks libfabric or ucx support, ch3 may be your only choice.

Yes, we recommend using ch4.

The wiki document you linked is mostly "up-to-date" (only because we haven't updated the memory debugging facility in a while). The environment variables listed there seem to be out of date, though; you can find a more accurate list directly in the code: https://github.com/pmodels/mpich/blob/4fea2de829450738c639c85608b60151113c8cfc/src/mpl/src/mem/mpl_trmem.c#L195-L217

(A note to myself: get that doc fixed).

hzhou avatar Oct 09 '25 20:10 hzhou

Thank you for your answer @hzhou .

I have 2 additional questions:

You can set FI_PROVIDER=sockets to force libfabric to the sockets provider, which I believe it is valgrind clean.

  1. I have seen there is also FI_PROVIDER=shm; do you think it is also valgrind-clean? (...because in our case, the CI always runs on a single node...)

  2. In fact, which configurations does the MPICH CI use with valgrind and/or -fsanitize=address?
    (maybe it would be helpful to point to that into the wiki...)

Thanks a lot! :)

ericch1 avatar Oct 09 '25 21:10 ericch1

We have not tested with the shm provider extensively, so I don't know. The proof is in the pudding.

We currently use -fsanitize=address in our CI testing.

hzhou avatar Oct 10 '25 14:10 hzhou