"free(): double free detected in tcache 2" error from parallel HDF5 MPI-IO using one-sided ROMIO aggregation
Using the MPICH build 'mpich/20231026/icc-all-pmix-gpu' on sunspot, I am seeing the following error: free(): double free detected in tcache 2. This happens with the HDF5 h5bench exerciser benchmark, which uses collective MPI-IO for the backend. To get this error I need to do one-sided aggregation, which in turn needs the lustre file system driver to be forced. Both are specified with the following env vars:
ROMIO_FSTYPE_FORCE=lustre:
ROMIO_WRITE_AGGMETHOD=2
ROMIO_READ_AGGMETHOD=2
This will do ROMIO one-sided aggregation using a derived type to transfer the data to the collective buffer. If I additionally specify this env var:
ROMIO_ONESIDED_ALWAYS_RMW=1
The error goes away. This additional setting tells ROMIO to do a read-modify-write for every collective buffer aggregation. HDF5 does a lot of read-modify-write anyway, but maybe not for every call, so this setting is probably resulting in more reads. Looking at the one-sided code, though, I can't see anything that would explain why this avoids the double free; maybe it is a timing issue? Also, if I set:
ROMIO_WRITE_AGGMETHOD=1
ROMIO_READ_AGGMETHOD=1
then no derived type is used for the one-sided aggregation; instead, multiple MPI_Put / MPI_Get calls are used, one for each contiguous chunk of data. With this setting the double free goes away, but I get data corruption in the HDF5 file instead. I have seen data corruption before using this benchmark with just the regular GEN aggregation in previous MPICH builds, and that went away with this build, so I suspect there is a broader issue in the messaging layer that this ROMIO code uses, as opposed to an issue with the one-sided aggregation code itself. To reproduce on sunspot:
Start interactive job: qsub -lwalltime=60:00 -lselect=1 -A Aurora_deployment -q workq -I
cd /lus/gila/projects/Aurora_deployment/pkcoff/tarurundir
module unload mpich/icc-all-pmix-gpu/52.2
module use /soft/preview-modulefiles/24.086.0
module load mpich/20231026/icc-all-pmix-gpu
export ROMIO_FSTYPE_FORCE=lustre:
export ROMIO_WRITE_AGGMETHOD=2
export ROMIO_READ_AGGMETHOD=2
export ROMIO_HINTS=/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/romio_hints
export MPIR_CVAR_ENABLE_GPU=1
export MPIR_CVAR_BCAST_POSIX_INTRA_ALGORITHM=mpir
export MPIR_CVAR_ALLREDUCE_POSIX_INTRA_ALGORITHM=mpir
export MPIR_CVAR_BARRIER_POSIX_INTRA_ALGORITHM=mpir
export MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM=mpir
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_CH4_POSIX_COLL_SELECTION_TUNING_JSON_FILE
export LD_LIBRARY_PATH=/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir:/soft/datascience/aurora_nre_models_frameworks-2024.0/lib/
export FI_PROVIDER=cxi
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_CQ_FILL_PERCENT=20
export FI_MR_CACHE_MONITOR=disabled
export FI_CXI_OVFLOW_BUF_SIZE=8388608
LD_PRELOAD=/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/libdarshan.so:/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/libhdf5.so:/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/libpnetcdf.so mpiexec -np 16 -ppn 16 --cpu-bind=verbose,list:4:56:5:57:6:58:7:59:8:60:9:61:10:62:11:63 --no-vni -envall -genvall /soft/tools/mpi_wrapper_utils/gpu_tile_compact.sh ./hdf5Exerciser --numdims 3 --minels 128 128 128 --nsizes 1 --bufmult 2 2 2 --metacoll --addattr --usechunked --maxcheck 100000 --fileblocks 128 128 128 --filestrides 128 128 128 --memstride 128 --memblock 128
You should see this:
free(): double free detected in tcache 2
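To compare the three configurations described above (derived-type aggregation, derived-type aggregation with forced read-modify-write, and per-chunk MPI_Put / MPI_Get), a small shell helper along these lines may be convenient. This is only a sketch: the function name set_romio_onesided_mode and the mode names derived, derived-rmw and contig are made up here for illustration and are not part of MPICH or ROMIO. Only the env vars and their values come from this report.

```shell
#!/bin/sh
# Hypothetical helper (not part of MPICH/ROMIO): selects one of the three
# one-sided aggregation configurations discussed in this report.
#   derived      AGGMETHOD=2             -> double free reported above
#   derived-rmw  AGGMETHOD=2 + ALWAYS_RMW -> error goes away
#   contig       AGGMETHOD=1             -> no double free, but data corruption
set_romio_onesided_mode() {
    # Validate the mode first so a bad argument changes nothing.
    case "$1" in
        derived)     _romio_agg=2; _romio_rmw= ;;
        derived-rmw) _romio_agg=2; _romio_rmw=1 ;;
        contig)      _romio_agg=1; _romio_rmw= ;;
        *)
            echo "usage: set_romio_onesided_mode derived|derived-rmw|contig" >&2
            return 1
            ;;
    esac

    # Common to all three cases: force the lustre driver.
    export ROMIO_FSTYPE_FORCE=lustre:
    export ROMIO_WRITE_AGGMETHOD="$_romio_agg"
    export ROMIO_READ_AGGMETHOD="$_romio_agg"

    # Set ALWAYS_RMW only for the derived-rmw case, and clear any stale value
    # left over from a previous run otherwise.
    if [ -n "$_romio_rmw" ]; then
        export ROMIO_ONESIDED_ALWAYS_RMW=1
    else
        unset ROMIO_ONESIDED_ALWAYS_RMW
    fi
}

# Example: select the failing configuration, then show what is set.
set_romio_onesided_mode derived
env | grep '^ROMIO_' | sort
```

After sourcing this and picking a mode, the rest of the reproduction steps (the MPIR_CVAR_*, FI_* and LD_LIBRARY_PATH exports and the mpiexec line) stay the same, so only the aggregation settings differ between runs.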