
Possible heap-use-after-free reported by AddressSanitizer inside PMPI_Init call

Open dqwu opened this issue 2 years ago • 3 comments

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

spack installation

Please describe the system on which you are running

  • Operating system/version: CentOS 8
  • Computer hardware: AMD Epyc 7532 processors (32 cores per CPU, 2.4 GHz)
  • Network type: N.A.

Details of the problem

This issue occurs on Chrysalis, a machine used by E3SM (e3sm.org): https://e3sm.org/model/running-e3sm/supported-machines/chrysalis-anl

Modules used: gcc/9.2.0-ugetvbp openmpi/4.1.3-sxfyy4k

A minimal MPI program is built with GCC's AddressSanitizer support:

module load gcc/9.2.0-ugetvbp openmpi/4.1.3-sxfyy4k

cat <<EOF > test_mpi.c
#include <mpi.h>
int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);
  MPI_Finalize();

  return 0;
}
EOF

mpicc -fsanitize=address -static-libasan test_mpi.c

The executable built above is launched with srun from a Slurm job. AddressSanitizer reports the following error:

=================================================================
==268787==ERROR: AddressSanitizer: heap-use-after-free on address 0x61900001f480 at pc 0x000000448c68 bp 0x7fffffffb4e0 sp 0x7fffffffac90
READ of size 2 at 0x61900001f480 thread T0
    #0 0x448c67 in __interceptor_strlen /tmp/svcbuilder/spack-stage-gcc-9.2.0-ugetvbp5jl5kgy7jwjloyf73vnhhw7db/spack-src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:354
    #1 0x155552d0b9a4 in opal_pmix_base_partial_commit_packed base/pmix_base_fns.c:405
    #2 0x155552d0d423 in s2_put /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/opal/mca/pmix/s2/pmix_s2.c:548
    #3 0x155555029640 in mca_pml_base_pml_selected base/pml_base_select.c:323
    #4 0x155555029640 in mca_pml_base_pml_selected base/pml_base_select.c:318
    #5 0x155555029c90 in mca_pml_base_select base/pml_base_select.c:284
    #6 0x1555550736ac in ompi_mpi_init runtime/ompi_mpi_init.c:647
    #7 0x155554ea80de in PMPI_Init /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mpi/c/profile/pinit.c:67
    #8 0x528ced in main (/gpfs/fs1/home/wuda/ASAN/a.out+0x528ced)
    #9 0x155553e92492 in __libc_start_main (/usr/lib64/libc.so.6+0x23492)
    #10 0x4069bd in _start (/gpfs/fs1/home/wuda/ASAN/a.out+0x4069bd)

0x61900001f480 is located 0 bytes inside of 1036-byte region [0x61900001f480,0x61900001f88c)
freed by thread T0 here:
    #0 0x4eb7ce in __interceptor_realloc /tmp/svcbuilder/spack-stage-gcc-9.2.0-ugetvbp5jl5kgy7jwjloyf73vnhhw7db/spack-src/libsanitizer/asan/asan_malloc_linux.cc:163
    #1 0x155552d0b994 in opal_pmix_base_partial_commit_packed base/pmix_base_fns.c:404

previously allocated by thread T0 here:
    #0 0x4eb5ae in __interceptor_calloc /tmp/svcbuilder/spack-stage-gcc-9.2.0-ugetvbp5jl5kgy7jwjloyf73vnhhw7db/spack-src/libsanitizer/asan/asan_malloc_linux.cc:153
    #1 0x155552d0a604 in pmi_encode base/pmix_base_fns.c:705

SUMMARY: AddressSanitizer: heap-use-after-free /tmp/svcbuilder/spack-stage-gcc-9.2.0-ugetvbp5jl5kgy7jwjloyf73vnhhw7db/spack-src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:354 in __interceptor_strlen
Shadow bytes around the buggy address:
  0x0c327fffbe40: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c327fffbe50: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c327fffbe60: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c327fffbe70: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c327fffbe80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c327fffbe90:[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c327fffbea0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c327fffbeb0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c327fffbec0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c327fffbed0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c327fffbee0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==268787==ABORTING
srun: error: chr-0502: task 0: Exited with exit code 1

Comment

It is unclear whether this issue is still reproducible with the latest Open MPI 5.0, because the function opal_pmix_base_partial_commit_packed that appears in the stack trace was removed by PR #7202.

dqwu, May 23 '22 16:05

This seems similar to #10415

awlauria, May 23 '22 16:05

Which version of PMIx are you using?

jsquyres, May 23 '22 20:05

> Which version of PMIx are you using?

Slurm reports pmi2, not PMIx. Open MPI was built with --with-pmi=/usr, not --with-pmix=...

@chrlogin1:chrys$ srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: cray_shasta

dqwu, May 23 '22 20:05