hdf5 testphdf5 segfaults on exit
Thank you for taking the time to submit an issue!
Background information
Testing out Open MPI 5 for Fedora. The hdf5 1.12.1 parallel test (testphdf5) segfaults on exit.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
5.0.0-rc6
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Updated Fedora package from https://copr.fedorainfracloud.org/coprs/orion/openmpi5
Please describe the system on which you are running
- Operating system/version: Fedora Rawhide
- Computer hardware: x86_64
- Network type: single node
Details of the problem
shell$ mpirun -n 1 testpar/testphdf5
===================================
PHDF5 tests finished with no errors
===================================
[Thread 0x7ffff67c6640 (LWP 332) exited]
Thread 1 "testphdf5" received signal SIGSEGV, Segmentation fault.
0x0000000000000110 in ?? ()
#0 0x0000000000000110 in ?? ()
#1 0x00007ffff7354cc9 in opal_obj_run_destructors (object=0x7ffff5fb5940) at ../opal/class/opal_object.h:472
#2 opal_free_list_destruct (fl=0x7ffff7414710 <mca_btl_tcp_component+784>) at class/opal_free_list.c:96
#3 0x00007ffff73ba249 in opal_obj_run_destructors (object=<optimized out>) at ../opal/class/opal_object.h:472
#4 mca_btl_tcp_component_close () at mca/btl/tcp/btl_tcp_component.c:474
#5 0x00007ffff7377dcd in mca_base_component_close (component=0x7ffff7414400 <mca_btl_tcp_component>, output_id=-1) at mca/base/mca_base_components_close.c:52
#6 0x00007ffff7377ebd in mca_base_components_close (output_id=<optimized out>, components=0x7ffff7413810 <opal_btl_base_framework+80>, skip=0x0) at mca/base/mca_base_components_close.c:89
#7 0x00007ffff739f7b8 in mca_btl_base_close () at mca/btl/base/btl_base_frame.c:231
#8 0x00007ffff7383694 in mca_base_framework_close (framework=0x7ffff74137c0 <opal_btl_base_framework>) at mca/base/mca_base_framework.c:250
#9 0x00007ffff76fc75c in mca_bml_base_close () at mca/bml/base/bml_base_frame.c:130
#10 mca_bml_base_close () at mca/bml/base/bml_base_frame.c:120
#11 0x00007ffff7383694 in mca_base_framework_close (framework=0x7ffff788dca0 <ompi_bml_base_framework>) at mca/base/mca_base_framework.c:250
#12 0x00007ffff769db2a in ompi_mpi_instance_finalize_common () at instance/instance.c:894
#13 0x00007ffff769dcad in ompi_mpi_instance_finalize (instance=0x7ffff78a4448 <ompi_mpi_instance_default>) at instance/instance.c:924
#14 0x00007ffff768a794 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:294
#15 0x000055555555b661 in ?? ()
#16 0x00007ffff7448550 in __libc_start_call_main () from /lib64/libc.so.6
#17 0x00007ffff7448609 in __libc_start_main_impl () from /lib64/libc.so.6
#18 0x000055555555be85 in ?? ()
#1 0x00007ffff7354cc9 in opal_obj_run_destructors (object=0x7ffff5fb5940) at ../opal/class/opal_object.h:472
472 (*cls_destruct)(object);
467
468 assert(NULL != object->obj_class);
469
470 cls_destruct = object->obj_class->cls_destruct_array;
471 while (NULL != *cls_destruct) {
472 (*cls_destruct)(object);
473 cls_destruct++;
474 }
475 }
476
print object
$1 = (opal_object_t *) 0x7ffff5fb5940
print *object
$2 = {obj_class = 0x7ffff7413f60 <mca_btl_tcp_frag_eager_t_class>, obj_reference_count = 1}
print *object->obj_class
$3 = {cls_name = 0x7ffff73db967 "mca_btl_tcp_frag_eager_t", cls_parent = 0x7ffff7415600 <mca_btl_base_descriptor_t_class>, cls_construct = 0x7ffff73bba90 <mca_btl_tcp_frag_eager_constructor>, cls_destruct = 0x0, cls_initialized = 1,
cls_depth = 4, cls_construct_array = 0x5555556f03d0, cls_destruct_array = 0x5555556f03f0, cls_sizeof = 304}
#2 opal_free_list_destruct (fl=0x7ffff7414710 <mca_btl_tcp_component+784>) at class/opal_free_list.c:96
96 OBJ_DESTRUCT(fl_item);
91 fl_item = (opal_free_list_item_t *) item;
92
93 /* destruct the item (we constructed it), the underlying memory will be
94 * reclaimed when we free the slab (opal_free_list_memory_t ptr)
95 * containing it */
96 OBJ_DESTRUCT(fl_item);
97 }
98
99 while (NULL != (item = opal_list_remove_first(&fl->fl_allocations))) {
100 opal_free_list_allocation_release(fl, (opal_free_list_memory_t *) item);
valgrind output: valgrind.txt
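In case it helps to isolate things, here is a minimal sketch (hypothetical, not taken from the HDF5 suite; the file name and contents are made up) that exercises MPI-IO the way HDF5 does underneath and then finalizes, to check whether the segfault needs HDF5 at all or already shows up with plain ompio plus the tcp BTL:

/* minimal_finalize_check.c -- hypothetical reproducer, not from the HDF5 suite */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    int rank;
    char buf[16] = "finalize check\n";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One collective open/write/close is enough to pull in ompio and the
     * btl/tcp free lists that appear in the backtrace above. */
    MPI_File_open(MPI_COMM_WORLD, "finalize_check.tmp",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, (MPI_Offset) rank * (MPI_Offset) sizeof(buf), buf,
                      (int) sizeof(buf), MPI_CHAR, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    if (rank == 0) {
        printf("MPI-IO done, entering MPI_Finalize\n");
    }
    MPI_Finalize();   /* the reported crash happens during finalize */
    return 0;
}

shell$ mpicc minimal_finalize_check.c -o finalize_check
shell$ mpirun -n 1 ./finalize_check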
@edgargabriel can you take a look? There seem to be some double-frees in the finalize of ompio based on the provided valgrind file.
@awlauria may I ask how you came to the conclusion that the problem stems from ompio? I looked over the valgrind file and could not find an indication of that. This doesn't mean the problem is not there (or that it's not due to something in ompio); I just would like to understand what points to ompio here.
As a side note, I am debugging a very similar-looking problem with the accelerator framework as well. Without having identified the root cause, it seems to me that component_close is called multiple times on some frameworks/components; I am not sure what the reason for this could be.
You might be right that it is a more general issue in Finalize; I was basing it on the valgrind file. Personally I haven't seen this outside of the ompio component, but I could be wrong. I was thinking it may be an issue with how the ompio component is closed; maybe it's being closed twice somehow?
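To illustrate that hypothesis (purely a toy analogy in plain C, not Open MPI code): if an object's NULL-terminated destructor table, like cls_destruct_array in opal_object_t, is walked a second time after its memory has been reclaimed or overwritten, the walk calls through whatever garbage is there, which is exactly how a frame like "0x0000000000000110 in ?? ()" ends up on top of opal_obj_run_destructors.

/* toy_double_destruct.c -- toy analogy only, not Open MPI code */
#include <stdint.h>
#include <stdlib.h>

typedef void (*destruct_fn)(void *);

typedef struct {
    destruct_fn *destruct_array;   /* NULL-terminated, like cls_destruct_array */
} toy_class;

typedef struct {
    toy_class *cls;
} toy_object;

static void toy_destruct(void *obj) { (void) obj; }

/* Mirrors the shape of opal_obj_run_destructors(): walk the array until NULL. */
static void run_destructors(toy_object *obj)
{
    for (destruct_fn *d = obj->cls->destruct_array; NULL != *d; d++) {
        (*d)(obj);
    }
}

int main(void)
{
    destruct_fn *arr = malloc(2 * sizeof(*arr));
    arr[0] = toy_destruct;
    arr[1] = NULL;

    toy_class cls = { arr };
    toy_object obj = { &cls };

    run_destructors(&obj);                    /* first close: fine */

    /* Simulate the table being reused/corrupted before a second close;
     * 0x110 mirrors the faulting address in the backtrace above. */
    arr[0] = (destruct_fn) (uintptr_t) 0x110;
    run_destructors(&obj);                    /* second close: SIGSEGV at 0x110 */

    free(arr);
    return 0;
}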
I do not see this issue anymore with the current 5.0 and main branches; can we close it?
Closing. Can always be reopened if necessary.