
hdf5 testphdf5 segfaults on exit

Open opoplawski opened this issue 3 years ago • 3 comments


Background information

Testing out Open MPI 5 for Fedora. The hdf5 1.12.1 test suite fails.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

5.0.0-rc6

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Updated Fedora package from https://copr.fedorainfracloud.org/coprs/orion/openmpi5

Please describe the system on which you are running

  • Operating system/version: Fedora Rawhide
  • Computer hardware: x86_64
  • Network type: single node

Details of the problem

shell$ mpirun -n 1 testpar/testphdf5
===================================
PHDF5 tests finished with no errors
===================================
[Thread 0x7ffff67c6640 (LWP 332) exited]

Thread 1 "testphdf5" received signal SIGSEGV, Segmentation fault.
0x0000000000000110 in ?? ()
#0  0x0000000000000110 in ?? ()
#1  0x00007ffff7354cc9 in opal_obj_run_destructors (object=0x7ffff5fb5940) at ../opal/class/opal_object.h:472
#2  opal_free_list_destruct (fl=0x7ffff7414710 <mca_btl_tcp_component+784>) at class/opal_free_list.c:96
#3  0x00007ffff73ba249 in opal_obj_run_destructors (object=<optimized out>) at ../opal/class/opal_object.h:472
#4  mca_btl_tcp_component_close () at mca/btl/tcp/btl_tcp_component.c:474
#5  0x00007ffff7377dcd in mca_base_component_close (component=0x7ffff7414400 <mca_btl_tcp_component>, output_id=-1) at mca/base/mca_base_components_close.c:52
#6  0x00007ffff7377ebd in mca_base_components_close (output_id=<optimized out>, components=0x7ffff7413810 <opal_btl_base_framework+80>, skip=0x0) at mca/base/mca_base_components_close.c:89
#7  0x00007ffff739f7b8 in mca_btl_base_close () at mca/btl/base/btl_base_frame.c:231
#8  0x00007ffff7383694 in mca_base_framework_close (framework=0x7ffff74137c0 <opal_btl_base_framework>) at mca/base/mca_base_framework.c:250
#9  0x00007ffff76fc75c in mca_bml_base_close () at mca/bml/base/bml_base_frame.c:130
#10 mca_bml_base_close () at mca/bml/base/bml_base_frame.c:120
#11 0x00007ffff7383694 in mca_base_framework_close (framework=0x7ffff788dca0 <ompi_bml_base_framework>) at mca/base/mca_base_framework.c:250
#12 0x00007ffff769db2a in ompi_mpi_instance_finalize_common () at instance/instance.c:894
#13 0x00007ffff769dcad in ompi_mpi_instance_finalize (instance=0x7ffff78a4448 <ompi_mpi_instance_default>) at instance/instance.c:924
#14 0x00007ffff768a794 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:294
#15 0x000055555555b661 in ?? ()
#16 0x00007ffff7448550 in __libc_start_call_main () from /lib64/libc.so.6
#17 0x00007ffff7448609 in __libc_start_main_impl () from /lib64/libc.so.6
#18 0x000055555555be85 in ?? ()

#1  0x00007ffff7354cc9 in opal_obj_run_destructors (object=0x7ffff5fb5940) at ../opal/class/opal_object.h:472
472             (*cls_destruct)(object);
467
468         assert(NULL != object->obj_class);
469
470         cls_destruct = object->obj_class->cls_destruct_array;
471         while (NULL != *cls_destruct) {
472             (*cls_destruct)(object);
473             cls_destruct++;
474         }
475   }
476
print object
 $1 = (opal_object_t *) 0x7ffff5fb5940
print *object
$2 = {obj_class = 0x7ffff7413f60 <mca_btl_tcp_frag_eager_t_class>, obj_reference_count = 1}
print *object->obj_class
$3 = {cls_name = 0x7ffff73db967 "mca_btl_tcp_frag_eager_t", cls_parent = 0x7ffff7415600 <mca_btl_base_descriptor_t_class>, cls_construct = 0x7ffff73bba90 <mca_btl_tcp_frag_eager_constructor>, cls_destruct = 0x0, cls_initialized = 1, 
  cls_depth = 4, cls_construct_array = 0x5555556f03d0, cls_destruct_array = 0x5555556f03f0, cls_sizeof = 304}

#2  opal_free_list_destruct (fl=0x7ffff7414710 <mca_btl_tcp_component+784>) at class/opal_free_list.c:96
96              OBJ_DESTRUCT(fl_item);
91              fl_item = (opal_free_list_item_t *) item;
92
93              /* destruct the item (we constructed it), the underlying memory will be
94               * reclaimed when we free the slab (opal_free_list_memory_t ptr)
95               * containing it */
96              OBJ_DESTRUCT(fl_item);
97          }
98
99          while (NULL != (item = opal_list_remove_first(&fl->fl_allocations))) {
100             opal_free_list_allocation_release(fl, (opal_free_list_memory_t *) item);

valgrind output: valgrind.txt

opoplawski avatar May 08 '22 00:05 opoplawski

@edgargabriel can you take a look? There seem to be some double-frees in the finalize of ompio, based on the provided valgrind file.

awlauria avatar May 09 '22 14:05 awlauria

@awlauria may I ask how you came to the conclusion that the problem stems from ompio? I looked over the valgrind file and could not find an indication of that. This doesn't mean the problem isn't there (or that it's not due to something in ompio); I just would like to understand what points to ompio here.

As a side note, I am debugging a very similar-looking problem in the accelerator framework as well. Without having identified the root cause, it seems to me that component_close is called multiple times on some frameworks/components; I am not sure what the reason for this could be.

edgargabriel avatar Jun 08 '22 13:06 edgargabriel

You might be right that it is a more general issue in Finalize; I was basing my conclusion on the valgrind file. Personally I haven't seen this outside of the ompio component, but I could be wrong. I was thinking it may be an issue with how the ompio component is closed; maybe it's being closed twice somehow?

awlauria avatar Jun 08 '22 13:06 awlauria

I no longer see this issue with the current 5.0 and main branches; can we close it?

edgargabriel avatar Dec 31 '22 16:12 edgargabriel

Closing. Can always be reopened if necessary.

edgargabriel avatar Jan 14 '23 15:01 edgargabriel