When a GPU cannot initialize (OOM), the per-stream infos need cleanup
If I simulate being unable to allocate memory on the device, both for data and for streams, I get the following stack:
#5 0x00007ffff7e57995 in parsec_list_destruct (list=0x7ffff7fbd2a0 <parsec_per_stream_infos+64>)
at /home/bosilca/unstable/parsec/parsec/parsec/class/parsec_list.c:45
#6 0x00007ffff7e5bdaa in parsec_obj_run_destructors (object=0x7ffff7fbd2a0 <parsec_per_stream_infos+64>)
at /home/bosilca/unstable/parsec/parsec/parsec/class/parsec_object.h:446
#7 0x00007ffff7e5c102 in parsec_info_destructor (obj=0x7ffff7fbd260 <parsec_per_stream_infos>)
at /home/bosilca/unstable/parsec/parsec/parsec/class/info.c:34
#8 0x00007ffff7eb0ceb in parsec_obj_run_destructors (object=0x7ffff7fbd260 <parsec_per_stream_infos>)
at /home/bosilca/unstable/parsec/parsec/parsec/class/parsec_object.h:446
#9 0x00007ffff7eb35bd in parsec_mca_device_fini () at /home/bosilca/unstable/parsec/parsec/parsec/mca/device/device.c:572
#10 0x00007ffff7e764d0 in parsec_fini (pcontext=0x7fffffff49a0) at /home/bosilca/unstable/parsec/parsec/parsec/parsec.c:1235
#11 0x000000000040374f in main (argc=1, argv=0x7fffffff4b38)
at /home/bosilca/unstable/parsec/parsec/tests/dsl/dtd/dtd_test_allreduce.c:237
The issue seems to occur during the release of parsec_per_stream_infos, because there are still infos registered inside. The CUDA code itself actually behaves well: the devices that fail to allocate memory are removed, and the execution proceeds without them.
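To illustrate the suspected pattern, here is a minimal, self-contained sketch in C. It is not PaRSEC code: the registry type and every `toy_*` name are hypothetical stand-ins, and the assumption (not confirmed above) is that a device registers its per-stream info before its allocation fails and then leaves that registration behind on the error path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_INFOS 8

/* Hypothetical stand-in for a registry such as parsec_per_stream_infos. */
typedef struct {
    const char *names[TOY_MAX_INFOS];  /* NULL marks a free slot */
    size_t      count;                 /* number of live registrations */
} toy_info_registry_t;

/* Register an info; returns its slot id, or -1 if the registry is full. */
int toy_info_register(toy_info_registry_t *reg, const char *name)
{
    for (int i = 0; i < TOY_MAX_INFOS; i++) {
        if (NULL == reg->names[i]) {
            reg->names[i] = name;
            reg->count++;
            return i;
        }
    }
    return -1;
}

/* Unregister an info by id; returns false if the id is not registered. */
bool toy_info_unregister(toy_info_registry_t *reg, int id)
{
    if (id < 0 || id >= TOY_MAX_INFOS || NULL == reg->names[id]) {
        return false;
    }
    reg->names[id] = NULL;
    reg->count--;
    return true;
}

/* The registry can only be torn down safely once nothing is registered. */
bool toy_registry_is_empty(const toy_info_registry_t *reg)
{
    return 0 == reg->count;
}

/* Simulated device init. alloc_ok == false models the OOM case.
 * cleanup_on_failure selects whether the error path undoes the
 * registration (assumed fix) or leaks it (assumed current behavior).
 * Returns the info id on success, -1 on failure. */
int toy_device_init(toy_info_registry_t *reg, bool alloc_ok,
                    bool cleanup_on_failure)
{
    assert(NULL != reg);
    int id = toy_info_register(reg, "toy::per_stream_info");
    if (id < 0) {
        return -1;
    }
    if (!alloc_ok) {
        if (cleanup_on_failure) {
            toy_info_unregister(reg, id);
        }
        return -1;
    }
    return id;
}
```

With `cleanup_on_failure` set to false, a failed `toy_device_init` leaves one entry in the registry, which is the state the destructor in the backtrace appears to trip over. With it set to true, the registry is empty again by the time the teardown runs.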
Originally posted by @bosilca in https://github.com/ICLDisco/parsec/issues/630#issuecomment-1922851237