Segfault due to infinite recursion when profiling numpy with MALT

Open bernarda78 opened this issue 1 year ago • 4 comments

Hi,

I've encountered a bug where profiling a Python script that uses numpy with MALT leads to a segmentation fault. This issue arose while profiling a scientific application and seems to be directly related to the numpy package.

I've confirmed that recompiling numpy from source does not resolve the issue.

This problem is specific to Red Hat Enterprise Linux 9.4 (Plow). Running the same pipeline on Debian GNU/Linux 12 (Bookworm) or CentOS Linux release 7.9.2009 (Core) does not result in a segmentation fault.


The following command consistently triggers the segfault (note the use of -X dev for debugging purposes):

$HOME/.local/bin/malt -v -- python -X dev -c 'import numpy; print(numpy.__path__); print("done")'

Here is the command output after the segmentation fault:

MALT: Start memory instrumentation of python3.12 - 1855960 by library override.
Fatal Python error: Segmentation fault

Current thread 0x00007f84f5066780 (most recent call first):
  File "<frozen importlib._bootstrap>", line 488 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 1289 in create_module
  File "<frozen importlib._bootstrap>", line 813 in module_from_spec
  File "<frozen importlib._bootstrap>", line 921 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1331 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1360 in _find_and_load
  File "/pbs/software/redhat-9-x86_64/python/3.12.2/lib/python3.12/site-packages/numpy/core/overrides.py", line 8 in <module>
  File "<frozen importlib._bootstrap>", line 488 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 995 in exec_module
  File "<frozen importlib._bootstrap>", line 935 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1331 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1360 in _find_and_load
  File "<frozen importlib._bootstrap>", line 488 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1415 in _handle_fromlist
  File "/pbs/software/redhat-9-x86_64/python/3.12.2/lib/python3.12/site-packages/numpy/core/multiarray.py", line 10 in <module>
  File "<frozen importlib._bootstrap>", line 488 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 995 in exec_module
  File "<frozen importlib._bootstrap>", line 935 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1331 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1360 in _find_and_load
  File "<frozen importlib._bootstrap>", line 488 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1415 in _handle_fromlist
  File "/pbs/software/redhat-9-x86_64/python/3.12.2/lib/python3.12/site-packages/numpy/core/__init__.py", line 24 in <module>
  File "<frozen importlib._bootstrap>", line 488 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 995 in exec_module
  File "<frozen importlib._bootstrap>", line 935 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1331 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1360 in _find_and_load
  File "<frozen importlib._bootstrap>", line 488 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1310 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1360 in _find_and_load
  File "/pbs/software/redhat-9-x86_64/python/3.12.2/lib/python3.12/site-packages/numpy/__config__.py", line 4 in <module>
  File "<frozen importlib._bootstrap>", line 488 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 995 in exec_module
  File "<frozen importlib._bootstrap>", line 935 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1331 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1360 in _find_and_load
  File "/pbs/software/redhat-9-x86_64/python/3.12.2/lib/python3.12/site-packages/numpy/__init__.py", line 130 in <module>
  File "<frozen importlib._bootstrap>", line 488 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 995 in exec_module
  File "<frozen importlib._bootstrap>", line 935 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1331 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1360 in _find_and_load
  File "<string>", line 1 in <module>
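
For anyone comparing machines, the check can be automated. The sketch below is a hypothetical helper (`died_from_segfault` is not part of MALT) that runs a command and reports whether it ended with SIGSEGV. It assumes a POSIX system. It accepts both return-code conventions, because a launcher script that does not `exec` its child would report 128 + signal number instead of the negative signal number.

```python
import signal
import subprocess


def died_from_segfault(cmd):
    """Run cmd (a list of arguments) and report whether it ended with SIGSEGV."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    code = result.returncode
    # subprocess reports death by signal N as returncode -N; a shell that
    # waited on a child killed by signal N exits with 128 + N instead.
    return code in (-signal.SIGSEGV, 128 + signal.SIGSEGV)


# Example call (assumes `malt` is on PATH, which may differ on your system):
# died_from_segfault(["malt", "-v", "--", "python", "-X", "dev", "-c", "import numpy"])
```

With `-X dev`, Python's fault handler prints the traceback and then lets the signal terminate the process, so the return code should still reflect SIGSEGV.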

The following stack trace, obtained with gdb, shows the first 14 frames from the core dump of the above command:

(gdb) bt 14
#0  0x00007f3d87a8b94c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007f3d87a3e646 in raise () from /lib64/libc.so.6
#2  <signal handler called>
#3  0x00007f3d887d52b1 in free (ptr=0x0) at ../include/rtld-malloc.h:50
#4  _dl_update_slotinfo (req_modid=1, new_gen=2) at ../elf/dl-tls.c:828
#5  0x00007f3d887d537c in update_get_addr (ti=0x7f3d884e0e08, gen=<optimized out>) at ../elf/dl-tls.c:922
#6  0x00007f3d887d864c in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#7  0x00007f3d884cea1a in MALT::malt_wrap_free (ptr=0x0, real_free=@0x7f3d887c00f8: 0x7f3d87a99ce0 <free>, retaddr=<optimized out>)
    at /home/user/git/malt_fork/src/lib/wrapper/AllocWrapper.cpp:563
#8  0x00007f3d887d52b7 in free (ptr=<optimized out>) at ../include/rtld-malloc.h:50
#9  _dl_update_slotinfo (req_modid=1, new_gen=2) at ../elf/dl-tls.c:828
#10 0x00007f3d887d537c in update_get_addr (ti=0x7f3d884e0e08, gen=<optimized out>) at ../elf/dl-tls.c:922
#11 0x00007f3d887d864c in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#12 0x00007f3d884cea1a in MALT::malt_wrap_free (ptr=0x0, real_free=@0x7f3d887c00f8: 0x7f3d87a99ce0 <free>, retaddr=<optimized out>)
    at /home/user/git/malt_fork/src/lib/wrapper/AllocWrapper.cpp:563
#13 0x00007f3d887d52b7 in free (ptr=<optimized out>) at ../include/rtld-malloc.h:50
...

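The pattern in frames 3 to 13 repeats approximately 210,000 times, which suggests infinite recursion exhausting the stack: free() enters MALT::malt_wrap_free, whose thread-local access goes through __tls_get_addr, which calls free() again from _dl_update_slotinfo.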
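
The repeating pattern can also be extracted mechanically from a saved gdb backtrace. This is a minimal sketch, assuming the frame lines look like the ones above. The helper names `frame_functions` and `shortest_cycle` are made up for illustration.

```python
import re

# Matches gdb frame lines with or without an address, for example
# "#3  0x00007f3d887d52b1 in free (ptr=0x0) at ..." and
# "#4  _dl_update_slotinfo (req_modid=1, new_gen=2) at ...".
# Lines such as "#2  <signal handler called>" do not match and are skipped.
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-f]+ in )?(\S+) \(")


def frame_functions(backtrace_text):
    """Return the function name of each matching gdb frame line, in order."""
    names = []
    for line in backtrace_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


def shortest_cycle(names):
    """Return the shortest repeating prefix of names, or None if there is none."""
    for period in range(1, len(names) // 2 + 1):
        if all(names[i] == names[i + period] for i in range(len(names) - period)):
            return names[:period]
    return None
```

Calling `shortest_cycle` on the names starting at the first `free` frame returns the functions that make up one turn of the recursion.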

Please let me know if you need further information. Thanks!

bernarda78, Aug 28 '24 08:08

I will take a look. Thanks for reporting, and sorry for the long delay in answering; I was unavailable for the last two months.

svalat, Sep 13 '24 16:09

Hi,

I tried again today, and it worked without a problem this time. I think there was a software update on the server I was working on. As a side effect, I can no longer open the core dump...

Looking back at my notes, I saw that I had saved the bottom of the backtrace, up to where the infinite recursion starts: backtrace_stack_malt-python-numpy.txt, in case you think it is still worth looking into.

Thanks!

bernarda78, Sep 19 '24 13:09

Hi @bernarda78, thanks for the info. It looks like a problem with the TLS system, which seems strange to me. I will try to investigate.

The full trace is welcome, thanks.

svalat, Sep 20 '24 08:09

Thinking about it more, this could also be due to memory corruption. It is strange that it tries to call free(NULL) at this stage.
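
For readers following the backtrace, here is a toy Python model of the call cycle it shows. It is only an illustration of how a wrapper that touches thread-local state can re-enter itself, and of how a reentrancy guard would break the loop. It is not MALT code, and the guard is an assumed mitigation rather than the fix that was eventually applied. The class `FreeWrapperModel` and its members are hypothetical names. In a real C/C++ wrapper, the guard flag would itself have to live somewhere that does not trigger lazy TLS allocation.

```python
class FreeWrapperModel:
    """Toy model of the free -> wrapper -> TLS lookup -> free cycle."""

    def __init__(self, guarded):
        self.guarded = guarded
        self.in_wrapper = False  # stand-in for a flag that needs no lazy TLS setup
        self.calls = 0

    def free(self, ptr):
        # Models free() being overridden by the profiler's wrapper.
        self.calls += 1
        if self.guarded and self.in_wrapper:
            # Nested call: skip instrumentation. A real guard would forward
            # straight to the real free() here.
            return
        self.in_wrapper = True
        try:
            self._tls_lookup()  # the wrapper touches thread-local state
        finally:
            self.in_wrapper = False

    def _tls_lookup(self):
        # Models __tls_get_addr -> _dl_update_slotinfo, which calls free(NULL).
        self.free(None)
```

Without the guard, `FreeWrapperModel(guarded=False).free(None)` recurses until Python raises `RecursionError`, the analogue of the stack overflow in the core dump. With `guarded=True`, the nested call returns immediately.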

svalat, Sep 20 '24 08:09

Fixed while rewriting the wrapping mechanism for Python support.

svalat, May 23 '25 13:05