
MemoryError after exceeding OpenMP/OpenBLAS thread limit

Open lingfeiwang opened this issue 2 years ago • 3 comments

Hello. I use threadpoolctl 2.2.0, which works well most of the time. However, after the OpenMP or OpenBLAS thread limit was exceeded, threadpoolctl stopped working. It did not recover even after the processes that exceeded the limit had been killed, nor after waiting for quite some time. The full error message from a simple example is shown below. Is there any way to reset threadpoolctl so that it works again without rebooting the computer?

Python 3.9.5 (default, Jun  4 2021, 12:28:51) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.24.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from threadpoolctl import threadpool_limits
   ...: with threadpool_limits(limits=1):
   ...:     a=1
   ...: 
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-1-2121fc2c928d> in <module>
      1 from threadpoolctl import threadpool_limits
----> 2 with threadpool_limits(limits=1):
      3     a=1
      4 
~/.local/lib/python3.9/site-packages/threadpoolctl.py in __init__(self, limits, user_api)
    169             self._check_params(limits, user_api)
    170 
--> 171         self._original_info = self._set_threadpool_limits()
    172 
    173     def __enter__(self):

~/.local/lib/python3.9/site-packages/threadpoolctl.py in _set_threadpool_limits(self)
    266             return None
    267 
--> 268         modules = _ThreadpoolInfo(prefixes=self._prefixes,
    269                                   user_api=self._user_api)
    270         for module in modules:

~/.local/lib/python3.9/site-packages/threadpoolctl.py in __init__(self, user_api, prefixes, modules)
    338 
    339             self.modules = []
--> 340             self._load_modules()
    341             self._warn_if_incompatible_openmp()
    342         else:

~/.local/lib/python3.9/site-packages/threadpoolctl.py in _load_modules(self)
    373             self._find_modules_with_enum_process_module_ex()
    374         else:
--> 375             self._find_modules_with_dl_iterate_phdr()
    376 
    377     def _find_modules_with_dl_iterate_phdr(self):

~/.local/lib/python3.9/site-packages/threadpoolctl.py in _find_modules_with_dl_iterate_phdr(self)
    404             ctypes.c_int,  # Return type
    405             ctypes.POINTER(_dl_phdr_info), ctypes.c_size_t, ctypes.c_char_p)
--> 406         c_match_module_callback = c_func_signature(match_module_callback)
    407 
    408         data = ctypes.c_char_p(b"")

MemoryError: 

lingfeiwang, Sep 21 '21 00:09
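
The traceback above ends at the line where a Python function is wrapped as a C callback through ctypes. The following is a minimal sketch, not taken from threadpoolctl, for checking whether that kind of call also fails outside the library. The helper name `can_create_ctypes_callback` is hypothetical, and `ctypes.c_void_p` stands in for the library's private `_dl_phdr_info` pointer type, so this only approximates the failing call.

```python
import ctypes


def can_create_ctypes_callback():
    """Return True if ctypes can still wrap a Python function as a C callback.

    Hypothetical diagnostic helper, not part of threadpoolctl.
    """
    # Same shape as the signature in the traceback, with c_void_p standing in
    # for threadpoolctl's private POINTER(_dl_phdr_info) argument type.
    signature = ctypes.CFUNCTYPE(
        ctypes.c_int,  # Return type
        ctypes.c_void_p, ctypes.c_size_t, ctypes.c_char_p)
    try:
        callback = signature(lambda info, size, data: 0)
    except MemoryError:
        return False
    # Calling the wrapper once confirms that it is usable.
    return callback(None, 0, b"") == 0
```

If this returns `False` in a fresh interpreter on the affected machine, the problem would appear to sit below threadpoolctl, in ctypes or the system itself, rather than in the library's own logic.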

Hi @lingfeiwang, I'm not sure I understand how you triggered that. Could you describe in more detail the steps that led to this broken state?

jeremiedbb, Oct 01 '21 09:10

Actually, I did not expect this to happen at all, so I did not record the steps to reproduce the error, or the error log from OpenMP or OpenBLAS. Briefly, I ran some computation in too many parallel processes, each of which used OpenMP or OpenBLAS, possibly through numpy/scipy. Together they exceeded a certain limit, maybe one set by the kernel, and the related error lines were reported. I then killed those processes and everything seemed to have recovered, except threadpoolctl, as I discovered later.

I understand this is very uninformative, but trying to reproduce it on a shared computing server would be damaging. I don't know how rare this error is, but I guess computing servers all over the planet are constantly being tortured like this. For me, a reboot solved the issue, but someone else might follow up on this thread with more details another day.

lingfeiwang, Oct 08 '21 03:10
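
The comment above mentions a limit "maybe set by the kernel" without identifying it. As a rough sketch for recording the relevant state if this happens again, the snippet below reads the per-user process/thread limit with the standard `resource` module (Unix only). That `RLIMIT_NPROC` is the limit that was hit is an assumption, and the helper name `describe_thread_limits` is hypothetical.

```python
import resource
import threading


def describe_thread_limits():
    """Collect the per-user process/thread limit and the Python thread count.

    Hypothetical helper; assumes RLIMIT_NPROC is the limit of interest.
    """
    # On Linux, RLIMIT_NPROC counts threads as well as processes.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "nproc_soft": soft,
        "nproc_hard": hard,
        # Threads started by this Python process only, not native pool threads.
        "python_threads": threading.active_count(),
    }
```

A value equal to `resource.RLIM_INFINITY` means no limit is set for that entry.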

Thanks for the feedback. It might indeed be a bug in the Linux kernel or the OpenMP runtime relying on an incorrectly updated stateful attribute of the system. If that ever happens again, it would be interesting to start a post-mortem pdb session to introspect the values of the match_module_callback signature. I do not understand how a MemoryError can possibly be raised on this line...

ogrisel, Oct 08 '21 09:10
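
The post-mortem pdb session suggested above could be started along these lines. This is only a sketch: the wrapper `run_with_post_mortem` and its `debugger` parameter are hypothetical names introduced here, and only `pdb.post_mortem` comes from the standard library.

```python
import pdb


def run_with_post_mortem(func, debugger=pdb.post_mortem):
    """Call func(); on MemoryError, open a post-mortem debugger, then re-raise.

    Hypothetical wrapper; `debugger` is injectable so it can be replaced
    in non-interactive runs.
    """
    try:
        return func()
    except MemoryError as exc:
        # The debugger starts in the frame where the exception was raised.
        debugger(exc.__traceback__)
        raise
```

On the affected machine, something like `run_with_post_mortem(lambda: threadpool_limits(limits=1))` should then open pdb in the innermost frame of the traceback, where `c_func_signature` and `match_module_callback` can be inspected with `p` or `pp`. In IPython, running `%debug` right after the failure gives a similar session.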