
pytest tests/test_rf_array.py crash in _GLOBAL__sub_I_IpcFabricConfigClient.cpp

Open albertz opened this issue 1 year ago • 6 comments

File IpcFabricConfigClient.cpp.

$ gdb --args python3 -m pytest tests/test_rf_array.py
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
...
Reading symbols from python3...
(gdb) r
Starting program: /work/tools/users/zeyer/linuxbrew/bin/python3 -m pytest tests/test_rf_array.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
================================================================== test session starts ==================================================================
platform linux -- Python 3.11.2, pytest-7.3.1, pluggy-1.0.0
rootdir: /u/zeyer/code/returnn
configfile: pytest.ini
collecting ... [Detaching after vfork from child process 2130305]
[Detaching after vfork from child process 2130306]
[Detaching after vfork from child process 2130307]
warning: File "/u/zeyer/.linuxbrew-homefs/Cellar/gcc/12.2.0/lib/gcc/current/libstdc++.so.6.0.30-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
        add-auto-load-safe-path /u/zeyer/.linuxbrew-homefs/Cellar/gcc/12.2.0/lib/gcc/current/libstdc++.so.6.0.30-gdb.py
line to your configuration file "/u/zeyer/.config/gdb/gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "/u/zeyer/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
[Detaching after vfork from child process 2130356]
[Detaching after vfork from child process 2130357]
[Detaching after vfork from child process 2130360]

Program received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
44      pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007ffff77ececf in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007ffff77a2ea2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007ffff778e45c in __GI_abort () at abort.c:79
#4  0x00007fffeefd48d9 in ?? () from /work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6
#5  0x00007fffeefdff0a in ?? () from /work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6
#6  0x00007fffeefdff75 in std::terminate() () from /work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6
#7  0x00007fffeefe01c7 in __cxa_throw () from /work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6
#8  0x00007fffeefd7253 in std::__throw_runtime_error(char const*) () from /work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6
#9  0x00007fffef00acc3 in std::random_device::_M_getval() () from /work/tools/users/zeyer/linuxbrew/opt/gcc/lib/gcc/current/libstdc++.so.6
#10 0x00007fff5f04169f in _GLOBAL__sub_I_IpcFabricConfigClient.cpp () from /u/zeyer/.local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007ffff7fcad6e in call_init (env=0x4cdc7e0, argv=0x7fffffffd668, argc=4, l=<optimized out>) at dl-init.c:70
#12 call_init (l=<optimized out>, argc=4, argv=0x7fffffffd668, env=0x4cdc7e0) at dl-init.c:26
#13 0x00007ffff7fcae54 in _dl_init (main_map=0x4fb9c80, argc=4, argv=0x7fffffffd668, env=0x4cdc7e0) at dl-init.c:117
#14 0x00007ffff78ae4f5 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:182
#15 0x00007ffff7fd1b76 in dl_open_worker (a=0x7fffffff9570) at dl-open.c:808
#16 dl_open_worker (a=a@entry=0x7fffffff9570) at dl-open.c:771
#17 0x00007ffff78ae4a9 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:208
#18 0x00007ffff7fd1d3a in _dl_open (file=0x7fffb5b1f9d0 "/u/zeyer/.local/lib/python3.11/site-packages/torch/_C.cpython-311-x86_64-linux-gnu.so", 
    mode=<optimized out>, caller_dlopen=0x7ffff7cf8a5f <_PyImport_FindSharedFuncptr+127>, nsid=-2, argc=4, argv=0x7fffffffd668, env=0x4cdc7e0)
    at dl-open.c:883
#19 0x00007ffff77e75f8 in dlopen_doit (a=a@entry=0x7fffffff97e0) at dlopen.c:56
#20 0x00007ffff78ae4a9 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffff9740, operate=<optimized out>, args=<optimized out>)
    at dl-error-skeleton.c:208
#21 0x00007ffff78ae54f in __GI__dl_catch_error (objname=0x7fffffff97a0, errstring=0x7fffffff97a8, mallocedp=0x7fffffff979f, operate=<optimized out>, 
    args=<optimized out>) at dl-error-skeleton.c:227
#22 0x00007ffff77e7176 in _dlerror_run (operate=operate@entry=0x7ffff77e75a0 <dlopen_doit>, args=args@entry=0x7fffffff97e0) at dlerror.c:138
#23 0x00007ffff77e7676 in dlopen_implementation (dl_caller=<optimized out>, mode=<optimized out>, file=<optimized out>) at dlopen.c:71
#24 ___dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:81
#25 0x00007ffff7cf8a5f in _PyImport_FindSharedFuncptr () from /work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0
#26 0x00007ffff7ce6a10 in _PyImport_LoadDynamicModuleWithSpec () from /work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0
#27 0x00007ffff7ce66b8 in _imp_create_dynamic () from /work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0
#28 0x00007ffff7c3d13e in cfunction_vectorcall_FASTCALL () from /work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0
#29 0x00007ffff7c6962d in _PyEval_EvalFrameDefault () from /work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0
...
#124 0x00007ffff7cf8cf8 in Py_RunMain () from /work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0
#125 0x00007ffff7cf8ab9 in Py_BytesMain () from /work/tools/users/zeyer/linuxbrew/lib/libpython3.11.so.1.0
#126 0x00007ffff778f1b7 in __libc_start_call_main (main=main@entry=0x401040 <main>, argc=argc@entry=4, argv=argv@entry=0x7fffffffd668) at ../sysdeps/nptl/libc_start_call_main.h:58
#127 0x00007ffff778f26c in __libc_start_main_impl (main=0x401040 <main>, argc=4, argv=0x7fffffffd668, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd658) at ../csu/libc-start.c:392
#128 0x0000000000401071 in _start () at ../sysdeps/x86_64/start.S:115

I'm posting this because I have not found such a stack trace on Google at all, nor any stack trace related to IpcFabricConfigClient.

Note that it imports TensorFlow first, and then PyTorch. Adding -s to the pytest flags makes this visible. That also changes the behavior: then it does not crash anymore but hangs instead. This is the same behavior as when I simply do:

import tensorflow
import torch
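
For the hang variant, here is a small sketch (my own, not from the test; the 30s timeout is arbitrary) that dumps the Python stacks after a timeout, which at least confirms which of the two imports gets stuck. It will not show the native frames inside dlopen, though.

import faulthandler

# Dump all Python thread stacks and exit if we are still running after 30s.
faulthandler.dump_traceback_later(30, exit=True)

import tensorflow  # noqa: F401
import torch       # noqa: F401

print("both imports succeeded")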

albertz avatar May 25 '23 13:05 albertz

I notice that it uses the libstdc++.so.6 from my Linuxbrew/Homebrew, but maybe the pip-installed PyTorch was built against a different libstdc++?
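
To check that, a minimal sketch (assumes Linux, i.e. that /proc/self/maps is readable; mapped_libs is just a made-up helper name) which lists the libstdc++ copies mapped into the process after the TensorFlow import:

import tensorflow  # noqa: F401  (imported first, as in the failing test)


def mapped_libs(pattern="libstdc++"):
    # Collect the paths of all mapped shared objects whose name matches the pattern.
    with open("/proc/self/maps") as f:
        return sorted({line.split()[-1] for line in f if pattern in line})


print(mapped_libs())
# Importing torch afterwards would show whether a second, different libstdc++
# copy gets mapped in, which is exactly the step that crashes or hangs here.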

albertz avatar May 25 '23 13:05 albertz

I just tried with the Ubuntu 22.04 standard Python 3.10 binary, and I get the same crash:

Starting program: /usr/bin/python3.10 -m pytest tests/test_rf_array.py                                                                
[Thread debugging using libthread_db enabled]                                                                                         
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".                                                            
================================================================== test session starts ===============================================
platform linux -- Python 3.10.6, pytest-7.3.1, pluggy-1.0.0
rootdir: /u/zeyer/code/returnn
configfile: pytest.ini
collecting ... [Detaching after vfork from child process 35893]
[Detaching after vfork from child process 35894]
[Detaching after vfork from child process 35895]
[Detaching after vfork from child process 35988]
[Detaching after vfork from child process 35989]
[Detaching after vfork from child process 35992]

Program received signal SIGABRT, Aborted. 
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737350174528) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory. 
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737350174528) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737350174528) at ./nptl/pthread_kill.c:78 
#2  __GI___pthread_kill (threadid=140737350174528, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7c7f476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7c657f3 in __GI_abort () at ./stdlib/abort.c:79 
#5  0x00007fffeff87bbe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6 
#6  0x00007fffeff9324c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6 
#7  0x00007fffeff932b7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6 
#8  0x00007fffeff93518 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6 
#9  0x00007fffeff8a563 in std::__throw_runtime_error(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6 
#10 0x00007fffeffc0563 in std::random_device::_M_getval() () from /lib/x86_64-linux-gnu/libstdc++.so.6 
#11 0x00007fff5d75a6ef in _GLOBAL__sub_I_IpcFabricConfigClient.cpp () from /u/zeyer/.local/lib/python3.10/site-packages/torch/lib/libt
#12 0x00007ffff7fc947e in call_init (l=<optimized out>, argc=argc@entry=4, argv=argv@entry=0x7fffffffd1d8, env=env@entry=0x555558a95cc    at ./elf/dl-init.c:70
#13 0x00007ffff7fc9568 in call_init (env=0x555558a95cc0, argv=0x7fffffffd1d8, argc=4, l=<optimized out>) at ./elf/dl-init.c:33 
#14 _dl_init (main_map=0x555559931bd0, argc=4, argv=0x7fffffffd1d8, env=0x555558a95cc0) at ./elf/dl-init.c:117
#15 0x00007ffff7db1c85 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>)
    at ./elf/dl-error-skeleton.c:182
#16 0x00007ffff7fd0ff6 in dl_open_worker (a=0x7fffffff36b0) at ./elf/dl-open.c:808 
#17 dl_open_worker (a=a@entry=0x7fffffff36b0) at ./elf/dl-open.c:771
#18 0x00007ffff7db1c28 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>, args=<optimized out>) 
    at ./elf/dl-error-skeleton.c:208
#19 0x00007ffff7fd134e in _dl_open (file=<optimized out>, mode=-2147483646, caller_dlopen=0x5555557ba52f <_PyImport_FindSharedFuncptr+
    argc=4, argv=<optimized out>, env=0x555558a95cc0) at ./elf/dl-open.c:883 
#20 0x00007ffff7ccd6bc in dlopen_doit (a=a@entry=0x7fffffff3920) at ./dlfcn/dlopen.c:56 
#21 0x00007ffff7db1c28 in __GI__dl_catch_exception (exception=exception@entry=0x7fffffff3880, operate=<optimized out>, args=<optimized    at ./elf/dl-error-skeleton.c:208 
#22 0x00007ffff7db1cf3 in __GI__dl_catch_error (objname=0x7fffffff38d8, errstring=0x7fffffff38e0, mallocedp=0x7fffffff38d7, operate=<o    args=<optimized out>) at ./elf/dl-error-skeleton.c:227 
#23 0x00007ffff7ccd1ae in _dlerror_run (operate=operate@entry=0x7ffff7ccd660 <dlopen_doit>, args=args@entry=0x7fffffff3920) at ./dlfcn
#24 0x00007ffff7ccd748 in dlopen_implementation (dl_caller=<optimized out>, mode=<optimized out>,
    file=0x7fffac9de950 "/u/zeyer/.local/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-linux-gnu.so") at ./dlfcn/dlopen.c:7
#25 ___dlopen (file=file@entry=0x7fffac9de950 "/u/zeyer/.local/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-linux-gnu.so",
    mode=<optimized out>) at ./dlfcn/dlopen.c:81 
#26 0x00005555557ba52f in _PyImport_FindSharedFuncptr (prefix=0x5555558f1b61 "PyInit", shortname=0x7fffaca03380 "_C", 
    pathname=0x7fffac9de950 "/u/zeyer/.local/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-linux-gnu.so", fp=0x0)
    at ../Python/dynload_shlib.c:100
#27 0x00005555557b8a27 in _PyImport_LoadDynamicModuleWithSpec (fp=0x0, 
    spec=<ModuleSpec(name='torch._C', loader=<ExtensionFileLoader(name='torch._C', path='/u/zeyer/.local/lib/python3.10/site-packages/
0-x86_64-linux-gnu.so') at remote 0x7fffaca039a0>, origin='/u/zeyer/.local/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-li
_state=None, submodule_search_locations=None, _set_fileattr=True, _cached=None) at remote 0x7fffaca03940>) at ../Python/importdl.c:137
#28 _imp_create_dynamic_impl (module=<optimized out>, file=<optimized out>, 

albertz avatar May 25 '23 15:05 albertz

Maybe the problem is actually in std::random_device::_M_getval? Some related issues (a quick sanity check follows below):

https://github.com/h2oai/datatable/issues/2453
https://github.com/RobJinman/pro_office_calc/issues/5
https://github.com/boostorg/fiber/issues/249
https://github.com/microsoft/LightGBM/issues/1516
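
Some of those reports come down to std::random_device failing to read its entropy source. As a sanity check (my own sketch, assumes Linux; not claiming this is the cause here), one can verify that /dev/urandom is readable and the file-descriptor limit is not exhausted right before the import that crashes:

import os
import resource

# libstdc++'s std::random_device::_M_getval throws "random_device could not be
# read" when it cannot read its entropy source, so check the preconditions
# right before the import whose static initializers trigger the crash.
import tensorflow  # noqa: F401

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open fds:", len(os.listdir("/proc/self/fd")), "/ soft limit:", soft)
with open("/dev/urandom", "rb") as f:
    print("/dev/urandom readable:", len(f.read(16)) == 16)

import torch  # noqa: F401  (this is where the static initializer runs)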

Edit: I also posted it here: https://github.com/pytorch/pytorch/issues/102360

albertz avatar May 25 '23 15:05 albertz

Looking at the error message from std::__throw_runtime_error (via info registers and then trial-and-error print (const char*)... on the register values), it is: "random_device could not be read". Code here.

albertz avatar May 25 '23 20:05 albertz

Searching for that last error message gives some further, possibly interesting, results:

https://discuss.pytorch.org/t/random-device-could-not-be-read/138697
One interesting bit: "I am also using tensorflow along with pytorch in the script." We do the same in this test here, although, as far as I can see from the debug output, TF has not been imported yet at the point of the crash.

https://github.com/JohnSnowLabs/spark-nlp/issues/5943
https://discuss.tensorflow.org/t/tensorflow-linux-wheels-are-being-upgraded-to-manylinux2014/8339

albertz avatar May 25 '23 20:05 albertz

Interestingly, using TensorFlow 2.10 maybe does not cause the hang in PyTorch? At least I don't get the hang then. However, I don't have the proper CUDA env set up for this, so TF fails to load some CUDA libs, which might also influence the behavior. Or maybe TF 2.12 also behaves a bit differently w.r.t. the CUDA libs and loads them more lazily. I'm not sure.

$ python3.10 -c "import tensorflow; import torch"
2023-05-26 11:11:20.630355: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-26 11:11:20.726512: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cudnn-10.1-v7.6/lib64:/usr/local/cudnn-9.1-v7.1/lib64:/usr/local/cudnn-8.0-v7.0/lib64:/usr/local/cudnn-8.0-v6.0/lib64:/usr/local/cudnn-8.0-v5.1/lib64:/usr/local/cuda-9.1/lib64:/usr/local/cuda-9.1/extras/CUPTI/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/extras/CUPTI/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64:/usr/local/lib:/usr/local/cuda-6.5/lib64:/usr/lib/atlas-base:/usr/local/cuda-7.5/lib64
2023-05-26 11:11:20.726534: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-05-26 11:11:20.747490: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-05-26 11:11:22.222958: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cudnn-10.1-v7.6/lib64:/usr/local/cudnn-9.1-v7.1/lib64:/usr/local/cudnn-8.0-v7.0/lib64:/usr/local/cudnn-8.0-v6.0/lib64:/usr/local/cudnn-8.0-v5.1/lib64:/usr/local/cuda-9.1/lib64:/usr/local/cuda-9.1/extras/CUPTI/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/extras/CUPTI/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64:/usr/local/lib:/usr/local/cuda-6.5/lib64:/usr/lib/atlas-base:/usr/local/cuda-7.5/lib64
2023-05-26 11:11:22.223032: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cudnn-10.1-v7.6/lib64:/usr/local/cudnn-9.1-v7.1/lib64:/usr/local/cudnn-8.0-v7.0/lib64:/usr/local/cudnn-8.0-v6.0/lib64:/usr/local/cudnn-8.0-v5.1/lib64:/usr/local/cuda-9.1/lib64:/usr/local/cuda-9.1/extras/CUPTI/lib64:/usr/local/cuda-10.1/lib64:/usr/local/cuda-10.1/extras/CUPTI/lib64:/usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64:/usr/local/lib:/usr/local/cuda-6.5/lib64:/usr/lib/atlas-base:/usr/local/cuda-7.5/lib64
2023-05-26 11:11:22.223042: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

Why did I try TF 2.10? Because that is what we use in our GitHub CI, and there it works.

But in addition to that, I have read here that TF has changed something in recent versions, and they mention:

Q2. What kinds of breakages during the build process are most likely related to these changes? RuntimeError: random_device could not be read

So it mentions exactly the error I saw (but I saw this error in PyTorch...).
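
As a side note, here is a small sketch (mine; the file name buildinfo.py is made up) to compare the toolchains the two wheels report they were built with, which is what the manylinux2014 change is about. The two halves are meant to be run as separate processes, since importing both frameworks together is exactly what triggers the problem:

import sys

# Print the compiler/toolchain info each wheel reports about its own build.
# Usage: "python3 buildinfo.py tf" and "python3 buildinfo.py torch".
if sys.argv[1:] == ["tf"]:
    import tensorflow as tf
    print("TF", tf.__version__)
    print("compiler:", tf.version.COMPILER_VERSION)
elif sys.argv[1:] == ["torch"]:
    import torch
    print("torch", torch.__version__)
    print(torch.__config__.show())  # includes the GCC version used for the build
else:
    print("usage: python3 buildinfo.py tf|torch")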

(Cross-posted from here.)

albertz avatar May 26 '23 09:05 albertz