crash via tf_should_use format_stack
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: -
- TensorFlow installed from (source or binary): binary (pip)
- TensorFlow version (use command below): v1.11.0-0-gc19e29306c 1.11.0
- Python version: 3.6.3
- Bazel version (if compiling from source): -
- GCC/Compiler version (if compiling from source): -
- CUDA/cuDNN version: 8.0
- GPU model and memory: GTX 680 (will not be used)
- Exact command to reproduce: -
Describe the problem
When __repr__ is called on some TF objects at the wrong time, this can lead to a crash (seg fault; see below). This can happen for various reasons, e.g. when a debugger shows the locals of all threads. My case was the following, though I think the details do not matter:
- Via better_exchook, I extended the output of sys.excepthook and some traceback functions to print out some local vars and their __repr__ output. There is something similar for IPython.
- I created some tf.TensorArray and called unstack, and I did not use the result value. That unstack method is wrapped via should_use_result.
- The Python GC called the _TFShouldUseHelper.__del__ function at some random point, and this triggered the stack formatting and then the call to __repr__ of some TF objects.
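The trigger described in the list above can be imitated in pure Python without TensorFlow. This is only a sketch of the mechanism, not the TF code: NativeBackedObject, UnusedResultWarner and format_stack_with_locals are hypothetical stand-ins for a TF object, the should-use helper and a locals-printing traceback formatter.

```python
import gc
import sys


class NativeBackedObject:
    """Stand-in for an object whose __repr__ reads native (C/C++) state."""

    def __init__(self):
        self.repr_calls = 0

    def __repr__(self):
        # A real wrapper would call into C here; if the native object had
        # already been freed, that call could crash the process.
        self.repr_calls += 1
        return "<NativeBackedObject>"


def format_stack_with_locals():
    """Hypothetical formatter that prints the locals of every frame."""
    lines = []
    frame = sys._getframe(1)
    while frame is not None:
        for name, value in list(frame.f_locals.items()):
            try:
                text = repr(value)
            except Exception:
                text = "<repr failed>"
            lines.append("%s: %s = %s" % (frame.f_code.co_name, name, text))
        frame = frame.f_back
    return lines


class UnusedResultWarner:
    """Hypothetical analogue of a helper that reports unused results."""

    def __del__(self):
        # Runs whenever the GC finalizes this object, on top of whatever
        # frame happens to be executing at that moment.
        format_stack_with_locals()


def build_graph():
    obj = NativeBackedObject()
    warner = UnusedResultWarner()
    warner.cycle = warner  # reference cycle, so only the cyclic GC frees it
    del warner
    gc.collect()  # the finalizer runs here and reprs the locals of this frame
    return obj
```

The call to gc.collect() only makes the timing deterministic for the sketch. In the real case the collection point is not under the caller's control, which is why __repr__ can run on an object in any state.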
Originally, this happened at exit, and I thought that probably it's just not safe at exit to touch any existing TF objects. So I fixed that case in better_exchook: It will not print any vars at exit. A test case to reproduce exactly that case is here.
However, I now also get the same crash at another random point, not only at exit (see the stack below). It will be hard to come up with a test case for this, as it is very non-deterministic when exactly the GC runs and calls the __del__ function.
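For illustration, the "do not print any vars at exit" workaround could look roughly like the sketch below. It uses the standard traceback.TracebackException rather than better_exchook, whose real implementation is not shown in this issue. The function name format_exception_safely and the finalizing parameter are hypothetical. Using sys.is_finalizing() as the default is an assumption, and it may not yet report true inside atexit handlers, so a real tool would probably need its own exit flag.

```python
import sys
import traceback


def format_exception_safely(exc_type, exc, tb, finalizing=None):
    """Format a traceback; print locals only while the interpreter is alive."""
    if finalizing is None:
        # Assumption: interpreter shutdown is the only unsafe phase.
        finalizing = sys.is_finalizing()
    tb_exc = traceback.TracebackException(
        exc_type, exc, tb,
        # capture_locals=True calls repr() on every local in the traceback,
        # which is exactly the step that is skipped when exiting.
        capture_locals=not finalizing,
    )
    return "".join(tb_exc.format())
```

This only covers the at-exit case. It does nothing for a finalizer that runs in the middle of normal execution, which is the second crash described above.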
Source code / logs
Current thread 0x00007f14209e8700 (most recent call first):
File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1897 in name
File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 352 in name
File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 614 in __repr__
File "/u/zeyer/setups/librispeech/2018-02-26--att/returnn/tests/../better_exchook.py", line 250 in pretty_print
File "/u/zeyer/setups/librispeech/2018-02-26--att/returnn/tests/../better_exchook.py", line 487 in format_py_obj
File "/u/zeyer/setups/librispeech/2018-02-26--att/returnn/tests/../better_exchook.py", line 571 in <lambda>
File "/u/zeyer/setups/librispeech/2018-02-26--att/returnn/tests/../better_exchook.py", line 522 in _trySet
File "/u/zeyer/setups/librispeech/2018-02-26--att/returnn/tests/../better_exchook.py", line 571 in format_tb
File "/u/zeyer/.linuxbrew/opt/python3/lib/python3.6/traceback.py", line 37 in format_list
File "/u/zeyer/.linuxbrew/opt/python3/lib/python3.6/traceback.py", line 193 in format_stack
File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py", line 60 in __del__
File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 81 in __init__
File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4181 in _add_device_to_stack
File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4243 in device
File "/u/zeyer/.linuxbrew/opt/python3/lib/python3.6/contextlib.py", line 81 in __enter__
File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3366 in _GroupControlDeps
File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3415 in group
File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3486 in tuple
File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 791 in _GradientsHelper
File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 596 in gradients
File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 517 in compute_gradients
File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 401 in minimize
File "tests/test_TFNetworkRecLayer.py", line 219 in test_rhn_nan
File "tests/test_TFNetworkRecLayer.py", line 2175 in <module>
"ops.py", line 1897 in name, that is this code:
@property
def name(self):
  """The full name of this operation."""
  return c_api.TF_OperationName(self._c_op)
I often also see this just before the crash:
pure virtual method called
A Travis log with this crash can also be seen here, or here.
The C backtrace is this:
/lib/x86_64-linux-gnu/libpthread.so.0(raise+0x29)[0x7f7e8df1a269]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f7e8df1a390]
/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TF_OperationName+0xa)[0x7f7e5ccc0eca]
/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x1982264)[0x7f7e5ca78264]
/u/zeyer/.linuxbrew/Cellar/python3/3.6.3/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0x209)[0x7f7e8e1f61c9]
...
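The Python-level trace above ("Current thread ... (most recent call first)") has the format that the standard faulthandler module prints on a fatal signal. For anyone who wants the same kind of trace from their own run, a minimal sketch follows; the helper name enable_crash_traces and the log path are hypothetical.

```python
import faulthandler


def enable_crash_traces(log_path):
    """Dump the Python tracebacks of all threads to log_path on fatal signals."""
    # The file has to stay open for as long as faulthandler is enabled,
    # so it is returned to the caller instead of being closed here.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Running the script with python3 -X faulthandler has a similar effect and writes to stderr instead of a file.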
@ebrevdo any idea what could be causing this?
My guess: Some Swig internals, which do not expect a thread change in certain context (which is triggered here by the Python GC calling __del__ in some unexpected context).
@allenlavoie may have insight.
Nothing jumps out to me as an obvious cause. Sounds like this needs debugging, and without a more concrete reproduction I'm not sure there's much to be done.
Is there a loop you can construct which eventually results in this bug being triggered?
I'm having a hard time replicating the issue. I ran:
$ python3 --version
Python 3.5.3
$ python3 -c 'import tensorflow; print(tensorflow.__version__)'
1.13.0-dev20181121
$ python3 test-tf111-tfshoulduse-crash.py
2018-11-21 15:59:19.821636: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
create graph
WARNING:tensorflow:From /home/ebrevdo/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
variables:
[<tf.Variable 'b:0' shape=(10, 1, 6) dtype=float32_ref>]
init vars
graph size: 8668
train
step 0, loss: 1.596843
EXCEPTION
Traceback (most recent call last):
File "test-tf111-tfshoulduse-crash.py", line 217, in test
line: raise Exception("foo")
locals:
Exception = <builtin> <class 'Exception'>
Exception: foo
Exit.
atexit handler
EXCEPTION
Traceback (most recent call last):
(Exclude vars because we are exiting.)
File "test-tf111-tfshoulduse-crash.py", line 229, in at_exit_handler
line: raise Exception("foo")
Exception: foo
Dummy Goodbye
ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.ops.tensor_array_ops.TensorArray'>):
<tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x7f2390b5b6d8>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
File "test-tf111-tfshoulduse-crash.py", line 240, in <module>
line: print("Exit.")
File "test-tf111-tfshoulduse-crash.py", line 219, in test
line: sys.excepthook(*sys.exc_info())
File "/home/ebrevdo/.local/lib/python3.5/site-packages/tensorflow/python/util/tf_should_use.py", line 189, in wrapped
line: return _add_should_use_warning(fn(*args, **kwargs))
...
I had a similar successful run with TF nightly from september.
You used the better_exchook version which includes the workaround for this case. Can you try an older version?
I removed better_exchook (removed the import, and the two commands in main) and still am not able to replicate in py3.5
See my earlier explanation. Only with better_exchook you can trigger this crash.
oh i see; an older version of better_exchook. checking...
ok i was able to replicate the issue. gonna see if i can run this under address sanitizer...
OK; asan picked something up:
Colocations handled automatically by placer.
variables:
[<tf.Variable 'b:0' shape=(10, 1, 6) dtype=float32_ref>]
init vars
graph size: 8668
train
step 0, loss: 1.596843
EXCEPTION
Traceback (most recent call last):
File "test-tf111-tfshoulduse-crash.py", line 75, in test
line: raise Exception("foo")
locals:
Exception = <builtin> <class 'Exception'>
Exception: foo
Exit blah.
atexit handler
EXCEPTION
Traceback (most recent call last):
File "test-tf111-tfshoulduse-crash.py", line 87, in at_exit_handler
line: raise Exception("foo")
locals:
Exception = <builtin> <class 'Exception'>
Exception: foo
Dummy Goodbye
=================================================================
==229269==ERROR: AddressSanitizer: heap-use-after-free on address 0x62500077e338 at pc 0x55e64a6b258e bp 0x7ffffdcce0c0 sp 0x7ffffdcce0b8
READ of size 8 at 0x62500077e338 thread T0
#0 0x55e64a6b258d in std::__shared_ptr<tensorflow::NodeProperties, (__gnu_cxx::_Lock_policy)2>::operator->() const crosstool/stable/toolchain/bin/../lib/gcc/x86_64-linux-gnu/version/../../../../x86_64-linux-gnu/include/c++/version/bits/shared_ptr_base.h:1046:9
#1 0x55e64a6a58df in tensorflow::Node::name() const tensorflow/core/graph/graph.cc:159:43
#2 0x55e63bd73458 in TF_OperationName tensorflow/c/c_api.cc:1418:21
#3 0x7fc5b33dfa94 in _wrap_TF_OperationName(_object*, _object*) bazel-out/asan-py3-dbg/genfiles/tensorflow/python/_third__party_tensorflow_python_pywrap__tensorflow__internal.cc:12098:22
#4 0x55e64d31f265 in _PyCFunction_FastCallDict python_runtime/v3_6/Objects/methodobject.c:234:22
#5 0x55e64d3a407d in call_function python_runtime/v3_6/Python/ceval.c:4837:9
#6 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
#7 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
#8 0x55e64d39bd63 in PyEval_EvalCodeEx python_runtime/v3_6/Python/ceval.c:4187:12
#9 0x55e64d307a85 in function_call python_runtime/v3_6/Objects/funcobject.c:604:14
#10 0x55e64d2d5e2a in PyObject_Call python_runtime/v3_6/Objects/abstract.c:2261:14
#11 0x55e64d2f27d2 in property_descr_get python_runtime/v3_6/Objects/descrobject.c:1384:11
#12 0x55e64d322ba8 in _PyObject_GenericGetAttrWithDict python_runtime/v3_6/Objects/object.c:1066:19
#13 0x55e64d39ff59 in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:2872:29
#14 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
#15 0x55e64d39bd63 in PyEval_EvalCodeEx python_runtime/v3_6/Python/ceval.c:4187:12
#16 0x55e64d307a85 in function_call python_runtime/v3_6/Objects/funcobject.c:604:14
#17 0x55e64d2d5e2a in PyObject_Call python_runtime/v3_6/Objects/abstract.c:2261:14
#18 0x55e64d2f27d2 in property_descr_get python_runtime/v3_6/Objects/descrobject.c:1384:11
#19 0x55e64d322ba8 in _PyObject_GenericGetAttrWithDict python_runtime/v3_6/Objects/object.c:1066:19
#20 0x55e64d39ff59 in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:2872:29
#21 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
#22 0x55e64d39bd63 in PyEval_EvalCodeEx python_runtime/v3_6/Python/ceval.c:4187:12
#23 0x55e64d307a85 in function_call python_runtime/v3_6/Objects/funcobject.c:604:14
#24 0x55e64d2d5e2a in PyObject_Call python_runtime/v3_6/Objects/abstract.c:2261:14
#25 0x55e64d2f27d2 in property_descr_get python_runtime/v3_6/Objects/descrobject.c:1384:11
#26 0x55e64d322ba8 in _PyObject_GenericGetAttrWithDict python_runtime/v3_6/Objects/object.c:1066:19
#27 0x55e64d39ff59 in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:2872:29
#28 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
#29 0x55e64d2d60a1 in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2310:18
#30 0x55e64d2d6295 in _PyObject_Call_Prepend python_runtime/v3_6/Objects/abstract.c:2373:14
#31 0x55e64d2d605b in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2331:18
#32 0x55e64d335e89 in slot_tp_repr python_runtime/v3_6/Objects/typeobject.c:6127:15
#33 0x55e64d321d26 in PyObject_Repr python_runtime/v3_6/Objects/object.c:490:11
#34 0x55e64d32edca in tuplerepr python_runtime/v3_6/Objects/tupleobject.c:303:13
#35 0x55e64d321d26 in PyObject_Repr python_runtime/v3_6/Objects/object.c:490:11
#36 0x55e64d31f3b2 in _PyCFunction_FastCallDict python_runtime/v3_6/Objects/methodobject.c
#37 0x55e64d3a407d in call_function python_runtime/v3_6/Python/ceval.c:4837:9
#38 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
#39 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
#40 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
#41 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
#42 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
#43 0x55e64d3a519b in fast_function python_runtime/v3_6/Python/ceval.c:4978:12
#44 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
#45 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
#46 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
#47 0x55e64d3a519b in fast_function python_runtime/v3_6/Python/ceval.c:4978:12
#48 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
#49 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
#50 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
#51 0x55e64d3a519b in fast_function python_runtime/v3_6/Python/ceval.c:4978:12
#52 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
#53 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
#54 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
#55 0x55e64d3a519b in fast_function python_runtime/v3_6/Python/ceval.c:4978:12
#56 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
#57 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
#58 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
#59 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
#60 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
#61 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
#62 0x55e64d3a519b in fast_function python_runtime/v3_6/Python/ceval.c:4978:12
#63 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
#64 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
#65 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
#66 0x55e64d2d60a1 in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2310:18
#67 0x55e64d2d6295 in _PyObject_Call_Prepend python_runtime/v3_6/Objects/abstract.c:2373:14
#68 0x55e64d2d605b in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2331:18
#69 0x55e64d336a4f in slot_tp_finalize python_runtime/v3_6/Objects/typeobject.c:6463:15
#70 0x55e64cd7e172 in finalize_garbage python_runtime/v3_6/Modules/gcmodule.c:806:13
#71 0x55e64cd7b1b7 in collect python_runtime/v3_6/Modules/gcmodule.c:1005:5
#72 0x55e64cd7ad19 in collect_with_callback python_runtime/v3_6/Modules/gcmodule.c:1128:14
#73 0x55e64cd7ab60 in PyGC_Collect python_runtime/v3_6/Modules/gcmodule.c:1594:13
#74 0x55e64d3cb556 in Py_FinalizeEx python_runtime/v3_6/Python/pylifecycle.c:601:5
0x62500077e338 is located 2616 bytes inside of 8192-byte region [0x62500077d900,0x62500077f900)
freed by thread T0 here:
#0 0x55e62bb55ac2 in __interceptor_free llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:124:3
#1 0x55e64aa37578 in tensorflow::core::Arena::~Arena() tensorflow/core/lib/core/arena.cc:66:5
#2 0x55e64a6a928b in tensorflow::Graph::~Graph() tensorflow/core/graph/graph.cc:372:1
#3 0x55e63bd873e5 in TF_Graph::~TF_Graph() tensorflow/c/c_api_internal.h:75:8
#4 0x55e63bd7fb9d in TF_DeleteSession tensorflow/c/c_api.cc:2588:14
#5 0x7fc5b33f65dc in _wrap_TF_DeleteSession(_object*, _object*) bazel-out/asan-py3-dbg/genfiles/tensorflow/python/_third__party_tensorflow_python_pywrap__tensorflow__internal.cc:16303:5
#6 0x55e64d31f265 in _PyCFunction_FastCallDict python_runtime/v3_6/Objects/methodobject.c:234:22
#7 0x55e64d3a407d in call_function python_runtime/v3_6/Python/ceval.c:4837:9
#8 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
#9 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
#10 0x55e64d2d60a1 in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2310:18
#11 0x55e64d2d6295 in _PyObject_Call_Prepend python_runtime/v3_6/Objects/abstract.c:2373:14
#12 0x55e64d2d605b in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2331:18
#13 0x55e64d336a4f in slot_tp_finalize python_runtime/v3_6/Objects/typeobject.c:6463:15
#14 0x55e64cd7e172 in finalize_garbage python_runtime/v3_6/Modules/gcmodule.c:806:13
#15 0x55e64cd7b1b7 in collect python_runtime/v3_6/Modules/gcmodule.c:1005:5
#16 0x55e64cd7ad19 in collect_with_callback python_runtime/v3_6/Modules/gcmodule.c:1128:14
#17 0x55e64cd7ab60 in PyGC_Collect python_runtime/v3_6/Modules/gcmodule.c:1594:13
#18 0x55e64d3cb556 in Py_FinalizeEx python_runtime/v3_6/Python/pylifecycle.c:601:5
previously allocated by thread T0 here:
#0 0x55e62bb56ac9 in __interceptor_posix_memalign llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:219:3
#1 0x55e62ebbd292 in aligned_malloc(unsigned long, unsigned long) base/port.h:897:7
#2 0x55e64aa37200 in tensorflow::core::Arena::Arena(unsigned long) tensorflow/core/lib/core/arena.cc:54:31
#3 0x55e64a6a7d4c in tensorflow::Graph::Graph(tensorflow::OpRegistryInterface const*) tensorflow/core/graph/graph.cc:323:7
#4 0x55e63bd792bc in TF_Graph::TF_Graph() tensorflow/c/c_api.cc:1854:7
#5 0x55e63bd7942a in TF_NewGraph tensorflow/c/c_api.cc:1860:38
#6 0x7fc5b33d3f88 in _wrap_TF_NewGraph(_object*, _object*) bazel-out/asan-py3-dbg/genfiles/tensorflow/python/_third__party_tensorflow_python_pywrap__tensorflow__internal.cc:10165:26
#7 0x55e64d31f265 in _PyCFunction_FastCallDict python_runtime/v3_6/Objects/methodobject.c:234:22
#8 0x55e64d3a407d in call_function python_runtime/v3_6/Python/ceval.c:4837:9
#9 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
#10 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
#11 0x55e64d2d60a1 in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2310:18
#12 0x55e64d2d6295 in _PyObject_Call_Prepend python_runtime/v3_6/Objects/abstract.c:2373:14
#13 0x55e64d2d5e2a in PyObject_Call python_runtime/v3_6/Objects/abstract.c:2261:14
#14 0x55e64d336908 in slot_tp_init python_runtime/v3_6/Objects/typeobject.c:6420:11
#15 0x55e64d331fc8 in type_call python_runtime/v3_6/Objects/typeobject.c:915:19
#16 0x55e64d2d605b in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2331:18
#17 0x55e64d3a4053 in call_function python_runtime/v3_6/Python/ceval.c:4861:17
#18 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
#19 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
#20 0x55e64d2d60a1 in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2310:18
#21 0x55e64d2d6295 in _PyObject_Call_Prepend python_runtime/v3_6/Objects/abstract.c:2373:14
#22 0x55e64d2d5e2a in PyObject_Call python_runtime/v3_6/Objects/abstract.c:2261:14
#23 0x55e64d336908 in slot_tp_init python_runtime/v3_6/Objects/typeobject.c:6420:11
#24 0x55e64d331fc8 in type_call python_runtime/v3_6/Objects/typeobject.c:915:19
#25 0x55e64d2d605b in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2331:18
#26 0x55e64d3a4053 in call_function python_runtime/v3_6/Python/ceval.c:4861:17
#27 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
#28 0x55e64d308a38 in gen_send_ex python_runtime/v3_6/Objects/genobject.c:189:14
#29 0x55e64d398cf8 in builtin_next python_runtime/v3_6/Python/bltinmodule.c:1330:11
SUMMARY: AddressSanitizer: heap-use-after-free crosstool/stable/toolchain/bin/../lib/gcc/x86_64-linux-gnu/version/../../../../x86_64-linux-gnu/include/c++/version/bits/shared_ptr_base.h:1046:9 in std::__shared_ptr<tensorflow::NodeProperties, (__gnu_cxx::_Lock_policy)2>::operator->() const
Shadow bytes around the buggy address:
0x0c4a800e7c10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c4a800e7c20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c4a800e7c30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c4a800e7c40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c4a800e7c50: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c4a800e7c60: fa fa fa fa fa fa fa[fa]fa fa fa fa fa fa fa fa
0x0c4a800e7c70: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c4a800e7c80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c4a800e7c90: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c4a800e7ca0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c4a800e7cb0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
==229269==ABORTING
@allenlavoie looks like in this case (test-tf111-tfshoulduse-crash.py in python3 with better_exchook == 20171121.105512) we attempt to access graph data after it has been deleted. Presumably this is caused by an interaction with tf_should_use format_stack.
Perhaps we can be more careful about when we call format_stack? We do this lazily to avoid the cost of formatting, but is there a way to check that the graph in the stack still exists?
We could also consider sanitizing the stack before formatting.
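One possible reading of "sanitizing the stack" is to capture it eagerly as plain data, so that formatting it later never touches a live frame or local variable. A minimal sketch under that assumption follows; snapshot_stack and format_snapshot are hypothetical names, not part of tf_should_use.

```python
import traceback


def snapshot_stack(skip=1):
    """Return the current call stack as plain (filename, lineno, name) tuples."""
    frames = traceback.extract_stack()
    if skip:
        frames = frames[:-skip]  # drop this helper's own frame
    return [(f.filename, f.lineno, f.name) for f in frames]


def format_snapshot(snapshot):
    """Format a snapshot using only the stored strings and integers."""
    return ['  File "%s", line %s, in %s' % entry for entry in snapshot]
```

Because the snapshot holds only strings and integers, a formatter that has been replaced or extended to print locals has nothing to call __repr__ on. The cost is that the lazy formatting mentioned above is partly given up, since the stack has to be extracted when the object is created.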
So, to make it clear: There is a Python object which corresponds to a graph in C++ which does not exist anymore, or has become invalid? How is this possible? This is via Swig, right? I thought that Swig does some sort of reference counting.
Or does the C++ graph object itself still exists, but accessing it becomes invalid? Is there a flag or so that marks that the object is invalid now? Maybe there should just be a check for this flag and if the object is invalid, any related functions should return some sane value (None or so) or throw a Python exception, instead of this crash?
I feel like cleaning/sanitizing the stack trace to try to avoid any possible access to such objects is just a workaround to the problem.
I tried to write a simpler test case. See the commit I just referenced. That code sometimes crashes in various different ways.
Oh interesting, good find. So maybe we just need to set some Python properties to None when the destructor for the C Graph object runs? https://github.com/tensorflow/tensorflow/blob/73f193aa4b9999e0a5bf7d29b1838e2a662e9507/tensorflow/python/framework/c_api_util.py#L52
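A rough sketch of that idea, combined with the earlier suggestion to raise a Python exception instead of crashing, might look like the following. It is pure Python and only imitates the ownership relation. GraphHandle, OpHandle and _native_name are hypothetical names, and the string stored in _native_name stands in for a pointer into graph-owned memory.

```python
class OpHandle:
    """Hypothetical wrapper around an operation owned by a graph."""

    def __init__(self, name):
        self._native_name = name  # stand-in for a pointer into graph memory

    @property
    def name(self):
        if self._native_name is None:
            # Fail with a normal Python exception instead of reading freed memory.
            raise RuntimeError("operation belongs to a deleted graph")
        return self._native_name

    def __repr__(self):
        try:
            return "<OpHandle %r>" % self.name
        except RuntimeError:
            return "<OpHandle (graph deleted)>"


class GraphHandle:
    """Hypothetical owner of the native graph."""

    def __init__(self):
        self._ops = []

    def new_op(self, name):
        op = OpHandle(name)
        self._ops.append(op)
        return op

    def delete(self):
        # Stand-in for the destructor of the native graph: every dependent
        # handle is invalidated before the memory would be released.
        for op in self._ops:
            op._native_name = None
        self._ops = []
```

With this arrangement a __repr__ call that arrives after the graph is gone returns a placeholder string, and direct access to name raises RuntimeError.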
Hi @albertz! It is not crashing in the 2.x version any more. Attached gist for reference. Shall we consider it resolved now? Thank you!
@mohantym Is this also for the code in https://github.com/albertz/playground/commit/114bcaf7abe9da0c083ec64e01bac2357806d523 ?
What was done to resolve this?
@albertz ! I have used compatibility mode as it was originally 1.x codebase (updated comments in code ). Yeah , I got the code from your Github repo only. Thank you!
Yeah , I got the code from your Github repo only.
In your gist, you had the initial code here in this issue, but I was referring to this simplified code: https://github.com/albertz/playground/commit/114bcaf7abe9da0c083ec64e01bac2357806d523
Hi @albertz! I am facing an attribute error with the new code. Attached gist for reference. Could you share a Colab gist with the error from your side? Thank you!
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.
@mohantym I updated the code for TF2. Please see here: https://github.com/albertz/playground/blob/master/tf-crash-use-after-delete-graph.py The problem is still there. It still crashes.
@albertz ! I am observing a different behaviour in 2.9 and nightly. Could you let us know from your end. Thank you!
I can replicate the error in tf-nightly in Python 3.10. Running in gdb, here's the stack trace at segfault:
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
__strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:74
74 ../sysdeps/x86_64/multiarch/strlen-avx2.S: No such file or directory.
(gdb) bt
#0 __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:74
#1 0x00007fffbc50f975 in pybind11::detail::type_caster<char, void>::cast(char const*, pybind11::return_value_policy, pybind11::handle) ()
from /home/ebrevdo/.local/lib/python3.10/site-packages/tensorflow/python/client/_pywrap_tf_session.so
#2 0x00007fffbc52d350 in pybind11::cpp_function::initialize<char const* (*&)(TF_Operation*), char const*, TF_Operation*, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(char const* (*&)(TF_Operation*), char const* (*)(TF_Operation*), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
from /home/ebrevdo/.local/lib/python3.10/site-packages/tensorflow/python/client/_pywrap_tf_session.so
#3 0x00007fffbc5359a1 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
from /home/ebrevdo/.local/lib/python3.10/site-packages/tensorflow/python/client/_pywrap_tf_session.so
#4 0x00005555556dbfee in ?? ()
#5 0x00005555556d2c93 in _PyObject_MakeTpCall ()
#6 0x00005555556cc65d in _PyEval_EvalFrameDefault ()
#7 0x00005555556dc798 in _PyFunction_Vectorcall ()
#8 0x00005555556c6ee7 in _PyEval_EvalFrameDefault ()
#9 0x00005555556dc798 in _PyFunction_Vectorcall ()
#10 0x00005555556c6ee7 in _PyEval_EvalFrameDefault ()
#11 0x00005555557aaa22 in ?? ()
#12 0x00005555557aa962 in PyEval_EvalCode ()
#13 0x00005555557d1374 in ?? ()
#14 0x00005555557cbbdb in ?? ()
#15 0x00005555557d1121 in ?? ()
#16 0x00005555557d0754 in _PyRun_SimpleFileObject ()
#17 0x00005555557d04b3 in _PyRun_AnyFileObject ()
#18 0x00005555557c42ca in Py_RunMain ()
#19 0x000055555579ee19 in Py_BytesMain ()
#20 0x00007ffff7c337fd in __libc_start_main (main=0x55555579ede0, argc=2, argv=0x7fffffffde18, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fffffffde08) at ../csu/libc-start.c:332
#21 0x000055555579ed1a in _start ()
(that said, the new tf2 code calls tf.disable_v2_behavior(), which drops you back in TF1-mode. I don't know the support story for this anymore...)