
crash via tf_should_use format_stack

Open albertz opened this issue 7 years ago • 24 comments

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: -
  • TensorFlow installed from (source or binary): binary (pip)
  • TensorFlow version (use command below): v1.11.0-0-gc19e29306c 1.11.0
  • Python version: 3.6.3
  • Bazel version (if compiling from source): -
  • GCC/Compiler version (if compiling from source): -
  • CUDA/cuDNN version: 8.0
  • GPU model and memory: GTX 680 (will not be used)
  • Exact command to reproduce: -

Describe the problem

When __repr__ is called on some TF objects at the wrong time, this can lead to a crash (seg fault; see below). There can be various reasons why this can happen, e.g. when a debugger shows the locals of all threads. My case was this, but I think this doesn't matter:

  • Via better_exchook, I extended the output of sys.excepthook and some traceback functions to print out some local vars and their __repr__ output. There is something similar for IPython.
  • I created some tf.TensorArray and called unstack and I did not use the result value. That unstack method is wrapped via should_use_result.
  • The Python GC called the _TFShouldUseHelper.__del__ function at some random point, and this triggered the stack formatting and then a call to the __repr__ of some TF objects.
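The chain described above can be sketched without TensorFlow. This is only an illustration under stated assumptions: `Fragile`, `UnusedResultHelper`, and `format_stack_with_locals` are hypothetical stand-ins (for a TF object, `_TFShouldUseHelper`, and a locals-printing formatter such as better_exchook), and an explicit `gc.collect()` stands in for the collector firing at an arbitrary point.

```python
import gc
import sys

repr_calls = []


class Fragile:
    """Hypothetical stand-in for a TF object whose __repr__ reaches into native state."""

    def __repr__(self):
        repr_calls.append("Fragile.__repr__")
        return "<Fragile>"


def format_stack_with_locals(max_frames=3):
    """Hypothetical formatter: repr() the locals of the innermost frames."""
    lines = []
    frame = sys._getframe(1)  # start at the caller, i.e. the finalizer
    while frame is not None and max_frames > 0:
        for name, value in list(frame.f_locals.items()):
            try:
                lines.append("%s = %r" % (name, value))
            except Exception:
                lines.append("%s = <repr failed>" % name)
        frame = frame.f_back
        max_frames -= 1
    return lines


class UnusedResultHelper:
    """Hypothetical stand-in for a helper that formats a stack in __del__."""

    def __init__(self):
        self._self_ref = self  # reference cycle: only the cyclic GC can finalize this

    def __del__(self):
        format_stack_with_locals()


def unrelated_work():
    fragile = Fragile()
    gc.collect()  # stands in for the GC firing at an arbitrary allocation
    return fragile


def demo():
    del repr_calls[:]
    was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic collector from running earlier than intended
    try:
        UnusedResultHelper()  # result dropped; the cycle keeps it alive until a GC pass
        unrelated_work()
    finally:
        if was_enabled:
            gc.enable()
    return list(repr_calls)
```

Calling `demo()` shows that `Fragile.__repr__` runs from inside the finalizer, even though `unrelated_work` never asked for a repr. With real TF objects, that `__repr__` is what ends up in `TF_OperationName`.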

Originally, this happened at exit, and I thought that probably it's just not safe at exit to touch any existing TF objects. So I fixed that case in better_exchook: It will not print any vars at exit. A test case to reproduce exactly that case is here.

However, now I get the same crash also not at exit but at another random point (see stack below). It will be hard to come up with a test case for this, as it is very non-deterministic when exactly the GC runs and calls the __del__ function.

Source code / logs

Current thread 0x00007f14209e8700 (most recent call first):
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1897 in name
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 352 in name
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 614 in __repr__
  File "/u/zeyer/setups/librispeech/2018-02-26--att/returnn/tests/../better_exchook.py", line 250 in pretty_print
  File "/u/zeyer/setups/librispeech/2018-02-26--att/returnn/tests/../better_exchook.py", line 487 in format_py_obj
  File "/u/zeyer/setups/librispeech/2018-02-26--att/returnn/tests/../better_exchook.py", line 571 in <lambda>
  File "/u/zeyer/setups/librispeech/2018-02-26--att/returnn/tests/../better_exchook.py", line 522 in _trySet
  File "/u/zeyer/setups/librispeech/2018-02-26--att/returnn/tests/../better_exchook.py", line 571 in format_tb
  File "/u/zeyer/.linuxbrew/opt/python3/lib/python3.6/traceback.py", line 37 in format_list
  File "/u/zeyer/.linuxbrew/opt/python3/lib/python3.6/traceback.py", line 193 in format_stack
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py", line 60 in __del__
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 81 in __init__
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4181 in _add_device_to_stack
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4243 in device
  File "/u/zeyer/.linuxbrew/opt/python3/lib/python3.6/contextlib.py", line 81 in __enter__
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3366 in _GroupControlDeps
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3415 in group
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3486 in tuple
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 791 in _GradientsHelper
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py", line 596 in gradients
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 517 in compute_gradients
  File "/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 401 in minimize
  File "tests/test_TFNetworkRecLayer.py", line 219 in test_rhn_nan
  File "tests/test_TFNetworkRecLayer.py", line 2175 in <module>

"ops.py", line 1897 in name, that is this code:

  @property
  def name(self):
    """The full name of this operation."""
    return c_api.TF_OperationName(self._c_op)

I often also see this just before the crash:

pure virtual method called

A Travis log with this crash can also be seen here, or here.

The C backtrace is this:

/lib/x86_64-linux-gnu/libpthread.so.0(raise+0x29)[0x7f7e8df1a269]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f7e8df1a390]
/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(TF_OperationName+0xa)[0x7f7e5ccc0eca]
/u/zeyer/py-envs/py36-tf111/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so(+0x1982264)[0x7f7e5ca78264]
/u/zeyer/.linuxbrew/Cellar/python3/3.6.3/lib/libpython3.6m.so.1.0(_PyCFunction_FastCallDict+0x209)[0x7f7e8e1f61c9]
...

albertz avatar Oct 05 '18 16:10 albertz

@ebrevdo any idea what could be causing this?

alextp avatar Oct 15 '18 19:10 alextp

My guess: Some Swig internals, which do not expect a thread change in certain context (which is triggered here by the Python GC calling __del__ in some unexpected context).

albertz avatar Oct 15 '18 20:10 albertz

@allenlavoie may have insight.

ebrevdo avatar Oct 15 '18 21:10 ebrevdo

Nothing jumps out to me as an obvious cause. Sounds like this needs debugging, and without a more concrete reproduction I'm not sure there's much to be done.

Is there a loop you can construct which eventually results in this bug being triggered?
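One way such a loop might be set up is to lower the GC thresholds so that collections (and therefore finalizers) run as often as possible. A minimal sketch, assuming CPython; `Cyclic`, `body`, and `stress` are hypothetical names, and the real body would build the graph and drop the unused `TensorArray` result instead.

```python
import gc

finalized = []


class Cyclic:
    """Hypothetical object that only the cyclic GC can free."""

    def __init__(self):
        self.me = self  # reference cycle

    def __del__(self):
        finalized.append(1)


def stress(body, iterations=100):
    """Run body repeatedly with the most aggressive GC thresholds."""
    old_threshold = gc.get_threshold()
    was_enabled = gc.isenabled()
    gc.enable()
    gc.set_threshold(1, 1, 1)
    try:
        for _ in range(iterations):
            body()
    finally:
        gc.set_threshold(*old_threshold)
        if not was_enabled:
            gc.disable()


def body():
    Cyclic()                 # unreferenced cycle with a finalizer
    [[] for _ in range(10)]  # container allocations that let the collector trigger


stress(body)
```

This does not make the timing deterministic. It only raises the chance that a finalizer runs in the middle of unrelated code within a bounded number of iterations.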

allenlavoie avatar Oct 22 '18 19:10 allenlavoie

I'm having a hard time replicating the issue. I ran:

$ python3 --version
Python 3.5.3

$ python3 -c 'import tensorflow; print(tensorflow.__version__)'
1.13.0-dev20181121

$ python3 test-tf111-tfshoulduse-crash.py

2018-11-21 15:59:19.821636: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
create graph
WARNING:tensorflow:From /home/ebrevdo/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
variables:
[<tf.Variable 'b:0' shape=(10, 1, 6) dtype=float32_ref>]
init vars
graph size: 8668
train
step 0, loss: 1.596843
EXCEPTION
Traceback (most recent call last):
  File "test-tf111-tfshoulduse-crash.py", line 217, in test
    line: raise Exception("foo")
    locals:
      Exception = <builtin> <class 'Exception'>
Exception: foo
Exit.
atexit handler
EXCEPTION
Traceback (most recent call last):
(Exclude vars because we are exiting.)
  File "test-tf111-tfshoulduse-crash.py", line 229, in at_exit_handler
    line: raise Exception("foo")
Exception: foo
Dummy Goodbye
ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.ops.tensor_array_ops.TensorArray'>):
<tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x7f2390b5b6d8>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
  File "test-tf111-tfshoulduse-crash.py", line 240, in <module>
    line: print("Exit.")
  File "test-tf111-tfshoulduse-crash.py", line 219, in test
    line: sys.excepthook(*sys.exc_info())
  File "/home/ebrevdo/.local/lib/python3.5/site-packages/tensorflow/python/util/tf_should_use.py", line 189, in wrapped
    line: return _add_should_use_warning(fn(*args, **kwargs))
==================================

...

I had a similar successful run with TF nightly from September.

ebrevdo avatar Nov 22 '18 00:11 ebrevdo

You used the better_exchook version which includes the workaround for this case. Can you try an older version?



albertz avatar Nov 22 '18 00:11 albertz

I removed better_exchook (removed the import, and the two commands in main) and am still not able to replicate in py3.5.

ebrevdo avatar Nov 22 '18 00:11 ebrevdo

See my earlier explanation. Only with better_exchook you can trigger this crash.


albertz avatar Nov 22 '18 00:11 albertz

oh i see; an older version of better_exchook. checking...


ebrevdo avatar Nov 22 '18 00:11 ebrevdo

ok i was able to replicate the issue. gonna see if i can run this under address sanitizer...

ebrevdo avatar Nov 22 '18 00:11 ebrevdo

OK; asan picked something up:

Colocations handled automatically by placer.
variables:
[<tf.Variable 'b:0' shape=(10, 1, 6) dtype=float32_ref>]
init vars
graph size: 8668
train
step 0, loss: 1.596843
EXCEPTION
Traceback (most recent call last):
  File "test-tf111-tfshoulduse-crash.py", line 75, in test
    line: raise Exception("foo")
    locals:
      Exception = <builtin> <class 'Exception'>
Exception: foo
Exit blah.
atexit handler
EXCEPTION
Traceback (most recent call last):
  File "test-tf111-tfshoulduse-crash.py", line 87, in at_exit_handler
    line: raise Exception("foo")
    locals:
      Exception = <builtin> <class 'Exception'>
Exception: foo
Dummy Goodbye
=================================================================
==229269==ERROR: AddressSanitizer: heap-use-after-free on address 0x62500077e338 at pc 0x55e64a6b258e bp 0x7ffffdcce0c0 sp 0x7ffffdcce0b8
READ of size 8 at 0x62500077e338 thread T0
    #0 0x55e64a6b258d in std::__shared_ptr<tensorflow::NodeProperties, (__gnu_cxx::_Lock_policy)2>::operator->() const crosstool/stable/toolchain/bin/../lib/gcc/x86_64-linux-gnu/version/../../../../x86_64-linux-gnu/include/c++/version/bits/shared_ptr_base.h:1046:9
    #1 0x55e64a6a58df in tensorflow::Node::name() const tensorflow/core/graph/graph.cc:159:43
    #2 0x55e63bd73458 in TF_OperationName tensorflow/c/c_api.cc:1418:21
    #3 0x7fc5b33dfa94 in _wrap_TF_OperationName(_object*, _object*) bazel-out/asan-py3-dbg/genfiles/tensorflow/python/_third__party_tensorflow_python_pywrap__tensorflow__internal.cc:12098:22
    #4 0x55e64d31f265 in _PyCFunction_FastCallDict python_runtime/v3_6/Objects/methodobject.c:234:22
    #5 0x55e64d3a407d in call_function python_runtime/v3_6/Python/ceval.c:4837:9
    #6 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #7 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
    #8 0x55e64d39bd63 in PyEval_EvalCodeEx python_runtime/v3_6/Python/ceval.c:4187:12
    #9 0x55e64d307a85 in function_call python_runtime/v3_6/Objects/funcobject.c:604:14
    #10 0x55e64d2d5e2a in PyObject_Call python_runtime/v3_6/Objects/abstract.c:2261:14
    #11 0x55e64d2f27d2 in property_descr_get python_runtime/v3_6/Objects/descrobject.c:1384:11
    #12 0x55e64d322ba8 in _PyObject_GenericGetAttrWithDict python_runtime/v3_6/Objects/object.c:1066:19
    #13 0x55e64d39ff59 in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:2872:29
    #14 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
    #15 0x55e64d39bd63 in PyEval_EvalCodeEx python_runtime/v3_6/Python/ceval.c:4187:12
    #16 0x55e64d307a85 in function_call python_runtime/v3_6/Objects/funcobject.c:604:14
    #17 0x55e64d2d5e2a in PyObject_Call python_runtime/v3_6/Objects/abstract.c:2261:14
    #18 0x55e64d2f27d2 in property_descr_get python_runtime/v3_6/Objects/descrobject.c:1384:11
    #19 0x55e64d322ba8 in _PyObject_GenericGetAttrWithDict python_runtime/v3_6/Objects/object.c:1066:19
    #20 0x55e64d39ff59 in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:2872:29
    #21 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
    #22 0x55e64d39bd63 in PyEval_EvalCodeEx python_runtime/v3_6/Python/ceval.c:4187:12
    #23 0x55e64d307a85 in function_call python_runtime/v3_6/Objects/funcobject.c:604:14
    #24 0x55e64d2d5e2a in PyObject_Call python_runtime/v3_6/Objects/abstract.c:2261:14
    #25 0x55e64d2f27d2 in property_descr_get python_runtime/v3_6/Objects/descrobject.c:1384:11
    #26 0x55e64d322ba8 in _PyObject_GenericGetAttrWithDict python_runtime/v3_6/Objects/object.c:1066:19
    #27 0x55e64d39ff59 in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:2872:29
    #28 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
    #29 0x55e64d2d60a1 in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2310:18
    #30 0x55e64d2d6295 in _PyObject_Call_Prepend python_runtime/v3_6/Objects/abstract.c:2373:14
    #31 0x55e64d2d605b in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2331:18
    #32 0x55e64d335e89 in slot_tp_repr python_runtime/v3_6/Objects/typeobject.c:6127:15
    #33 0x55e64d321d26 in PyObject_Repr python_runtime/v3_6/Objects/object.c:490:11
    #34 0x55e64d32edca in tuplerepr python_runtime/v3_6/Objects/tupleobject.c:303:13
    #35 0x55e64d321d26 in PyObject_Repr python_runtime/v3_6/Objects/object.c:490:11
    #36 0x55e64d31f3b2 in _PyCFunction_FastCallDict python_runtime/v3_6/Objects/methodobject.c
    #37 0x55e64d3a407d in call_function python_runtime/v3_6/Python/ceval.c:4837:9
    #38 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #39 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
    #40 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
    #41 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #42 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
    #43 0x55e64d3a519b in fast_function python_runtime/v3_6/Python/ceval.c:4978:12
    #44 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
    #45 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #46 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
    #47 0x55e64d3a519b in fast_function python_runtime/v3_6/Python/ceval.c:4978:12
    #48 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
    #49 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #50 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
    #51 0x55e64d3a519b in fast_function python_runtime/v3_6/Python/ceval.c:4978:12
    #52 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
    #53 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #54 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
    #55 0x55e64d3a519b in fast_function python_runtime/v3_6/Python/ceval.c:4978:12
    #56 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
    #57 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #58 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
    #59 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
    #60 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #61 0x55e64d3a4a82 in _PyEval_EvalCodeWithName python_runtime/v3_6/Python/ceval.c:4166:14
    #62 0x55e64d3a519b in fast_function python_runtime/v3_6/Python/ceval.c:4978:12
    #63 0x55e64d3a405a in call_function python_runtime/v3_6/Python/ceval.c:4858:17
    #64 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #65 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
    #66 0x55e64d2d60a1 in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2310:18
    #67 0x55e64d2d6295 in _PyObject_Call_Prepend python_runtime/v3_6/Objects/abstract.c:2373:14
    #68 0x55e64d2d605b in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2331:18
    #69 0x55e64d336a4f in slot_tp_finalize python_runtime/v3_6/Objects/typeobject.c:6463:15
    #70 0x55e64cd7e172 in finalize_garbage python_runtime/v3_6/Modules/gcmodule.c:806:13
    #71 0x55e64cd7b1b7 in collect python_runtime/v3_6/Modules/gcmodule.c:1005:5
    #72 0x55e64cd7ad19 in collect_with_callback python_runtime/v3_6/Modules/gcmodule.c:1128:14
    #73 0x55e64cd7ab60 in PyGC_Collect python_runtime/v3_6/Modules/gcmodule.c:1594:13
    #74 0x55e64d3cb556 in Py_FinalizeEx python_runtime/v3_6/Python/pylifecycle.c:601:5

0x62500077e338 is located 2616 bytes inside of 8192-byte region [0x62500077d900,0x62500077f900)
freed by thread T0 here:
    #0 0x55e62bb55ac2 in __interceptor_free llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:124:3
    #1 0x55e64aa37578 in tensorflow::core::Arena::~Arena() tensorflow/core/lib/core/arena.cc:66:5
    #2 0x55e64a6a928b in tensorflow::Graph::~Graph() tensorflow/core/graph/graph.cc:372:1
    #3 0x55e63bd873e5 in TF_Graph::~TF_Graph() tensorflow/c/c_api_internal.h:75:8
    #4 0x55e63bd7fb9d in TF_DeleteSession tensorflow/c/c_api.cc:2588:14
    #5 0x7fc5b33f65dc in _wrap_TF_DeleteSession(_object*, _object*) bazel-out/asan-py3-dbg/genfiles/tensorflow/python/_third__party_tensorflow_python_pywrap__tensorflow__internal.cc:16303:5
    #6 0x55e64d31f265 in _PyCFunction_FastCallDict python_runtime/v3_6/Objects/methodobject.c:234:22
    #7 0x55e64d3a407d in call_function python_runtime/v3_6/Python/ceval.c:4837:9
    #8 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #9 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
    #10 0x55e64d2d60a1 in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2310:18
    #11 0x55e64d2d6295 in _PyObject_Call_Prepend python_runtime/v3_6/Objects/abstract.c:2373:14
    #12 0x55e64d2d605b in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2331:18
    #13 0x55e64d336a4f in slot_tp_finalize python_runtime/v3_6/Objects/typeobject.c:6463:15
    #14 0x55e64cd7e172 in finalize_garbage python_runtime/v3_6/Modules/gcmodule.c:806:13
    #15 0x55e64cd7b1b7 in collect python_runtime/v3_6/Modules/gcmodule.c:1005:5
    #16 0x55e64cd7ad19 in collect_with_callback python_runtime/v3_6/Modules/gcmodule.c:1128:14
    #17 0x55e64cd7ab60 in PyGC_Collect python_runtime/v3_6/Modules/gcmodule.c:1594:13
    #18 0x55e64d3cb556 in Py_FinalizeEx python_runtime/v3_6/Python/pylifecycle.c:601:5

previously allocated by thread T0 here:
    #0 0x55e62bb56ac9 in __interceptor_posix_memalign llvm/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:219:3
    #1 0x55e62ebbd292 in aligned_malloc(unsigned long, unsigned long) base/port.h:897:7
    #2 0x55e64aa37200 in tensorflow::core::Arena::Arena(unsigned long) tensorflow/core/lib/core/arena.cc:54:31
    #3 0x55e64a6a7d4c in tensorflow::Graph::Graph(tensorflow::OpRegistryInterface const*) tensorflow/core/graph/graph.cc:323:7
    #4 0x55e63bd792bc in TF_Graph::TF_Graph() tensorflow/c/c_api.cc:1854:7
    #5 0x55e63bd7942a in TF_NewGraph tensorflow/c/c_api.cc:1860:38
    #6 0x7fc5b33d3f88 in _wrap_TF_NewGraph(_object*, _object*) bazel-out/asan-py3-dbg/genfiles/tensorflow/python/_third__party_tensorflow_python_pywrap__tensorflow__internal.cc:10165:26
    #7 0x55e64d31f265 in _PyCFunction_FastCallDict python_runtime/v3_6/Objects/methodobject.c:234:22
    #8 0x55e64d3a407d in call_function python_runtime/v3_6/Python/ceval.c:4837:9
    #9 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #10 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
    #11 0x55e64d2d60a1 in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2310:18
    #12 0x55e64d2d6295 in _PyObject_Call_Prepend python_runtime/v3_6/Objects/abstract.c:2373:14
    #13 0x55e64d2d5e2a in PyObject_Call python_runtime/v3_6/Objects/abstract.c:2261:14
    #14 0x55e64d336908 in slot_tp_init python_runtime/v3_6/Objects/typeobject.c:6420:11
    #15 0x55e64d331fc8 in type_call python_runtime/v3_6/Objects/typeobject.c:915:19
    #16 0x55e64d2d605b in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2331:18
    #17 0x55e64d3a4053 in call_function python_runtime/v3_6/Python/ceval.c:4861:17
    #18 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #19 0x55e64d3a5606 in _PyFunction_FastCall python_runtime/v3_6/Python/ceval.c:4919:14
    #20 0x55e64d2d60a1 in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2310:18
    #21 0x55e64d2d6295 in _PyObject_Call_Prepend python_runtime/v3_6/Objects/abstract.c:2373:14
    #22 0x55e64d2d5e2a in PyObject_Call python_runtime/v3_6/Objects/abstract.c:2261:14
    #23 0x55e64d336908 in slot_tp_init python_runtime/v3_6/Objects/typeobject.c:6420:11
    #24 0x55e64d331fc8 in type_call python_runtime/v3_6/Objects/typeobject.c:915:19
    #25 0x55e64d2d605b in _PyObject_FastCallDict python_runtime/v3_6/Objects/abstract.c:2331:18
    #26 0x55e64d3a4053 in call_function python_runtime/v3_6/Python/ceval.c:4861:17
    #27 0x55e64d3a0fed in _PyEval_EvalFrameDefault python_runtime/v3_6/Python/ceval.c:3335:19
    #28 0x55e64d308a38 in gen_send_ex python_runtime/v3_6/Objects/genobject.c:189:14
    #29 0x55e64d398cf8 in builtin_next python_runtime/v3_6/Python/bltinmodule.c:1330:11

SUMMARY: AddressSanitizer: heap-use-after-free crosstool/stable/toolchain/bin/../lib/gcc/x86_64-linux-gnu/version/../../../../x86_64-linux-gnu/include/c++/version/bits/shared_ptr_base.h:1046:9 in std::__shared_ptr<tensorflow::NodeProperties, (__gnu_cxx::_Lock_policy)2>::operator->() const
Shadow bytes around the buggy address:
  0x0c4a800e7c10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a800e7c20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a800e7c30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a800e7c40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a800e7c50: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c4a800e7c60: fa fa fa fa fa fa fa[fa]fa fa fa fa fa fa fa fa
  0x0c4a800e7c70: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a800e7c80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a800e7c90: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a800e7ca0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c4a800e7cb0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==229269==ABORTING

ebrevdo avatar Nov 23 '18 23:11 ebrevdo

@allenlavoie looks like in this case (test-tf111-tfshoulduse-crash.py in python3 with better_exchook == 20171121.105512) we attempt to access graph data after it has been deleted. Presumably this is caused by an interaction with tf_should_use format_stack.

ebrevdo avatar Nov 23 '18 23:11 ebrevdo

Perhaps we can be more careful about when we call format_stack? We do this lazily to avoid the cost of formatting, but is there a way to check that the graph in the stack still exists?

ebrevdo avatar Nov 23 '18 23:11 ebrevdo

We could also consider sanitizing the stack before formatting.
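One possible reading of "sanitizing", sketched under stated assumptions: snapshot the creation stack as plain strings up front with `traceback.extract_stack()`, so the finalizer never holds or walks live frame objects and their locals. `UnusedResultTracker` and `emitted` are hypothetical names, not the actual tf_should_use implementation, and this trades away the lazy-formatting saving mentioned above.

```python
import traceback

emitted = []


class UnusedResultTracker:
    """Hypothetical should-use helper that stores a string-only stack snapshot."""

    def __init__(self, description):
        self._description = description
        self._used = False
        # FrameSummary entries hold filename, line number, function name and
        # source text only; no frame objects and no locals are kept alive.
        self._creation_stack = traceback.extract_stack()[:-1]  # drop this __init__ frame

    def mark_used(self):
        self._used = True

    def __del__(self):
        if self._used:
            return
        # Formatting here touches only the stored strings.
        stack_text = "".join(traceback.format_list(self._creation_stack))
        emitted.append(
            "Object was never used: %s\nIt was originally created here:\n%s"
            % (self._description, stack_text)
        )


def make_and_drop():
    UnusedResultTracker("TensorArray")  # result ignored; CPython frees it right away


make_and_drop()
```

A formatter that replaces `traceback.format_list` could still do arbitrary work at this point, but it would no longer have frames or locals to call `__repr__` on.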

ebrevdo avatar Nov 23 '18 23:11 ebrevdo

So, to make it clear: There is a Python object which corresponds to a graph in C++ which does not exist anymore, or has become invalid? How is this possible? This is via Swig, right? I thought that Swig does some sort of reference counting.

Or does the C++ graph object itself still exist, but accessing it becomes invalid? Is there a flag or so that marks that the object is invalid now? Maybe there should just be a check for this flag and if the object is invalid, any related functions should return some sane value (None or so) or throw a Python exception, instead of this crash?

I feel like cleaning/sanitizing the stack trace to try to avoid any possible access to such objects is just a workaround to the problem.

albertz avatar Nov 26 '18 12:11 albertz

I tried to write a simpler test case. See the commit I just referenced. That code sometimes crashes in various different ways.

albertz avatar Nov 26 '18 14:11 albertz

Oh interesting, good find. So maybe we just need to set some Python properties to None when the destructor for the C Graph object runs? https://github.com/tensorflow/tensorflow/blob/73f193aa4b9999e0a5bf7d29b1838e2a662e9507/tensorflow/python/framework/c_api_util.py#L52
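A rough sketch of that idea in plain Python, combined with the validity check suggested earlier in the thread. All names here (`NativeGraph`, `GraphHandle`, `Operation`) are hypothetical stand-ins and do not correspond to the actual classes in c_api_util.py or ops.py; the point is only that the handle is set to None on deletion and accessors check it before touching native state.

```python
class NativeGraph:
    """Hypothetical stand-in for the C graph object."""

    def __init__(self):
        self.node_names = {}


class GraphHandle:
    """Hypothetical owner of the native graph."""

    def __init__(self):
        self.c_graph = NativeGraph()

    def close(self):
        # Models the point where the native graph is deleted: drop the
        # Python-side reference so later accesses can detect it.
        self.c_graph = None


class Operation:
    """Hypothetical operation wrapper whose name lives in the native graph."""

    def __init__(self, graph_handle, name):
        self._graph_handle = graph_handle
        self._key = id(self)
        graph_handle.c_graph.node_names[self._key] = name

    @property
    def name(self):
        c_graph = self._graph_handle.c_graph
        if c_graph is None:
            # Raise a Python exception instead of reading freed memory.
            raise RuntimeError("graph was already deleted")
        return c_graph.node_names[self._key]

    def __repr__(self):
        try:
            return "<Operation %r>" % self.name
        except RuntimeError:
            return "<Operation (graph deleted)>"


handle = GraphHandle()
op = Operation(handle, "unstack")
repr_before = repr(op)
handle.close()
repr_after = repr(op)
```

With a check like this, a late `__repr__` from a finalizer or a debugger would get a placeholder string rather than a segfault.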

allenlavoie avatar Nov 26 '18 18:11 allenlavoie

Hi @albertz! It is no longer crashing in the 2.x version. Attached gist for reference. Shall we consider it resolved now? Thank you!

mohantym avatar Jul 19 '22 11:07 mohantym

@mohantym Is this also for the code in https://github.com/albertz/playground/commit/114bcaf7abe9da0c083ec64e01bac2357806d523 ?

What was done to resolve this?

albertz avatar Jul 19 '22 12:07 albertz

@albertz! I used compatibility mode, as it was originally a 1.x codebase (I updated the comments in the code). Yeah, I got the code from your GitHub repo only. Thank you!

mohantym avatar Jul 19 '22 12:07 mohantym

Yeah, I got the code from your GitHub repo only.

In your gist, you had the initial code here in this issue, but I was referring to this simplified code: https://github.com/albertz/playground/commit/114bcaf7abe9da0c083ec64e01bac2357806d523

albertz avatar Jul 19 '22 13:07 albertz

Hi @albertz! I am facing an attribute error with the new code. Attached gist for reference. Could you share a Colab gist with the error from your side? Thank you!

mohantym avatar Aug 02 '22 11:08 mohantym

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler[bot] avatar Aug 09 '22 12:08 google-ml-butler[bot]

@mohantym I updated the code for TF2. Please see here: https://github.com/albertz/playground/blob/master/tf-crash-use-after-delete-graph.py The problem is still there. It still crashes.

albertz avatar Aug 09 '22 12:08 albertz

@albertz! I am observing different behaviour in 2.9 and nightly. Could you let us know what you see on your end? Thank you!

mohantym avatar Aug 10 '22 04:08 mohantym

I can replicate the error in tf-nightly in Python 3.10. Running in gdb, here's the stack trace at segfault:

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
__strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:74
74	../sysdeps/x86_64/multiarch/strlen-avx2.S: No such file or directory.
(gdb) bt
#0  __strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:74
#1  0x00007fffbc50f975 in pybind11::detail::type_caster<char, void>::cast(char const*, pybind11::return_value_policy, pybind11::handle) ()
   from /home/ebrevdo/.local/lib/python3.10/site-packages/tensorflow/python/client/_pywrap_tf_session.so
#2  0x00007fffbc52d350 in pybind11::cpp_function::initialize<char const* (*&)(TF_Operation*), char const*, TF_Operation*, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::call_guard<pybind11::gil_scoped_release> >(char const* (*&)(TF_Operation*), char const* (*)(TF_Operation*), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) ()
   from /home/ebrevdo/.local/lib/python3.10/site-packages/tensorflow/python/client/_pywrap_tf_session.so
#3  0x00007fffbc5359a1 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()
   from /home/ebrevdo/.local/lib/python3.10/site-packages/tensorflow/python/client/_pywrap_tf_session.so
#4  0x00005555556dbfee in ?? ()
#5  0x00005555556d2c93 in _PyObject_MakeTpCall ()
#6  0x00005555556cc65d in _PyEval_EvalFrameDefault ()
#7  0x00005555556dc798 in _PyFunction_Vectorcall ()
#8  0x00005555556c6ee7 in _PyEval_EvalFrameDefault ()
#9  0x00005555556dc798 in _PyFunction_Vectorcall ()
#10 0x00005555556c6ee7 in _PyEval_EvalFrameDefault ()
#11 0x00005555557aaa22 in ?? ()
#12 0x00005555557aa962 in PyEval_EvalCode ()
#13 0x00005555557d1374 in ?? ()
#14 0x00005555557cbbdb in ?? ()
#15 0x00005555557d1121 in ?? ()
#16 0x00005555557d0754 in _PyRun_SimpleFileObject ()
#17 0x00005555557d04b3 in _PyRun_AnyFileObject ()
#18 0x00005555557c42ca in Py_RunMain ()
#19 0x000055555579ee19 in Py_BytesMain ()
#20 0x00007ffff7c337fd in __libc_start_main (main=0x55555579ede0, argc=2, argv=0x7fffffffde18, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffde08) at ../csu/libc-start.c:332
#21 0x000055555579ed1a in _start ()

ebrevdo avatar Aug 10 '22 05:08 ebrevdo

(that said, the new tf2 code calls tf.disable_v2_behavior(), which drops you back in TF1-mode. I don't know the support story for this anymore...)

ebrevdo avatar Aug 10 '22 05:08 ebrevdo