pulsar
pulsar copied to clipboard
[fix] [python client] Better Python garbage collection management for C++-owned objects
Fixes https://github.com/apache/pulsar/issues/16527
Motivation
Copied from issue discussion:
The problem is the GIL. When Python object reference counts are manipulated from compiled code, those manipulations are not atomic or protected by the GIL in any way. Incrementing a refcount is often coincidentally safe to do without the GIL, since the data structures in the Python interpreter that are altered by a refcount-bump are few and not terribly shared. However, decrementing a refcount without the GIL is extremely dangerous; the act of decrementing a refcount can trigger object destruction, which can then trigger more object destruction, and so on: decrementing a refcount triggers an arbitrary number of user functions (destructors) to run in Python, and can trigger wide-ranging changes (including system calls, memory allocation/deallocation--basically anything) across the interpreter's internal state.
Running such operations in true multi-threaded parallel in Python is basically guaranteed to break things. In most cases (I'm guessing here, as I don't know Boost/C++ well), I think the attempt to clean up the reference either blocks or fails in such a way that the C++ runtime won't properly clean up an object, preventing thread reaping from running internally. In some cases, the racy python GC operations overlap with shared interpreter data structures and cause segfaults. In rare cases, "impossible" (i.e. random objects changing type) errors are raised in Python itself, though you may have to run the above snippet for a long time to see one of those.
Such GIL-unprotected refcount manipulation occurs in the Pulsar client here, here, here, and here, though the first and third may be safe from this condition by dint of the fact that they're only invoked directly from calling Python code which already has the GIL.
Modifications
- Done protected logger handle refcount mutations with the GIL, resolving a segfault risk.
- Done resolve thread leak.
- Done add tests.
- Done don't acquire GIL for logging messages that won't be emitted due to log level.
Verifying this change
This change added tests and can be verified via the Python client test suite: python pulsar_test.py
.
- [x]
doc-not-needed
Bugfix; external contracts do not change.
WIP notes: the segfault is fixed as expected, but threads now leak if the log level's low enough to result in at least one call to the Python logger's _log
internal function.
Pull on the thread of "why" and you arrive at an unconditional call to threading.currentThread
down in Python's logging machinery. If that method is called inside a Pulsar thread, the thread leaks thereafter; somehow, a reference to it is manipulated such that the thread doesn't go away on the system.
Fortunately, there's a way to turn that off. Unfortunately, that's the logging.logThreads = 1
or = 0
module global. That's a weird and sad Python config API--not only is there no way to modify it in a race-proof way, but boost-python doesn't make it easy to manipulate the value within the client's C++ code, so the usual import("logging").attr("logThreads") = object(0)
doesn't seem to work (at least for me, my C++ is basically nonexistent, and my Boost is worse).
The easiest thing I know how to do is write a Python wrapper for the logger that temporarily manipulates that value, which I'l try next.
Edit: this is fixed; issue was due to some kinda silly assumptions boost::python
makes when .attr
accessing a const py::object
.
WIP update:
- Fixed thread leak by setting
logThreads
toFalse
inside the C++ code, which should prevent thread-switching by the Python interpreter. I might be wrong about that, but the worst case behavior is a very occasional thread leak of the sort that occurs 100% of the time now, so this is still an improvement. - Added unit test.
- Reduced unnecessary GIL-acquire operations that happened when log messages were sent at levels that weren't enabled for logging.
~Assistance requested from a maintainer: the license check is failing, but I didn't add any new files. How can I tell what files it would like me to add a license for?~
Resolved
Any update on this? To my knowledge it should be mergeable.
Could you fix this test?
======================================================================
FAIL: test_async_func_with_custom_logger (__main__.CustomLoggingTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "custom_logger_test.py", line 48, in test_async_func_with_custom_logger
self.assertEqual(value, result)
AssertionError: 'foo' != None
@BewareMyPower fixing that test is proving troublesome. Something happening in the destructor is corrupting the python interpreter state and causing that test to fail. The failure's not a testing issue though; it's a real problem.
I'll keep working on it.
@BewareMyPower I redid some of this work and pulled more logging related logic into pure Python, which is now passed as a callback function into the C++ code. That ends up having fewer issues, as many of the boost::python
utility functions (like import
) themselves run the various PyErr
functions internally, which mutates the error state that is being relied upon by asyncio
for that failing test.
The code should be a little simpler now, though the line count went up (sorry).
Additionally, the PyError_Restore
pattern added for the new logging code should probably be used for other "C++ calls python and shouldn't mess up pre-existing interpreter error state" hooks in pulsar-client
(e.g. consumer message listeners and the like).
@zbentley Could you merge latest master to have some CI fixes?
@merlimat @Demogorgon314 @RobertIndie Could anyone take a second look?
@tuteng done
/pulsarbot run-failure-checks
could you please cherry-pick this PR to branch-2.9? thanks.
@congbobo184 Done.