confluent-kafka-python icon indicating copy to clipboard operation
confluent-kafka-python copied to clipboard

Segmentation fault on GC runs in multithreaded context

Open progval opened this issue 1 year ago • 4 comments

Description

I am getting the following crash in very specific conditions, while the Python GC is running.

I can reliably reproduce it in a pytest test suite of an (unfortunately) non-trivial code base. Each test of the suite starts a producer, writes a few messages, then starts a multiprocessing.Manager, then a consumer and consumes topics to the end.

The third case to run segfaults when I manually trigger the GC in a specific place (or the fourth, when the GC eventually runs on its own).

==3552160== Invalid read of size 4
==3552160==    at 0x4A28B17: __pthread_clockjoin_ex (pthread_join_common.c:43)
==3552160==    by 0x4A2F96F: thrd_join@@GLIBC_2.34 (thrd_join.c:26)
==3552160==    by 0x774B3DB: rd_kafka_destroy_app (in /home/dev/.local/lib/python3.11/site-packages/confluent_kafka.libs/librdkafka-55260171.so.1)
==3552160==    by 0x75A86C3: Producer_dealloc (Producer.c:103)
==3552160==    by 0x5CCB14: UnknownInlinedFun (object.c:2390)
==3552160==    by 0x5CCB14: UnknownInlinedFun (object.h:538)
==3552160==    by 0x5CCB14: UnknownInlinedFun (dictobject.c:5566)
==3552160==    by 0x5CCB14: subtype_clear.lto_priv.0 (typeobject.c:1288)
==3552160==    by 0x50ABCD: UnknownInlinedFun (gcmodule.c:1013)
==3552160==    by 0x50ABCD: gc_collect_main.lto_priv.0 (gcmodule.c:1287)
==3552160==    by 0x6303E6: gc_collect_with_callback.lto_priv.0 (gcmodule.c:1400)
==3552160==    by 0x6756CF: UnknownInlinedFun (gcmodule.c:1538)
==3552160==    by 0x6756CF: gc_collect.lto_priv.0 (gcmodule.c.h:110)
==3552160==    by 0x53ACF6: cfunction_vectorcall_FASTCALL_KEYWORDS.lto_priv.0 (methodobject.c:443)
==3552160==    by 0x53AC2B: UnknownInlinedFun (pycore_call.h:92)
==3552160==    by 0x53AC2B: PyObject_Vectorcall (call.c:299)
==3552160==    by 0x52B93F: _PyEval_EvalFrameDefault (ceval.c:4772)
==3552160==    by 0x5855A3: UnknownInlinedFun (pycore_ceval.h:73)
==3552160==    by 0x5855A3: UnknownInlinedFun (ceval.c:6435)
==3552160==    by 0x5855A3: UnknownInlinedFun (call.c:393)
==3552160==    by 0x5855A3: _PyObject_VectorcallTstate.lto_priv.0 (pycore_call.h:92)
==3552160==  Address 0x1201f990 is not stack'd, malloc'd or (recently) free'd
==3552160== 
Fatal Python error: Segmentation fault

Current thread 0x0000000004b80880 (most recent call first):
  Garbage-collecting
  File "/home/dev/swh-environment/swh-dataset/build/__editable__.swh.dataset-1.5.2.dev7+g4fccea4.d20240523-py3-none-any/swh/dataset/journalprocessor.py", line 348 in progress_worker
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 257 in _process_worker
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 71 in _launch
  File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 19 in __init__
  File "/usr/lib/python3.11/multiprocessing/context.py", line 281 in _Popen
  File "/usr/lib/python3.11/multiprocessing/process.py", line 121 in start
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 761 in _spawn_process
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 750 in _launch_processes
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 723 in _start_executor_manager_thread
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 785 in submit
  File "/home/dev/swh-environment/swh-dataset/build/__editable__.swh.dataset-1.5.2.dev7+g4fccea4.d20240523-py3-none-any/swh/dataset/journalprocessor.py", line 316 in run
  File "/home/dev/swh-environment/swh-dataset/swh/dataset/test/test_journal_processor.py", line 170 in test_parallel_journal_processor_origin_visit_status
  File "/home/dev/swh-environment/swh-dataset/swh/dataset/test/test_journal_processor.py", line 50 in newf
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/python.py", line 1772 in runtest
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 172 in pytest_runtest_call
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 240 in <lambda>
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 340 in from_call
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 239 in call_and_report
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 134 in runtestprotocol
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 115 in pytest_runtest_protocol
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/main.py", line 364 in pytest_runtestloop
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/main.py", line 339 in _main
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/main.py", line 285 in wrap_session
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/main.py", line 332 in pytest_cmdline_main
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/config/__init__.py", line 174 in main
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/config/__init__.py", line 197 in console_main
  File "/home/dev/.local/bin/pytest", line 8 in <module>

Extension modules: confluent_kafka.cimpl, msgpack._cmsgpack, psutil._psutil_linux, psutil._psutil_posix, psycopg_c.pq, psycopg_c._psycopg, markupsafe._speedups, _brotli, simplejson._speedups, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils, tornado.speedups, plyvel._plyvel (total: 20)
==3552160== 
==3552160== Process terminating with default action of signal 11 (SIGSEGV)
==3552160==    at 0x4A28E2C: __pthread_kill_implementation (pthread_kill.c:44)
==3552160==    by 0x49D9FB1: raise (raise.c:26)
==3552160==    by 0x49DA04F: ??? (in /usr/lib/x86_64-linux-gnu/libc.so.6)
==3552160==    by 0x4A28B16: __pthread_clockjoin_ex (pthread_join_common.c:43)
==3552160== 
==3552160== HEAP SUMMARY:
==3552160==     in use at exit: 29,294,834 bytes in 30,069 blocks
==3552160==   total heap usage: 219,797 allocs, 189,728 frees, 522,042,917 bytes allocated
==3552160== 
==3552160== LEAK SUMMARY:
==3552160==    definitely lost: 4,547 bytes in 25 blocks
==3552160==    indirectly lost: 12,610 bytes in 30 blocks
==3552160==      possibly lost: 423,923 bytes in 214 blocks
==3552160==    still reachable: 28,853,754 bytes in 29,800 blocks
==3552160==         suppressed: 0 bytes in 0 blocks
==3552160== Rerun with --leak-check=full to see details of leaked memory
==3552160== 
==3552160== For lists of detected and suppressed errors, rerun with: -s
==3552160== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
==3552159== 
==3552159== Process terminating with default action of signal 15 (SIGTERM)
==3552159==    at 0x4A23E96: __futex_abstimed_wait_common64 (futex-internal.c:57)
==3552159==    by 0x4A23E96: __futex_abstimed_wait_common (futex-internal.c:87)
==3552159==    by 0x4A2683B: __pthread_cond_wait_common (pthread_cond_wait.c:503)
==3552159==    by 0x4A2683B: pthread_cond_timedwait@@GLIBC_2.3.2 (pthread_cond_wait.c:643)
==3552159==    by 0x4A2F5D8: cnd_timedwait@@GLIBC_2.34 (cnd_timedwait.c:27)
==3552159==    by 0x7782B10: rd_kafka_q_serve_rkmessages (in /home/dev/.local/lib/python3.11/site-packages/confluent_kafka.libs/librdkafka-55260171.so.1)
==3552159==    by 0x7743C3B: rd_kafka_consume_batch_queue (in /home/dev/.local/lib/python3.11/site-packages/confluent_kafka.libs/librdkafka-55260171.so.1)
==3552159==    by 0x75A5BB6: Consumer_consume (Consumer.c:1056)
==3552159==    by 0x550BBD: method_vectorcall_VARARGS_KEYWORDS.lto_priv.0 (descrobject.c:364)
==3552159==    by 0x53AC2B: UnknownInlinedFun (pycore_call.h:92)
==3552159==    by 0x53AC2B: PyObject_Vectorcall (call.c:299)
==3552159==    by 0x52B93F: _PyEval_EvalFrameDefault (ceval.c:4772)
==3552159==    by 0x5855A3: UnknownInlinedFun (pycore_ceval.h:73)
==3552159==    by 0x5855A3: UnknownInlinedFun (ceval.c:6435)
==3552159==    by 0x5855A3: UnknownInlinedFun (call.c:393)
==3552159==    by 0x5855A3: _PyObject_VectorcallTstate.lto_priv.0 (pycore_call.h:92)
==3552159==    by 0x585173: method_vectorcall.lto_priv.0 (classobject.c:59)
==3552159==    by 0x56A380: UnknownInlinedFun (call.c:257)
==3552159==    by 0x56A380: UnknownInlinedFun (call.c:328)
==3552159==    by 0x56A380: PyObject_Call (call.c:355)
==3552159== 
==3552159== HEAP SUMMARY:
==3552159==     in use at exit: 29,630,515 bytes in 30,288 blocks
==3552159==   total heap usage: 220,309 allocs, 190,021 frees, 522,280,354 bytes allocated
==3552159== 
==3552143== 
==3552143== HEAP SUMMARY:
==3552143==     in use at exit: 29,243,757 bytes in 29,955 blocks
==3552143==   total heap usage: 218,912 allocs, 188,957 frees, 522,173,350 bytes allocated
==3552143== 
==3552143== LEAK SUMMARY:
==3552143==    definitely lost: 576 bytes in 18 blocks
==3552143==    indirectly lost: 0 bytes in 0 blocks
==3552143==      possibly lost: 371,032 bytes in 200 blocks
==3552143==    still reachable: 28,872,149 bytes in 29,737 blocks
==3552143==         suppressed: 0 bytes in 0 blocks
==3552143== Rerun with --leak-check=full to see details of leaked memory
==3552143== 
==3552143== For lists of detected and suppressed errors, rerun with: -s
==3552143== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==3552159== LEAK SUMMARY:
==3552159==    definitely lost: 608 bytes in 19 blocks
==3552159==    indirectly lost: 0 bytes in 0 blocks
==3552159==      possibly lost: 676,795 bytes in 245 blocks
==3552159==    still reachable: 28,953,112 bytes in 30,024 blocks
==3552159==         suppressed: 0 bytes in 0 blocks
==3552159== Rerun with --leak-check=full to see details of leaked memory
==3552159== 
==3552159== For lists of detected and suppressed errors, rerun with: -s
==3552159== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

How to reproduce

I'm afraid I was not able to reproduce it with a simple script.

Clone https://gitlab.softwareheritage.org/swh/devel/swh-dataset.git , check out f8b5c8665ac802f8b9dd15feb11dad2c2eadaee3, then applying this patch (segfaults also happen without it, but it makes them more deterministic):

diff --git a/swh/dataset/journalprocessor.py b/swh/dataset/journalprocessor.py
index ab62cb1..87bdb32 100644
--- a/swh/dataset/journalprocessor.py
+++ b/swh/dataset/journalprocessor.py
@@ -24,6 +24,7 @@
     Set,
     Tuple,
 )
+import gc

 from confluent_kafka import Message, TopicPartition
 import tqdm
@@ -344,6 +345,7 @@ def progress_worker(self, queue=None) -> None:
         active_workers = self.processes
         offset_diff = sum((hi - lo) for lo, hi in self.offsets.values())
         desc = "  - Export"
+        gc.collect()
         with tqdm.tqdm(total=offset_diff, desc=desc, unit_scale=True) as pbar:
             while active_workers:
                 item = queue.get()
Installed Python packages
androguard==3.4.0a1
apksigcopier==1.1.1
appdirs==1.4.4
argcomplete==2.0.0
asn1crypto==1.5.1
asttokens==2.2.1
attrs==22.2.0
backcall==0.2.0
beautifulsoup4==4.11.2
beniget==0.4.1
binwalk==2.3.3
black==22.10.0
Brotli==1.0.9
cassandra-pylib==0.0.0
certifi==2022.9.24
cfgv==3.3.1
chardet==5.1.0
charset-normalizer==3.0.1
click==8.1.3
colorama==0.4.6
contourpy==1.0.7
cryptography==38.0.4
cssselect==1.2.0
cupshelpers==1.0
cycler==0.11.0
dbus-python==1.3.2
decorator==5.1.1
defusedxml==0.7.1
diffoscope==240
distlib==0.3.6
distro==1.8.0
dulwich==0.21.2
EditorConfig==0.12.3
elementpath==2.5.3
enjarify==1.0.3
executing==1.2.0
fastimport==0.9.14
filelock==3.9.0
fonttools==4.38.0
fs==2.4.16
gast==0.5.2
GDAL==3.6.2
gunicorn==20.1.0
html5lib==1.1
httpie==3.2.1
httplib2==0.20.4
identify==2.5.17
idna==3.3
importlib-metadata==4.12.0
iotop==0.6
ipython==8.5.0
jedi==0.18.2
jsbeautifier==1.14.4
jsondiff==2.0.0
kiwisolver==0.0.0
libarchive-c==2.9
libevdev==0.5
lxml==4.9.2
lz4==4.0.2+dfsg
markdown-it-py==2.1.0
matplotlib==3.6.3
matplotlib-inline==0.1.6
mdurl==0.1.2
mechanize==0.4.8
more-itertools==8.10.0
mpmath==0.0.0
msgpack==1.0.3
multidict==6.0.4
mypy-extensions==1.0.0
netifaces==0.11.0
networkx==2.8.8
nodeenv==0.13.4
numpy==1.24.2
olefile==0.46
oscrypto==1.3.0
packaging==23.0
parso==0.8.3
pathspec==0.11.1
pbr==5.10.0
pdfminer.six==20221105
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.4.0
platformdirs==2.6.0
ply==3.11
powerline-status==2.8.3
pre-commit==3.0.4
progressbar==2.5
prompt-toolkit==3.0.36
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.0.0
pycairo==1.20.1
pycryptodomex==3.11.0
pycups==2.0.1
pycurl==7.45.2
pydot==1.4.2
Pygments==2.14.0
PyGObject==3.42.2
pygraphviz==1.7
PyOpenGL==3.1.6
pyparsing==3.0.9
pypdf==3.4.1
pyperclip==1.8.2
pyprof2calltree==1.4.5
PyQt5==5.15.9
PyQt5-sip==12.11.1
PyQt6==6.4.2
PyQt6-sip==13.4.1
pyqtgraph==0.13.1
PySimpleSOAP==1.16.2
pysmbc==1.0.23
PySocks==1.7.1
pyte==0.8.0
python-apt==2.6.0
python-dateutil==2.8.2
python-debian==0.1.49
python-debianbts==4.0.1
python-magic==0.4.26
pythran==0.11.0
pytz==2022.7.1
pyudev==0.24.0
pyxattr==0.8.1
PyYAML==6.0
reportbug==12.0.0
requests==2.28.1
requests-toolbelt==0.10.1
rich==13.3.1
rpm==4.18.0
rubber==1.6.0
scipy==1.10.1
scour==0.38.2
six==1.16.0
soupsieve==2.3.2
stack-data==0.6.2
stevedore==4.0.2
sympy==1.11.1
tlsh==0.2.0
tqdm==4.64.1
traitlets==5.5.0
ufoLib2==0.14.0
unicodedata2==15.0.0
urllib3==1.26.12
virtualenv==20.17.1+ds
virtualenv-clone==0.3.0
virtualenvwrapper==4.8.4
wcwidth==0.2.5
webencodings==0.5.1
xkcdpass==1.19.3
xmlschema==1.10.0
xmltodict==0.13.0
zipp==1.0.0
zstd==1.5.2.5

Checklist

Please provide the following information:

  • [x] confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()): ('2.4.0', 33816576) and ('2.4.0', 33816831)

  • [ ] Apache Kafka broker version: none

  • [x] Client configuration: {'auto.offset.reset': 'earliest', 'bootstrap.servers': '127.0.0.1:45305', 'debug': 'all', 'enable.auto.commit': False, 'enable.partition.eof': True, 'error_cb': <function _error_cb at 0x7efc3959aa20>, 'group.id': 'test-consumer-gqizofzwretest_parallel_journal_processor', 'logger': <Logger swh.journal.client.rdkafka (DEBUG)>, 'message.max.bytes': '524288000', 'on_commit': <function _on_commit at 0x7efc3959bce0>}

  • [x] Operating system: Debian 12

  • [x] Provide client logs (with 'debug': '..' as necessary): crash_log.txt

  • [ ] Provide broker log excerpts

  • [ ] Critical issue

progval avatar Jun 04 '24 12:06 progval