confluent-kafka-python
Segfault of binary wheel on Red Hat Enterprise 8 Python 3.6/3.8
Description
When I try to use the Python 3.8 binary wheel on RHEL 8 and instantiate a confluent_kafka.Producer instance using SSL, I get a segmentation fault.
How to reproduce
Run the following steps on a RHEL8 workstation:
dnf install python38 python38-devel python38-pip python38-pip-wheel python38-setuptools python38-setuptools-wheel
python3.8 -m venv venv-kafka
. venv-kafka/bin/activate
pip install -U pip wheel setuptools
pip install confluent_kafka # This will install the publicly available binary wheel for confluent-kafka
And then, in the created virtualenv, run a script to instantiate a Producer using SSL. Here is my sample script:
import confluent_kafka
config = {
    "bootstrap.servers": "my-kafka-broker.local:8443",
    "security.protocol": "SSL",
    "ssl.ca.location": "/etc/pki/tls/certs/ca-bundle.crt",
    "ssl.certificate.location": "me.crt",
    "ssl.key.location": "me.key",
}
p = confluent_kafka.Producer(config)
print(p.list_topics().brokers)
Checklist
Please provide the following information:
- [x] confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()): confluent_kafka.version() returns ('2.0.2', 33554944); confluent_kafka.libversion() returns ('2.0.2', 33555199). To be noted: I am using the binary wheels that are publicly provided;
- [x] Apache Kafka broker version: I do not know the version of the broker, but we are using the Confluent Kafka server distribution, at version 7.3.1;
- [x] Client configuration: you can see it in my sample script;
- [x] Operating system: Linux, Red Hat Enterprise Linux release 8.5 (Ootpa)
- [x] Provide client logs (with
'debug': '..'as necessary) - no logs were emitted as far as I could see; - [x] Provide broker log excerpts - no connection is made to the broker;
- [ ] Critical issue - this issue is preventing us from deploying the new version of the confluent Kafka client, but I am not sure I would rate it as critical;
Diagnosis information
Following the instructions on https://access.redhat.com/solutions/56021 I was able to obtain a core dump.
There is also a JSON file, produced by the system itself, containing the stack trace of the error. The point I found interesting is that the stack trace seems to go through the CRYPTO_THREAD_read_lock defined in /usr/lib64/libcrypto.so.1.1.1k (instead of using the version defined in librdkafka-e18edbcd.so.1).
Other notes about the issue:
- I was able to reproduce the issue on RHEL 8 Python 3.6;
- The issue does not occur on RHEL 8 Python 3.9;
- The issue does not occur on RHEL 7, whether on Python 3.6 or Python 3.8;
Thank you for your help.
Is it possible to reproduce this after adding a few debug logs? To add the logs, please change the config to add the property "debug", with the value "all".
It will also tell us what version of OpenSSL it is working against, how it is being linked, etc.
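For reference, enabling the debug output amounts to adding one key to the configuration dictionary from the reproduction script (a minimal sketch; all other values are taken from the script above):

```python
# Same configuration as in the reproduction script, with debug logging enabled.
config = {
    "bootstrap.servers": "my-kafka-broker.local:8443",
    "security.protocol": "SSL",
    "ssl.ca.location": "/etc/pki/tls/certs/ca-bundle.crt",
    "ssl.certificate.location": "me.crt",
    "ssl.key.location": "me.key",
    "debug": "all",  # emit all librdkafka debug contexts to stderr
}
```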
At a glance, this seems like an issue related to OpenSSL; I see a similar issue in the OpenSSL repository, https://github.com/openssl/openssl/issues/13469 . (It's not exactly the same thing, but it might not be unrelated. The OpenSSL version being used, 1.1.1k, is from March 2021.)
Either way, the debug logs will be helpful for further understanding the issue.
Also, it's worth checking whether you have the OPENSSL_CONF or OPENSSL_MODULES environment variables defined, or any other OpenSSL environment variable.
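A quick way to check for such variables from within the affected interpreter (a minimal sketch):

```python
import os

# List any OpenSSL-related environment variables currently set.
openssl_vars = {k: v for k, v in os.environ.items() if k.startswith("OPENSSL_")}
print(openssl_vars or "no OPENSSL_* variables defined")
```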
debug=all log output
Here is the console output when I add debug: all to the configuration in the above script.
confluent-kafka-python version: ('2.0.2', 33554944)
librdkafka version: ('2.0.2', 33555199)
%7|1676914550.376|OPENSSL|rdkafka#producer-1| [thrd:app]: Using statically linked OpenSSL version OpenSSL 1.1.1k FIPS 25 Mar 2021 (0x101010bf, librdkafka built with 0x30000070)
Segmentation fault (core dumped)
As mentioned in the issue description, I am using the binary wheel that is downloaded from PyPI.
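As an aside, the two hex codes in the OPENSSL debug line can be decoded using OpenSSL's documented version-number encodings (0xMNNFFPPS before 3.0, 0xMNN00PP0 from 3.0 on); a small sketch:

```python
def decode_openssl_version(v: int) -> str:
    """Decode an OPENSSL_VERSION_NUMBER-style hex code to a version string."""
    major = (v >> 28) & 0xF
    minor = (v >> 20) & 0xFF
    if major >= 3:
        # 3.0+ scheme: 0xMNN00PP0 (major, minor, patch)
        patch = (v >> 4) & 0xFFFF
        return f"{major}.{minor}.{patch}"
    # pre-3.0 scheme: 0xMNNFFPPS (major, minor, fix, patch letter, status)
    fix = (v >> 12) & 0xFF
    patch = (v >> 4) & 0xFF
    letter = chr(ord("a") + patch - 1) if patch else ""
    return f"{major}.{minor}.{fix}{letter}"

print(decode_openssl_version(0x101010bf))  # 1.1.1k  (OpenSSL found at runtime)
print(decode_openssl_version(0x30000070))  # 3.0.7   (OpenSSL librdkafka was built with)
```

So the log itself already shows a mismatch: librdkafka was built against OpenSSL 3.0.7 but is reporting 1.1.1k at runtime.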
Environment variables
No OPENSSL_* environment variable is defined when I reproduce the issue.
Looking at the backtrace in more detail
If you look at the provided core_backtrace.json in the attached zip file, you will see that the backtrace begins with the following three frames:
[
{
"address": 140491546574294,
"build_id": "a5a5466e6834e61eaf42ea2ffed99dfe637a98d5",
"build_id_offset": 52694,
"function_name": "__pthread_rwlock_rdlock",
"file_name": "/usr/lib64/libpthread-2.28.so"
},
{
"address": 140491558911677,
"build_id": "1d8171e86254ca15d3543450a8888178dd288b8e",
"build_id_offset": 2031293,
"function_name": "CRYPTO_THREAD_read_lock",
"file_name": "/usr/lib64/libcrypto.so.1.1.1k"
},
{
"address": 140491308366989,
"build_id": "bbc92339b8c04c8d1a9a2eb29cacf974f11b8438",
"build_id_offset": 4385933,
"function_name": "ossl_lib_ctx_get_data",
"file_name": "/home/ldap/hominaje/venv-kafka/lib/python3.8/site-packages/confluent_kafka.libs/librdkafka-e18edbcd.so.1"
}
]
You can see in that backtrace that:
- The ossl_lib_ctx_get_data function is provided by the statically linked version of OpenSSL shipped in confluent_kafka.libs/librdkafka-e18edbcd.so.1;
- The previous function, which corresponds to a call into libcrypto, is provided by the system library /usr/lib64/libcrypto.so.1.1.1k;
Given the tight coupling between libssl and libcrypto, I would not expect a statically linked version of openssl to work with the system version of libcrypto.
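One way to observe this mixing at runtime is to look at which images the process has actually mapped; a minimal sketch (Linux only, intended to be run in the affected virtualenv after importing confluent_kafka):

```python
import os

def loaded_crypto_images():
    # /proc/self/maps lists every file mapped into this process; with the
    # binary wheel imported, one would expect to see both the system
    # libcrypto.so.1.1 and the bundled librdkafka-*.so appear here.
    if not os.path.exists("/proc/self/maps"):
        return []  # not Linux
    with open("/proc/self/maps") as f:
        return sorted({line.split()[-1] for line in f
                       if "libcrypto" in line or "librdkafka" in line})

print(loaded_crypto_images())
```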
About the linked issue on OpenSSL
I would be surprised if there were a link:
- That issue was reported on "OpenSSL 3.0.0-alpha9-dev", whereas RHEL 8 ships only OpenSSL 1.1.1k;
- The function in which the fix for that issue was made, ossl_ctx_thread_stop, does not appear in the stack trace of the segfault being reported;
However, the existence of a link may not matter: since the bug occurs while using the OpenSSL version that you statically link into the manylinux wheel on PyPI, any bug in that OpenSSL binary would have to be fixed by you publishing a new version of that binary manylinux wheel.
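The second point can be checked mechanically against the backtrace; a minimal sketch over the three frames quoted above (fields abbreviated, the last file path shortened):

```python
import json

# Frames copied from the attached core_backtrace.json, reduced to the
# fields of interest, to confirm the function fixed in openssl#13469
# does not appear anywhere on the crash stack.
frames_json = """
[
  {"function_name": "__pthread_rwlock_rdlock", "file_name": "/usr/lib64/libpthread-2.28.so"},
  {"function_name": "CRYPTO_THREAD_read_lock", "file_name": "/usr/lib64/libcrypto.so.1.1.1k"},
  {"function_name": "ossl_lib_ctx_get_data", "file_name": ".../librdkafka-e18edbcd.so.1"}
]
"""
names = [frame["function_name"] for frame in json.loads(frames_json)]
print("ossl_ctx_thread_stop" in names)  # False: the fixed function is not on this stack
```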
I was able to reproduce the issue in a Docker image of RHEL 8.5. We found that Python 3.6 and 3.8 dynamically load libcrypto.so.1.1.1k, while the confluent-kafka-python wheel contains a statically linked libcrypto 3.0.7. (Using statically linked OpenSSL version OpenSSL 1.1.1k FIPS 25 Mar 2021 (0x101010bf, librdkafka built with 0x30000070) Segmentation fault (core dumped))
Python 3.8 in Redhat
[root@docker-desktop ~]# ldd /venv-kafka/bin/python3.8
linux-vdso.so.1 (0x00007ffc59bf9000)
libcrypto.so.1.1 => /lib64/libcrypto.so.1.1 (0x00007fc8ee72a000)
libpython3.8.so.1.0 => /lib64/libpython3.8.so.1.0 (0x00007fc8ee192000)
libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00007fc8edf69000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fc8edd49000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fc8edb45000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007fc8ed941000)
libm.so.6 => /lib64/libm.so.6 (0x00007fc8ed5bf000)
libc.so.6 => /lib64/libc.so.6 (0x00007fc8ed1f9000)
libz.so.1 => /lib64/libz.so.1 (0x00007fc8ecfe1000)
/lib64/ld-linux-x86-64.so.2 (0x00007fc8eee15000)
Python 3.9 doesn't have this dynamic linking, and hence it works.
We used the same Python 3.8 and 3.6 versions on other Linux distributions and it works fine there; the dynamic loading of libcrypto is not present with Python on those systems. I suspect Red Hat makes some changes to Python in order to make it FIPS compliant (note the "OpenSSL 1.1.1k FIPS" in the debug log).
Python 3.8 in ubuntu
# ldd /usr/bin/python3.8
linux-vdso.so.1 (0x00007ffe867d8000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0f9cbc3000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f0f9cba0000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f0f9cb9a000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f0f9cb95000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f0f9ca46000)
libexpat.so.1 => /lib/x86_64-linux-gnu/libexpat.so.1 (0x00007f0f9ca18000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f0f9c9fa000)
/lib64/ld-linux-x86-64.so.2 (0x00007f0f9cdbb000)
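The ldd comparison above can also be cross-checked from inside Python itself, since the standard-library ssl module reports the OpenSSL build it is linked against; a minimal sketch:

```python
import ssl

# On the affected RHEL 8 interpreters this is expected to report the
# system build, e.g. "OpenSSL 1.1.1k FIPS 25 Mar 2021", matching the
# dynamically loaded libcrypto seen in the ldd output.
print(ssl.OPENSSL_VERSION)
print(hex(ssl.OPENSSL_VERSION_NUMBER))
```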
I would recommend using Python 3.9, or a different distribution of Python at the same version (latest 3.8 or 3.6).
Thank you for your answer!
We will take that issue up with Red Hat to see if they have anything to say about it.
Hello!
I can confirm that we are affected by this issue at work as well. The instantiation of a Consumer instance with SSL enabled immediately leads to a segmentation fault.
The segfault occurs with the default Python 3.6 running on RHEL 8. But it works fine with Python 3.6 (rh-python36 from RHSCL) on RHEL 7.
@jhominal did this issue end up being resolved for you? @Encrypt did the discussion above help at all with your issue as well?
Hello @nhaq-confluent !
We ended up rebuilding our servers. We are currently using version 1.9.2 of the confluent-kafka package (we couldn't use the latest one in our project, I don't remember why), with Python 3.9 on RHEL 8.
I can confirm that this combination works, as mentioned on the first post above :slightly_smiling_face:
Hello everyone!
Just to keep you up to date, I tried installing the latest version (v2.3.0) of the confluent-kafka "pre-built" library on RHEL 8 with Python 3.9. It didn't work, resulting in a segfault.
So, I ended up following the "Install from source" documentation, which worked perfectly :slightly_smiling_face: