pulsar-client-python icon indicating copy to clipboard operation
pulsar-client-python copied to clipboard

[Bug] Segmentation fault when properties exceed ~100kb

Open emanueledomingo opened this issue 1 year ago • 6 comments

Hi everyone, i'm running an event driven app and i use apache pulsar as backbone. I use message's properties to exchange metadata between the services.

I noted that, when the properties exceed ~100kb the client gives Segmentation fault (core dumped) (sometimes corrupted double-linked list instead)

I haven't done a proper test on the size, so take the size as a guideline

I was able to simulate the behaviour in a notebook:

import pulsar
import sys
import json

# need to serialize to json nested object because client accepts properties only
# of type (self: _pulsar.MessageBuilder, arg0: str, arg1: str)
nested_dict = json.dumps({
    "foo": "bar"
})

big_dict = {
    f"key{i}": nested_dict for i in range(3000)
}

print(sys.getsizeof(big_dict))  # 103856

client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer('test-topic')

producer.send(content="test-message".encode(), properties=big_dict)  # segfault

Unfortunately i don't have details in the stacktrace:

In [4]: producer.send(content="test-message".encode(), properties=big_dict)
Segmentation fault (core dumped)

Pulsar client version is 3.5.0.

emanueledomingo avatar Aug 27 '24 08:08 emanueledomingo

🙈 I made a mistake in moving this issue. I first thought that it was in the apache/pulsar repository and transferred it to pulsar-client-cpp. I'm sorry about that. Perhaps this is pulsar-client-cpp related in any case.

lhotari avatar Aug 27 '24 15:08 lhotari

Ok, thanks! I also think it has to do with the core c++ implementation. But as a python user I wasn't sure.

emanueledomingo avatar Aug 27 '24 16:08 emanueledomingo

Yeah, it's related to the C++ core. And it's a known issue. It might be easy to fix. I will take some time for it when I'm free.

BewareMyPower avatar Aug 29 '24 08:08 BewareMyPower

I just double checked the issue and it looks like a bug with protobuf:

apache/pulsar-client-cpp#1  0x0000ffffb1a58d2c in pulsar::proto::KeyValue::_InternalSerialize(unsigned char*, google::protobuf::io::EpsCopyOutputStream*) const () from /usr/local/lib/python3.10/dist-packages/pulsar_client.libs/libpulsar-8dc83e12.so
apache/pulsar-client-cpp#2  0x0000ffffb1cc50e0 in google::protobuf::internal::WireFormatLite::InternalWriteMessage(int, google::protobuf::MessageLite const&, int, unsigned char*, google::protobuf::io::EpsCopyOutputStream*) ()
   from /usr/local/lib/python3.10/dist-packages/pulsar_client.libs/libpulsar-8dc83e12.so
apache/pulsar-client-cpp#3  0x0000ffffb1a5b718 in pulsar::proto::MessageMetadata::_InternalSerialize(unsigned char*, google::protobuf::io::EpsCopyOutputStream*) const () from /usr/local/lib/python3.10/dist-packages/pulsar_client.libs/libpulsar-8dc83e12.so
apache/pulsar-client-cpp#4  0x0000ffffb1c96478 in google::protobuf::SerializeToArrayImpl(google::protobuf::MessageLite const&, unsigned char*, int) () from /usr/local/lib/python3.10/dist-packages/pulsar_client.libs/libpulsar-8dc83e12.so
apache/pulsar-client-cpp#5  0x0000ffffb1c94e98 in google::protobuf::MessageLite::SerializePartialToArray(void*, int) const () from /usr/local/lib/python3.10/dist-packages/pulsar_client.libs/libpulsar-8dc83e12.so
apache/pulsar-client-cpp#6  0x0000ffffb1c94d50 in google::protobuf::MessageLite::SerializeToArray(void*, int) const () from /usr/local/lib/python3.10/dist-packages/pulsar_client.libs/libpulsar-8dc83e12.so
apache/pulsar-client-cpp#7  0x0000ffffb1920704 in pulsar::Commands::newSend(pulsar::SharedBuffer&, pulsar::proto::BaseCommand&, pulsar::Commands::ChecksumType, pulsar::SendArguments const&) ()
   from /usr/local/lib/python3.10/dist-packages/pulsar_client.libs/libpulsar-8dc83e12.so

I tried the python client on macOS and it does not have this error but encountered the same issue for Ubuntu.

However, I tried the C++ client 3.5.1 on Ubuntu and this error does not happen:

#include <pulsar/Client.h>
using namespace pulsar;

int main() {
    const std::string topic = "test-topic";
    Client client("pulsar://host.docker.internal:6650");
    Producer producer;
    client.createProducer(topic, producer);

    MessageBuilder::StringMap properties;
    for (int i = 0; i < 3000; i++) {
        properties["key" + std::to_string(i)] = "{\"foo\": \"bar\"}";
    }
    auto msg = MessageBuilder().setProperties(properties).setContent("test-message").build();
    producer.send(msg);

    client.close();
}

Increasing the i in loop does not fail with segfault as well. (if the properties are too big, ResultMessageTooLarge will be returned)

It might be an issue from the Python client side. So let me move it to pulsar-client-python first.

BewareMyPower avatar Aug 29 '24 10:08 BewareMyPower

Oh I realized where the issue is, there is a bug with the C++ client when batching is disabled. However, the Python client somehow disables batching by default.

You can manually enable the batching to get the issue around, while I'm going to push a fix at the C++ client.

I also pushed a PR (https://github.com/apache/pulsar-client-python/pull/224) to enable batching by default.

BewareMyPower avatar Aug 29 '24 12:08 BewareMyPower

I confirm that by enabling batching, the bug does not appear. Thanks for the workaround suggestion!

emanueledomingo avatar Aug 29 '24 14:08 emanueledomingo