
Key-based batching and hashingScheme parameter for Python client

Open matejsaravanja opened this issue 5 years ago • 5 comments

Short intro: We're running a platform that's effectively a set of microservices written in Python, Java and Go that communicate through Kafka. We're currently considering moving from Kafka to Pulsar. However, one of the main features we need is key-based message routing to solve concurrency problems. Pulsar offers that through the KeyShared subscription mode.

The problem: In Python, a KeyShared subscription works only if the producer disables batching. I ran some tests in my local environment, and disabling batching causes a large drop in throughput (~25k/s with batching vs ~9k/s without).

Features that would've solved the problem

  • Key-based batching for Python client
  • HashingScheme parameter when creating producer so all of our services (Java, Python and Go) could have the same hashing scheme

matejsaravanja avatar May 12 '20 08:05 matejsaravanja

You could create multiple producers and deliver the message to the right producer associated with the key.
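A rough sketch of that workaround, under stated assumptions: `PerKeyProducers` and `FakeProducer` are hypothetical names, and the fake only mimics a `send(content, partition_key=...)` call so the example runs without a broker. In real code the factory would create an actual producer with batching enabled.

```python
from typing import Callable, Dict, List, Tuple


class FakeProducer:
    """Stand-in for a real producer; records what it was asked to send."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, bytes]] = []

    def send(self, content: bytes, partition_key: str = None) -> None:
        self.sent.append((partition_key, content))


class PerKeyProducers:
    """Keeps one producer per key, so each producer's batches hold one key."""

    def __init__(self, create_producer: Callable[[], FakeProducer]) -> None:
        self._create = create_producer
        self._by_key: Dict[str, FakeProducer] = {}

    def send(self, key: str, payload: bytes) -> None:
        producer = self._by_key.get(key)
        if producer is None:
            # Lazily create the producer the first time a key is seen.
            producer = self._by_key[key] = self._create()
        producer.send(payload, partition_key=key)

    def producer_count(self) -> int:
        return len(self._by_key)


if __name__ == "__main__":
    pool = PerKeyProducers(FakeProducer)
    for i in range(10):
        pool.send("A" if i % 2 == 0 else "B", str(i).encode())
    print(pool.producer_count())
```

The number of producers grows with the number of distinct keys, so this only suits a small, bounded key set.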

However, I suspected that it's broker's problem. Here is my experiment:

First, I created 2 producers and 2 consumers. Both producers had a batching max publish delay of 3000 ms, which is long enough for batching to take place. Then I used one producer to send messages with key "A" and the other to send messages with key "B". Only one consumer received the messages:

topic: persistent://public/default/FooTest | payload: 0 | key: A | id: (3084,40,-1,0)
topic: persistent://public/default/FooTest | payload: 2 | key: A | id: (3084,40,-1,1)
topic: persistent://public/default/FooTest | payload: 4 | key: A | id: (3084,40,-1,2)
topic: persistent://public/default/FooTest | payload: 6 | key: A | id: (3084,40,-1,3)
topic: persistent://public/default/FooTest | payload: 8 | key: A | id: (3084,40,-1,4)
topic: persistent://public/default/FooTest | payload: 1 | key: B | id: (3084,41,-1,0)
topic: persistent://public/default/FooTest | payload: 3 | key: B | id: (3084,41,-1,1)
topic: persistent://public/default/FooTest | payload: 5 | key: B | id: (3084,41,-1,2)
topic: persistent://public/default/FooTest | payload: 7 | key: B | id: (3084,41,-1,3)
topic: persistent://public/default/FooTest | payload: 9 | key: B | id: (3084,41,-1,4)

The 4th field of the message id is the batch index, so you can see that the 10 messages with 2 keys were sent as 2 batched messages, one per key. But when I disabled batching and sent the messages again, one consumer received the messages with key "A" and the other consumer received the messages with key "B".
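The grouping can be checked mechanically. The sketch below parses lines in the format printed above and groups them by the first two id fields. Treating those two fields as ledger id and entry id, and the third as the partition, is my assumption; only the batch index is confirmed above. `group_by_entry` and `SAMPLE` are names made up for this example.

```python
import re
from collections import defaultdict

# Matches "payload: N | key: K | id: (a,b,c,d)" from the consumer output.
LINE = re.compile(
    r"payload: (\d+) \| key: (\w+) \| id: \((\d+),(\d+),(-?\d+),(\d+)\)"
)

SAMPLE = [
    "topic: persistent://public/default/FooTest | payload: 0 | key: A | id: (3084,40,-1,0)",
    "topic: persistent://public/default/FooTest | payload: 2 | key: A | id: (3084,40,-1,1)",
    "topic: persistent://public/default/FooTest | payload: 1 | key: B | id: (3084,41,-1,0)",
    "topic: persistent://public/default/FooTest | payload: 3 | key: B | id: (3084,41,-1,1)",
]


def group_by_entry(lines):
    """Group messages by (field 1, field 2) of the id, i.e. per batched entry."""
    batches = defaultdict(list)
    for line in lines:
        match = LINE.search(line)
        if match is None:
            continue  # skip lines that are not consumer output
        payload, key, first, second, _third, batch_index = match.groups()
        batches[(int(first), int(second))].append(
            (int(batch_index), key, int(payload))
        )
    return dict(batches)


if __name__ == "__main__":
    for entry, messages in group_by_entry(SAMPLE).items():
        print(entry, messages)
```

Each group here contains a single key, which matches the observation that every batch came from one producer sending one key.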

BewareMyPower avatar May 27 '20 15:05 BewareMyPower

For what I've mentioned before:

However, I suspected that it's broker's problem.

Yeah, it's a problem in 2.5.x. When I upgraded the broker to 2.6.0, the problem disappeared, though I don't know which commit fixed it.

Currently, creating multiple producers and choosing the producer associated with the key achieves the same goal as key-based batching. I plan to implement key-based batching for the C++ client soon.

BewareMyPower avatar Jul 01 '20 04:07 BewareMyPower

We changed our architecture a bit so key-based batching is not necessary anymore, but before that, creating multiple producers to imitate key-based batching wasn't an option because of potential scaling issues. If you implement it, keep us posted 😄

matejsaravanja avatar Jul 01 '20 07:07 matejsaravanja

I've just pushed a PR, apache/pulsar#7996, to support key-based batching. But I'm not familiar with the Python wrapper, so I haven't added the batching type config to the Python client yet.

BewareMyPower avatar Sep 07 '20 11:09 BewareMyPower

apache/pulsar#8185 added the key-based batching support. HashingScheme is not supported yet.
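For readers new to the idea, here is a toy model of key-based batching. It is not the client's implementation, and `KeyBasedBatcher` and its methods are hypothetical. Messages are buffered per key, so every flushed batch carries exactly one key and can be dispatched to a single KeyShared consumer.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class KeyBasedBatcher:
    """Toy batcher: one pending batch per key, flushed when it fills up."""

    def __init__(self, max_messages: int) -> None:
        if max_messages < 1:
            raise ValueError("max_messages must be at least 1")
        self._max = max_messages
        self._pending: Dict[str, List[bytes]] = defaultdict(list)

    def add(self, key: str, payload: bytes) -> List[Tuple[str, List[bytes]]]:
        """Buffer a message; return the key's batch if it just became full."""
        batch = self._pending[key]
        batch.append(payload)
        if len(batch) >= self._max:
            del self._pending[key]
            return [(key, batch)]
        return []

    def flush(self) -> List[Tuple[str, List[bytes]]]:
        """Return all pending batches (e.g. when the publish delay expires)."""
        batches = list(self._pending.items())
        self._pending.clear()
        return batches


if __name__ == "__main__":
    batcher = KeyBasedBatcher(max_messages=100)
    for i in range(10):
        batcher.add("A" if i % 2 == 0 else "B", str(i).encode())
    for key, batch in batcher.flush():
        print(key, batch)
```

A single producer can interleave keys freely under this model, which avoids the one-producer-per-key workaround.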

BewareMyPower avatar Oct 28 '22 03:10 BewareMyPower