[fix][client] Fix the blocked producer due to chunking when blockIfQueueFull is enabled

Open Gleiphir2769 opened this issue 2 years ago • 2 comments

Fixes #17446

Motivation

A producer can be blocked permanently by chunked messages when blockIfQueueFull is enabled.

The bug is caused by the way send permits (the producer's semaphore) are acquired for a chunked message: https://github.com/apache/pulsar/blob/359cfa7bc05775bf6dd004f21b9907610ed3b3d5/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ProducerImpl.java#L520-L527

When a large message is split into a large number of chunks (i.e. the message is too big or chunkMaxMessageSize is set too small), the message acquires all the remaining permits and still needs more. The send (send() or sendAsync()) of the large message is then blocked by itself forever.
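The self-blocking can be pictured with a plain java.util.concurrent.Semaphore. This is a simplified model and not the actual ProducerImpl code: the class name ChunkPermitModel, the helper reservePermits, and the numbers are made up for illustration, and it assumes that a permit is reserved for every chunk before any chunk is sent. A non-blocking tryAcquire() stands in for the blocking acquire so that the sketch terminates.

```java
import java.util.concurrent.Semaphore;

// Hypothetical model of per-chunk permit reservation; not Pulsar code.
class ChunkPermitModel {

    /**
     * Tries to take one permit per chunk before any chunk is sent.
     * Returns how many permits were obtained. A result below totalChunks
     * marks the point where a blocking acquire() would wait forever,
     * because nothing has been sent yet that could release a permit.
     */
    static int reservePermits(Semaphore permits, int totalChunks) {
        int acquired = 0;
        for (int i = 0; i < totalChunks; i++) {
            if (!permits.tryAcquire()) {
                break; // with blockIfQueueFull the real call would block here
            }
            acquired++;
        }
        return acquired;
    }

    public static void main(String[] args) {
        Semaphore permits = new Semaphore(5);   // stands in for maxPendingMessages = 5
        int got = reservePermits(permits, 8);   // one message split into 8 chunks
        System.out.println("acquired " + got + " of 8, available " + permits.availablePermits());
    }
}
```

In this model the message holds all 5 permits, needs 3 more, and no pending send exists that could ever return one.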

By the way, whenever blockIfQueueFull, maxPendingMessages, and chunking are all enabled at the same time, this deadlock risk exists even if a single message is not split into a very large number of chunks.

Modifications

When chunking is enabled, blockIfQueueFull is always disabled.
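One way to picture the effect of this change is the sketch below. It is an illustration under assumptions, not the patch itself: FailFastReserve, effectiveBlockIfQueueFull, and reserveOrRollback are hypothetical names, and the sketch assumes that a non-blocking producer fails the send and returns any permits it already took when it cannot get one for every chunk.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of "chunking on => blocking off"; not Pulsar code.
class FailFastReserve {

    /** The blocking flag only takes effect when chunking is off. */
    static boolean effectiveBlockIfQueueFull(boolean chunkingEnabled, boolean requested) {
        return requested && !chunkingEnabled;
    }

    /**
     * Non-blocking reservation of one permit per chunk.
     * If a permit is missing, the permits taken so far are released
     * and the send is reported as failed instead of waiting.
     */
    static boolean reserveOrRollback(Semaphore permits, int totalChunks) {
        for (int i = 0; i < totalChunks; i++) {
            if (!permits.tryAcquire()) {
                permits.release(i); // give back the i permits already held
                return false;
            }
        }
        return true;
    }
}
```

With 5 permits, reserveOrRollback(permits, 8) returns false and leaves all 5 permits available, so later sends can still proceed.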

Verifying this change

  • [x] Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

  • Added integration tests for end-to-end deployment with large payloads (10MB)

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API: (yes / no)
  • The schema: (yes / no / don't know)
  • The default values of configurations: (yes / no)
  • The wire protocol: (yes / no)
  • The rest endpoints: (yes / no)
  • The admin cli options: (yes / no)
  • Anything that affects deployment: (yes / no / don't know)

Documentation

Check the box below or label this PR directly.

Need to update docs?

  • [ ] doc-required (Your PR needs to update docs and you will update later)

  • [x] doc-not-needed (Please explain why)

  • [ ] doc (Your PR contains doc changes)

  • [ ] doc-complete (Docs have been already added)

Gleiphir2769 avatar Sep 03 '22 09:09 Gleiphir2769

Does this happen in a real-world scenario? Even if a chunked message takes all remaining send permits, the sends of earlier messages will complete and release their permits, unless one big chunked message takes up all available permits. That case looks more like a configuration issue; should the maxPendingMessages count be increased instead?

MarvinCai avatar Sep 03 '22 14:09 MarvinCai

Hi @MarvinCai. This does not only happen with one big chunked message. If a client sends big chunked messages concurrently, it is even easier for them to take all remaining send permits, so that no more permits can be released. The most important point is that sending a chunk may acquire permits that it cannot release, so the risk remains even with an increased maxPendingMessages.
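The concurrent case can be sketched as an idealized interleaving. Again this is a model under assumptions and not the client code: ConcurrentChunkModel and interleave are hypothetical names, the senders take turns acquiring one permit each, and every sender is assumed to need all its permits before it sends anything.

```java
import java.util.concurrent.Semaphore;

// Hypothetical round-robin model of concurrent chunked sends; not Pulsar code.
class ConcurrentChunkModel {

    /**
     * Each sender takes one permit per turn until it holds chunksPerMessage
     * permits or no permit is left. Returns the permits held per sender.
     * A sender holding fewer than chunksPerMessage would be blocked.
     */
    static int[] interleave(Semaphore permits, int senders, int chunksPerMessage) {
        int[] held = new int[senders];
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (int s = 0; s < senders; s++) {
                if (held[s] < chunksPerMessage && permits.tryAcquire()) {
                    held[s]++;
                    progressed = true;
                }
            }
        }
        return held;
    }

    public static void main(String[] args) {
        // 4 permits, 2 concurrent messages of 3 chunks each.
        int[] held = interleave(new Semaphore(4), 2, 3);
        System.out.println("sender 0 holds " + held[0] + ", sender 1 holds " + held[1]);
    }
}
```

Each message would fit on its own (3 chunks, 4 permits), yet in this interleaving both senders end up holding 2 permits and waiting for a third that neither can release.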

Gleiphir2769 avatar Sep 03 '22 16:09 Gleiphir2769

The pr had no activity for 30 days, mark with Stale label.

github-actions[bot] avatar Oct 21 '22 02:10 github-actions[bot]