
[fix][client] Fix clearIncomingMessages so that it doesn't leak memory while new entries are added

Open lhotari opened this issue 1 year ago • 5 comments

Motivation

The current ConsumerBase.clearIncomingMessages has a race condition when using the Shared or Key_Shared subscription type.

https://github.com/apache/pulsar/blob/69a45a11e22c1d963391110fb3488c9b9f98f759/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ConsumerBase.java#L1224-L1229

When using Shared or Key_Shared, this method is called from redeliverUnacknowledgedMessages while new entries are still flowing to the client. Entries added concurrently can be cleared without being released, which leaks memory and skews the memory limit counters.
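
To illustrate the race, here is a hypothetical, simplified sketch (the field and method names are illustrative, not the actual Pulsar client code). When the size counter and the queue are handled in separate steps without a real lock, a message that arrives in between is removed from the queue while its bytes stay reserved in the memory-limit accounting.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified sketch of the racy "snapshot, then clear, then reset" pattern.
// The names are illustrative and do not match the real ConsumerBase fields.
class RacyClearSketch {
    private final Queue<byte[]> incomingMessages = new ConcurrentLinkedQueue<>();
    private final AtomicLong incomingMessagesSizeBytes = new AtomicLong();

    void enqueue(byte[] payload) {
        incomingMessages.add(payload);
        incomingMessagesSizeBytes.addAndGet(payload.length);
    }

    void clearIncomingMessagesRacy() {
        long snapshot = incomingMessagesSizeBytes.get();
        // A message enqueued here, after the snapshot but before clear(), is
        // dropped from the queue while its bytes remain counted as "in use",
        // so the memory-limit accounting drifts and effectively leaks.
        incomingMessages.clear();
        incomingMessagesSizeBytes.addAndGet(-snapshot);
    }
}
```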

Modifications

  • Pull messages from the queue and release them one by one.
  • Return the number of messages that were removed (see the sketch below).
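
A minimal sketch of this approach, assuming the incoming queue behaves like a BlockingQueue and each message can be released individually (the class, method, and helper names here are illustrative, not the actual patch):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of draining the incoming queue one message at a time.
// Names and types are illustrative; the real fix lives in ConsumerBase and its subclasses.
class DrainingClearSketch<T> {
    private final BlockingQueue<T> incomingMessages = new LinkedBlockingQueue<>();

    /** Drains and releases queued messages one by one, returning how many were removed. */
    int clearIncomingMessages() {
        int removed = 0;
        T message;
        // Polling one message at a time means entries added concurrently are still
        // drained and released individually, instead of being cleared in bulk
        // against a stale size snapshot.
        while ((message = incomingMessages.poll()) != null) {
            releaseMessage(message);
            removed++;
        }
        return removed;
    }

    private void releaseMessage(T message) {
        // Placeholder: release pooled message buffers and return this message's
        // bytes to the memory-limit / permit accounting.
    }
}
```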

Documentation

  • [ ] doc
  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc-complete

lhotari · Dec 20 '23 13:12

Codecov Report

:x: Patch coverage is 84.37500% with 5 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 73.40%. Comparing base (69a45a1) to head (7f4bd25).
:warning: Report is 1487 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...va/org/apache/pulsar/client/impl/ConsumerImpl.java | 78.94% | 3 Missing and 1 partial :warning: |
| ...ache/pulsar/client/impl/ZeroQueueConsumerImpl.java | 50.00% | 1 Missing :warning: |
Additional details and impacted files


@@             Coverage Diff              @@
##             master   #21767      +/-   ##
============================================
- Coverage     73.43%   73.40%   -0.03%     
+ Complexity    32798    32772      -26     
============================================
  Files          1897     1897              
  Lines        140647   140646       -1     
  Branches      15489    15491       +2     
============================================
- Hits         103290   103248      -42     
- Misses        29283    29316      +33     
- Partials       8074     8082       +8     
| Flag | Coverage Δ |
|---|---|
| inttests | 24.19% <34.37%> (+0.02%) :arrow_up: |
| systests | 24.79% <34.37%> (+0.03%) :arrow_up: |
| unittests | 72.69% <84.37%> (-0.05%) :arrow_down: |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
|---|---|
| ...a/org/apache/pulsar/client/impl/RawReaderImpl.java | 83.65% <100.00%> (ø) |
| ...va/org/apache/pulsar/client/impl/ConsumerBase.java | 73.83% <100.00%> (-0.30%) :arrow_down: |
| ...he/pulsar/client/impl/MultiTopicsConsumerImpl.java | 77.87% <100.00%> (-0.71%) :arrow_down: |
| ...ache/pulsar/client/impl/ZeroQueueConsumerImpl.java | 66.25% <50.00%> (+0.81%) :arrow_up: |
| ...va/org/apache/pulsar/client/impl/ConsumerImpl.java | 78.00% <78.94%> (-0.02%) :arrow_down: |

... and 53 files with indirect coverage changes


codecov-commenter · Dec 20 '23 20:12

Hi @lhotari, is it possible to reproduce the issue? Or is it possible to inject some delay to reproduce the issue with a test? So that we can avoid regression and easily understand what problem is fixed.

codelipenghui · Dec 21 '23 22:12

> Hi @lhotari, is it possible to reproduce the issue? Or is it possible to inject some delay to reproduce the issue with a test? So that we can avoid regression and easily understand what problem is fixed.

@codelipenghui I'll try to do that. However, it's pretty clear from the code that there are multiple race conditions that this PR would address. One of the races is caused by the NoOpLock used for shared subscription.

lhotari · Dec 22 '23 06:12

Here's one issue that I was able to reproduce; I haven't yet checked whether this PR fixes the problem. The repro app might be about a different issue.

A very messy repro app that does nasty things: https://github.com/lhotari/pulsar-playground/blob/lh-PR21767-investigation/src/main/java/com/github/lhotari/pulsar/playground/TestScenarioIssueRedeliveries.java

repro:

git clone -b lh-PR21767-investigation https://github.com/lhotari/pulsar-playground
cd pulsar-playground
./gradlew shadowJar
# start Pulsar in docker in the background
docker run --rm --name pulsar-standalone -d -p 8080:8080 -p 6650:6650 apachepulsar/pulsar:3.1.1 /pulsar/bin/pulsar standalone -nss -nfw
# wait a few seconds for Pulsar to start
sleep 5
# run the test app using the shadow jar with all dependencies
java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioIssueRedeliveries
# stop and remove Pulsar in docker
docker stop pulsar-standalone

All messages should eventually be received, but in the test case about 5 to 15 out of 10000 are usually lost.

The test app prints how many messages are remaining (lost):

[main] INFO com.github.lhotari.pulsar.playground.TestScenarioIssueRedeliveries - Done receiving. Remaining: 10 duplicates: 8704 reconsumed: 985
[main] ERROR com.github.lhotari.pulsar.playground.TestScenarioIssueRedeliveries - Not all messages received. Remaining: 10

UPDATE: This issue reproduces even with the changes in this PR, so it's not fixed by PR #21767.

I made changes to the test app build so that a shadow jar can be built with the locally built snapshot version of the Pulsar client.

# build the local version of Pulsar with this command before building the shadow jar:
# mvn -Pcore-modules,-main -T 1C clean install -DskipTests -Dspotbugs.skip=true
# then build the shadow jar (see the build instructions above)
PULSAR_VERSION=3.2.0-SNAPSHOT ./gradlew shadowJar

UPDATE 2: The test case was bad. I had forgotten a DLQ config with max redeliveries set to 5, and that caused the problem.

lhotari · Dec 22 '23 18:12

UPDATE 3 about the repro app:

When I increase the number of messages to 1M, processing gets stuck in a loop where only the redelivered messages keep rotating. That is expected in many ways when consumer.redeliverUnacknowledgedMessages() is called frequently; calling that method is what causes the issues. I wonder how many Pulsar applications in the wild contain this type of mistake?
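
To make the mistake concrete, here is a hypothetical consumer loop (application code, not Pulsar internals) that calls redeliverUnacknowledgedMessages() on every processing failure. Under a high error rate it keeps asking the broker to redispatch all unacked messages instead of handling only the message that failed:

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClientException;

class RedeliveryLoopExample {
    // Hypothetical consumer loop illustrating the problematic pattern: calling
    // redeliverUnacknowledgedMessages() on every failure makes the broker
    // redispatch the whole set of unacked messages again and again.
    static void consumeLoop(Consumer<byte[]> consumer) throws PulsarClientException {
        while (true) {
            Message<byte[]> msg = consumer.receive();
            try {
                process(msg);                 // hypothetical application logic
                consumer.acknowledge(msg);
            } catch (Exception e) {
                // Anti-pattern under a high error rate: this redelivers *all*
                // unacked messages for this consumer, not just the failed one.
                consumer.redeliverUnacknowledgedMessages();
            }
        }
    }

    static void process(Message<byte[]> msg) {
        // Placeholder for real processing that may throw.
    }
}
```

In most applications, calling negativeAcknowledge(msg) on the individual failed message is likely a better fit than the blanket redelivery call.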

Could we improve the documentation for redeliverUnacknowledgedMessages to reduce confusion? What is a valid use case for this method? Is it really needed in the user-level API?

Since the failover/shared subscription types essentially call consumer.redeliverUnacknowledgedMessages() under the covers, a Pulsar application can get into this vicious cycle where the number of failed messages is very high and very little progress, or in the worst case no progress, is made between the redelivery calls. The consumer epoch solution helps with the problem, but I can see how it doesn't prevent the situation where a large number of messages are in redelivery and keep the loop between the client and the broker busy. Such a problem could be detected from the redelivery counters in the topic stats, but IIRC that isn't directly visible in the Grafana / Prometheus metrics.

In most cases the situation would resolve itself after a while, but only after burning a lot of CPU cycles and network bandwidth. I wonder if there's a better way to deal with high error rates in Pulsar?

lhotari · Dec 22 '23 20:12