[improve][broker] Improve replicated subscription snapshot cache so that subscriptions can be replicated when mark delete position updates are infrequent
### Motivation
There's a known issue with replicated subscriptions where the subscription position doesn't get replicated to the remote cluster when the mark delete position is updated infrequently in the local cluster. This often happens with shared subscriptions, which use individual acknowledgements.
### Modifications
- optimize the memory footprint of the replicated subscription snapshot cache so that more entries can be kept in memory without significantly increasing memory usage
- optimize snapshot cache eviction: when the cache fills up, evict the entry with the minimum distance to its adjacent entries in the snapshot list (see the sketch below)
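To make the eviction policy concrete, here is a minimal sketch of the minimum-distance idea. It is illustrative only, not the actual Pulsar implementation: the class and method names are hypothetical, and a position is flattened to a single `long` instead of a `(ledgerId, entryId)` pair for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the minimum-distance eviction policy -- not the
// actual Pulsar implementation.
class SnapshotCacheSketch {
    record Snapshot(long position, String snapshotId) {}

    private final List<Snapshot> snapshots = new ArrayList<>();
    private final int maxSnapshots;

    SnapshotCacheSketch(int maxSnapshots) {
        this.maxSnapshots = maxSnapshots;
    }

    // Snapshots arrive with monotonically increasing positions, so appending
    // keeps the list sorted.
    void add(Snapshot snapshot) {
        if (snapshots.size() >= maxSnapshots) {
            evictOne();
        }
        snapshots.add(snapshot);
    }

    // Evict the interior entry whose removal loses the least resolution: the
    // one whose neighbors are closest together. The oldest and newest
    // snapshots are kept so the cache still covers its full position range.
    private void evictOne() {
        if (snapshots.size() < 3) {
            snapshots.remove(0); // too few entries to compare gaps; drop the oldest
            return;
        }
        int evictIndex = 1;
        long minGap = Long.MAX_VALUE;
        for (int i = 1; i < snapshots.size() - 1; i++) {
            // Gap that remains between the neighbors if entry i is removed.
            long gap = snapshots.get(i + 1).position() - snapshots.get(i - 1).position();
            if (gap < minGap) {
                minGap = gap;
                evictIndex = i;
            }
        }
        snapshots.remove(evictIndex);
    }
}
```

The intent of the policy is that the retained snapshots stay roughly evenly spread across the position range, so even when the mark delete position advances rarely, a snapshot close to it is likely to still be cached. (The linear scan above is O(n) per eviction; the follow-up performance work discussed later in this thread avoids that cost.)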
### Documentation
- [ ] `doc`
- [x] `doc-required`
- [ ] `doc-not-needed`
- [ ] `doc-complete`
I’ll look through the draft later. BTW, is there anything I can help with? 👀
@Ksnz Thank you. For this particular PR, I think that review feedback, challenging the assumptions, and testing the solution in real-world scenarios are most helpful.
I pushed a follow-up PR to address performance, and it's now very good: 1 million snapshots can be added to the cache (while handling eviction) in just over 1 second, compared to around 17 to 18 seconds before the algorithm was optimized.
Each snapshot consumes around 200 bytes of memory.
With a maximum of 100 cached snapshots per subscription and 5000 replicated subscriptions, the worst case is about 100 MB of heap for snapshots (200 bytes × 100 × 5000). That seems reasonable, so increasing `replicatedSubscriptionsSnapshotMaxCachedPerSubscription` from 10 to 100 is a viable new default.
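For reference, the relevant broker settings would look roughly like this in `broker.conf`. Only `replicatedSubscriptionsSnapshotMaxCachedPerSubscription` is what's discussed above; the other names are from the standard broker configuration, and the values shown are illustrative (check the defaults for your Pulsar version):

```properties
# Enable replicated subscriptions tracking (existing broker setting)
enableReplicatedSubscriptions=true
# How often subscription snapshots are taken (existing broker setting)
replicatedSubscriptionsSnapshotFrequencyMillis=1000
# Proposed new default discussed above: raised from 10 to 100
replicatedSubscriptionsSnapshotMaxCachedPerSubscription=100
```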
/pulsarbot rerun-failure-checks
It seems that there might be a bug in replicated subscriptions related to the snapshot cache. The position used for comparison in the snapshot cache is the position of the marker message, not the snapshot request position. I think it should be the snapshot request position, since that position is the last position guaranteed to be in sync, as long as new messages are produced in only one cluster at a time. In the current solution, since the position used is the marker message position from the last snapshot response, the acknowledged position will be incorrect if new messages continue to be produced while snapshotting is happening but not all messages are acknowledged. It should be possible to reproduce this bug in a test.
Actually, it should be fine, since the remote cluster will only acknowledge up to the message id of the snapshot. However, the assumption with replicated subscriptions is that only one cluster is active and producing messages at a time. If there are multiple active clusters, there could be corner cases where messages get skipped.
After all, this bug does exist. I made a previous attempt to fix it in #16651, and I've now revisited the solution so that it makes more sense and is correct.
#16651 fixes two separate but related issues: a race condition between snapshot creation and mark delete updates, and the use of the wrong position as the snapshot position. The snapshot's position should be the position at which the snapshot request was created, since that's the point guaranteed to be in sync; any later position could contain messages from other clusters. It also fixes the problem that a snapshot couldn't be used when the snapshot position was farther ahead than it actually should have been.
I revisited PR #16651 once again; its description and changes reflect the latest status. Besides the race condition fix, there's a fix for the wrong position issue, which turned out to be slightly different from what I thought before: the code was responding to the request with a response that used the position of the latest message id in the topic. That didn't make sense to me, so I changed it to use the position of the request marker message.
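As an illustration of that position choice, here is a hypothetical sketch (not the code from #16651; all names are illustrative): when a snapshot completes, the cached position is taken from the snapshot request marker, not from the last response marker.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the position choice described above -- not the
// actual code from #16651.
class SnapshotPositionSketch {
    record Position(long ledgerId, long entryId) {}
    record CompletedSnapshot(String snapshotId, Position position) {}

    // snapshotId -> position of the snapshot *request* marker message,
    // recorded when the request marker was written to the topic.
    private final Map<String, Position> requestMarkerPositions = new ConcurrentHashMap<>();

    void onSnapshotRequestWritten(String snapshotId, Position requestMarkerPosition) {
        requestMarkerPositions.put(snapshotId, requestMarkerPosition);
    }

    // Once all remote clusters have responded, the snapshot is cached at the
    // request marker's position: that is the last position guaranteed to be
    // in sync. The response marker's position is deliberately not used --
    // messages published while the snapshot was in flight may be
    // unacknowledged or may originate from another cluster.
    CompletedSnapshot onSnapshotCompleted(String snapshotId, Position responseMarkerPosition) {
        Position requestPosition = requestMarkerPositions.remove(snapshotId);
        return new CompletedSnapshot(snapshotId, requestPosition);
    }
}
```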
Codecov Report
:x: Patch coverage is 86.76471% with 18 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 74.36%. Comparing base (89f6015) to head (cb55b65).
:warning: Report is 8 commits behind head on master.
Additional details and impacted files
```diff
@@             Coverage Diff              @@
##             master   #25044       +/-   ##
=============================================
+ Coverage     38.64%   74.36%   +35.71%
- Complexity    13341    33776    +20435
=============================================
  Files          1864     1921       +57
  Lines        146223   150531     +4308
  Branches      16984    17492      +508
=============================================
+ Hits          56506   111940    +55434
+ Misses        82051    29663    -52388
- Partials       7666     8928     +1262
```
| Flag | Coverage Δ | |
|---|---|---|
| inttests | 26.31% <0.00%> (-0.13%) | :arrow_down: |
| systests | 22.84% <0.00%> (-0.06%) | :arrow_down: |
| unittests | 73.90% <86.76%> (+39.07%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
| Files with missing lines | Coverage Δ | |
|---|---|---|
| ...org/apache/pulsar/broker/ServiceConfiguration.java | 98.22% <ø> (+3.05%) | :arrow_up: |
| ...ker/service/persistent/PersistentSubscription.java | 76.50% <100.00%> (+28.24%) | :arrow_up: |
| .../persistent/ReplicatedSubscriptionsController.java | 74.35% <50.00%> (+74.35%) | :arrow_up: |
| ...ersistent/ReplicatedSubscriptionSnapshotCache.java | 84.17% <88.09%> (+84.17%) | :arrow_up: |