[improve][broker] Improve replicated subscription snapshot cache so that subscriptions can be replicated when mark delete position updates are infrequent
### Motivation
There's a known issue with replicated subscriptions where the subscription position doesn't get replicated to the remote cluster when the mark delete position is updated infrequently in the local cluster. This often happens with shared subscriptions, which use individual acknowledgements.
### Modifications
- optimize the memory footprint of the replicated subscription snapshot cache so that more entries can be kept in memory without significantly increasing memory usage
- optimize snapshot cache eviction: when the cache fills up, evict the entry with the minimum distance to its adjacent entries in the snapshot list (see the sketch below)
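To make the eviction policy concrete, here is a minimal sketch of the minimum-distance idea. It is illustrative only, not the actual Pulsar implementation: the class and method names are hypothetical, and a position is flattened to a single `long` instead of a `(ledgerId, entryId)` pair for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the minimum-distance eviction policy -- not the
// actual Pulsar implementation.
class SnapshotCacheSketch {
    record Snapshot(long position, String snapshotId) {}

    private final List<Snapshot> snapshots = new ArrayList<>();
    private final int maxSnapshots;

    SnapshotCacheSketch(int maxSnapshots) {
        this.maxSnapshots = maxSnapshots;
    }

    // Snapshots arrive with monotonically increasing positions, so appending
    // keeps the list sorted.
    void add(Snapshot snapshot) {
        if (snapshots.size() >= maxSnapshots) {
            evictOne();
        }
        snapshots.add(snapshot);
    }

    // Evict the interior entry whose removal loses the least resolution: the
    // one whose neighbors are closest together. The oldest and newest
    // snapshots are kept so the cache still covers its full position range.
    private void evictOne() {
        if (snapshots.size() < 3) {
            snapshots.remove(0); // too few entries to compare gaps; drop the oldest
            return;
        }
        int evictIndex = 1;
        long minGap = Long.MAX_VALUE;
        for (int i = 1; i < snapshots.size() - 1; i++) {
            // Gap that remains between the neighbors if entry i is removed.
            long gap = snapshots.get(i + 1).position() - snapshots.get(i - 1).position();
            if (gap < minGap) {
                minGap = gap;
                evictIndex = i;
            }
        }
        snapshots.remove(evictIndex);
    }
}
```

The intent of the policy is that the retained snapshots stay roughly evenly spread across the position range, so even when the mark delete position advances rarely, a snapshot close to it is likely to still be cached. (The linear scan above is O(n) per eviction; the follow-up performance work discussed later in this thread avoids that cost.)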
### Documentation
- [ ] `doc`
- [x] `doc-required`
- [ ] `doc-not-needed`
- [ ] `doc-complete`
I’ll look through the draft later. BTW, is there anything I can help with? 👀
@Ksnz Thank you. For this particular PR, I think that review feedback, challenging the assumptions, and testing the solution in real-world scenarios are most helpful.
I pushed a follow-up PR to address performance, and it's now very good: 1 million snapshots can be added to the cache (while handling eviction) in just over 1 second, compared to around 17 to 18 seconds before the algorithm was optimized.
Each snapshot consumes around 200 bytes of memory.
With a maximum of 100 cached snapshots per subscription and 5000 replicated subscriptions, the worst case is about 100 MB of heap for snapshots (200 bytes × 100 × 5000). That seems reasonable, so increasing `replicatedSubscriptionsSnapshotMaxCachedPerSubscription` from 10 to 100 is a viable new default.
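For reference, the relevant broker settings would look roughly like this in `broker.conf`. Only `replicatedSubscriptionsSnapshotMaxCachedPerSubscription` is what's discussed above; the other names are from the standard broker configuration, and the values shown are illustrative (check the defaults for your Pulsar version):

```properties
# Enable replicated subscriptions tracking (existing broker setting)
enableReplicatedSubscriptions=true
# How often subscription snapshots are taken (existing broker setting)
replicatedSubscriptionsSnapshotFrequencyMillis=1000
# Proposed new default discussed above: raised from 10 to 100
replicatedSubscriptionsSnapshotMaxCachedPerSubscription=100
```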
/pulsarbot rerun-failure-checks
It seems that there might be a bug in replicated subscriptions related to the snapshot cache. The position used for comparison in the snapshot cache is the position of the marker message, not the snapshot request position. I think it should be the snapshot request position, since that position is the last position guaranteed to be in sync, as long as new messages are produced in only one cluster at a time. In the current solution, since the position used is the marker message position from the last snapshot response, the acknowledged position will be incorrect if new messages continue to be produced while snapshotting is happening but not all messages are acknowledged. It should be possible to reproduce this bug in a test.
Actually, it should be fine, since the remote cluster will only acknowledge up to the message id of the snapshot. However, the assumption with replicated subscriptions is that only one cluster is active and producing messages at a time. If there are multiple active clusters, there could be corner cases where messages get skipped.
After all, this bug does exist. I made a previous attempt to fix it in #16651, and I've now revisited the solution so that it makes more sense and is correct.
#16651 fixes two separate but related issues: a race condition between snapshot creation and mark delete updates, and the use of the wrong position as the snapshot position. The snapshot's position should be the position at which the snapshot request was created, since that's the point guaranteed to be in sync; any later position could contain messages from other clusters. It also fixes the problem that a snapshot couldn't be used when the snapshot position was farther ahead than it actually should have been.
I revisited PR #16651 once again; its description and changes reflect the latest status. Besides the race condition fix, there's a fix for the wrong position issue, which turned out to be slightly different from what I thought before: the code was responding to the request with a response that used the position of the latest message id in the topic. That didn't make sense to me, so I changed it to use the position of the request marker message.
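As an illustration of that position choice, here is a hypothetical sketch (not the code from #16651; all names are illustrative): when a snapshot completes, the cached position is taken from the snapshot request marker, not from the last response marker.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the position choice described above -- not the
// actual code from #16651.
class SnapshotPositionSketch {
    record Position(long ledgerId, long entryId) {}
    record CompletedSnapshot(String snapshotId, Position position) {}

    // snapshotId -> position of the snapshot *request* marker message,
    // recorded when the request marker was written to the topic.
    private final Map<String, Position> requestMarkerPositions = new ConcurrentHashMap<>();

    void onSnapshotRequestWritten(String snapshotId, Position requestMarkerPosition) {
        requestMarkerPositions.put(snapshotId, requestMarkerPosition);
    }

    // Once all remote clusters have responded, the snapshot is cached at the
    // request marker's position: that is the last position guaranteed to be
    // in sync. The response marker's position is deliberately not used --
    // messages published while the snapshot was in flight may be
    // unacknowledged or may originate from another cluster.
    CompletedSnapshot onSnapshotCompleted(String snapshotId, Position responseMarkerPosition) {
        Position requestPosition = requestMarkerPositions.remove(snapshotId);
        return new CompletedSnapshot(snapshotId, requestPosition);
    }
}
```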
Codecov Report
:x: Patch coverage is 86.76471% with 18 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 74.36%. Comparing base (89f6015) to head (cb55b65).
:warning: Report is 8 commits behind head on master.
Additional details and impacted files
```diff
@@             Coverage Diff              @@
##             master   #25044       +/-   ##
=============================================
+ Coverage     38.64%   74.36%   +35.71%
- Complexity    13341    33776    +20435
=============================================
  Files          1864     1921       +57
  Lines        146223   150531     +4308
  Branches      16984    17492      +508
=============================================
+ Hits          56506   111940    +55434
+ Misses        82051    29663    -52388
- Partials       7666     8928     +1262
```
| Flag | Coverage Δ | |
|---|---|---|
| inttests | 26.31% <0.00%> (-0.13%) | :arrow_down: |
| systests | 22.84% <0.00%> (-0.06%) | :arrow_down: |
| unittests | 73.90% <86.76%> (+39.07%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
| Files with missing lines | Coverage Δ | |
|---|---|---|
| ...org/apache/pulsar/broker/ServiceConfiguration.java | 98.22% <ø> (+3.05%) | :arrow_up: |
| ...ker/service/persistent/PersistentSubscription.java | 76.50% <100.00%> (+28.24%) | :arrow_up: |
| .../persistent/ReplicatedSubscriptionsController.java | 74.35% <50.00%> (+74.35%) | :arrow_up: |
| ...ersistent/ReplicatedSubscriptionSnapshotCache.java | 84.17% <88.09%> (+84.17%) | :arrow_up: |