pulsar
pulsar copied to clipboard
[fix][broker] Fix one potential to add duplicated consumer
Fixes #20576
Motivation
There is a case resulted in duplicated consumer, if we add the same consumer twice continuously, then close consumer, the offline consumer would remain in consumerList, because AbstractDispatcherMultipleConsumers#consumerList is arrayList and AbstractDispatcherMultipleConsumers#consumerSet is set.
There is a concurrent problem in ServerCnx#handleSubscribe, which cause this case. Suppose client request handleSubscribe() three times and request handleCloseConsumer(). The execute order is as follow. since the order is addConsumer->addConsumer->close, but not addConsumer->close->addConsumer, duplicated consumer problem occur.
Modifications
- remove the step 9 "consumersLongHashmap.remove consumerFuture3".
- Because if judge existingConsumerFuture is not null, it means there is an earlier request try to handleSubscribe. the remove operation should be executed in the earlier request.
- add a test to verify this concurrent problem and the fix
Verifying this change
- [x] Make sure that the change passes the CI checks.
Documentation
- [ ]
doc
- [ ]
doc-required
- [x]
doc-not-needed
- [ ]
doc-complete
Matching PR in forked repository
PR in forked repository: https://github.com/TakaHiR07/pulsar/pull/10
You say in the linked issue and you have seen this twice. Are you able to reproduce it manually and verify that with this fix the problem is really fixed?
You say in the linked issue and you have seen this twice. Are you able to reproduce it manually and verify that with this fix the problem is really fixed?
I guess this is one of the possibilities for duplicated consumer. And this pr is corresponding to our production case. I can't affirm the duplicated consumer problem do not occur anymore.
@poorbarcode @eolivelli @lhotari @congbobo184 Can you help review this pr? It is another possibilities besides https://github.com/apache/pulsar/pull/15051
Add a test to verify this concurrent case and the fix. PTAL, thanks. @codelipenghui @eolivelli @lhotari @poorbarcode
The pr had no activity for 30 days, mark with Stale label.
I had the same problem on flink task and got fixed after trying this PR.
@poorbarcode @codelipenghui It seems this problem is not fixed in the master branch.
#22283 has fixed the same problem. Close this pr.