[Bug] Thread safety issues with topic policy mutations
Search before asking
- [X] I searched in the issues and found nothing similar.
Version
Since 2.11
Minimal reproduce step
Topic policies are stored in system topics as PulsarEvent instances, which are serialized using Avro. A PulsarEvent can contain a TopicPoliciesEvent when its eventType is TOPIC_POLICY. The TopicPoliciesEvent contains the TopicPolicies instance.
The problem is that TopicPolicies instances are cached, and these cached instances are mutated in a way that is not thread safe.
What did you expect to see?
TopicPolicies handling should be thread safe.
What did you see instead?
Thread safety concerns aren't covered in TopicPolicies.
Anything else?
There was a similar issue in mutating namespace policies, #9711. That was fixed with an approach where the shared instances are never mutated. The shared/cached instance is cloned and the clone is mutated.
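The clone-and-mutate approach described above could look roughly like the following. This is only a minimal sketch, not the actual Pulsar code: `CopyOnWritePolicies` and the simplified `Policies` class (a stand-in for TopicPolicies with two made-up `int` fields) are hypothetical names introduced for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

public class CopyOnWritePolicies {
    // Hypothetical, simplified stand-in for TopicPolicies.
    public static final class Policies {
        public int maxProducers;
        public int maxConsumers;

        public Policies() {
        }

        // Copy constructor used to clone the shared instance before mutating.
        public Policies(Policies other) {
            this.maxProducers = other.maxProducers;
            this.maxConsumers = other.maxConsumers;
        }
    }

    // The shared (cached) instance. It is replaced, never modified in place.
    private final AtomicReference<Policies> cached =
            new AtomicReference<>(new Policies());

    // Readers get the current snapshot and must treat it as read-only.
    public Policies get() {
        return cached.get();
    }

    // Clone the current snapshot, apply the mutation to the clone,
    // then publish the clone atomically. updateAndGet retries the
    // function if another thread published in between.
    public Policies update(Consumer<Policies> mutation) {
        return cached.updateAndGet(current -> {
            Policies copy = new Policies(current);
            mutation.accept(copy);
            return copy;
        });
    }
}
```

A caller would write something like `policies.update(p -> p.maxProducers = 5)`. A reader that fetched the snapshot before the update keeps seeing the old values, and concurrent updates to different fields are not lost. This only covers mutations within a single JVM.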
Are you willing to submit a PR?
- [ ] I'm willing to submit a PR!
@mattisonchao Since you worked on #21231 recently, would you be interested in checking this one out? I created this issue purely from the perspective of code analysis. The topic policy mutations aren't thread safe and that could lead to other consistency issues. Similar thread safety problems were fixed in the namespace policies (issue #9711). For example #9900 by @315157973 is one of the PRs where this problem was fixed. The fix was essentially about not modifying the shared (via cache) instance at all.
@lhotari Yep, I am working on it.
The issue had no activity for 30 days, mark with Stale label.
cc @poorbarcode
You can drill down to the namespace policy update starting from org.apache.pulsar.broker.admin.impl.NamespacesBase#updatePoliciesAsync. That uses org.apache.pulsar.metadata.api.MetadataCache#readModifyUpdate under the covers. For TopicPolicies, it would be different since the Metadata Store isn't used.
There will always be a possibility of race conditions as long as topic policy mutations can happen on any broker. The thread safety issues mentioned in the issue description might be just one part of the problem.
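For reference, the general read-modify-update idea behind the namespace policy path can be sketched as an optimistic, version-checked write with retry. This is an illustration under assumptions, not the MetadataCache implementation: `VersionedStore`, `Versioned` and `compareAndSet` are hypothetical names, and the store here is a plain in-memory object rather than a metadata store shared between brokers.

```java
import java.util.function.UnaryOperator;

public class VersionedStore<T> {
    // A value together with the version at which it was written.
    public static final class Versioned<T> {
        public final T value;
        public final long version;

        Versioned(T value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private Versioned<T> current;

    public VersionedStore(T initial) {
        this.current = new Versioned<>(initial, 0);
    }

    public synchronized Versioned<T> read() {
        return current;
    }

    // The write succeeds only if nobody has written since
    // expectedVersion was read; otherwise it is rejected.
    public synchronized boolean compareAndSet(long expectedVersion, T newValue) {
        if (current.version != expectedVersion) {
            return false;
        }
        current = new Versioned<>(newValue, expectedVersion + 1);
        return true;
    }

    // Read the value, compute the modified value, and retry from a
    // fresh read whenever a concurrent writer got in first.
    // The modify function should return a new value, not mutate its input.
    public T readModifyUpdate(UnaryOperator<T> modify) {
        while (true) {
            Versioned<T> seen = read();
            T updated = modify.apply(seen.value);
            if (compareAndSet(seen.version, updated)) {
                return updated;
            }
        }
    }
}
```

With a check like this, a write based on a stale read is rejected instead of silently overwriting a newer value. Topic policies would need some equivalent guard in their own storage path to get the same property across brokers.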