KAFKA-19782: improve Authorizer by using (Patricia) trie
This PR address KAFKA-19782 by replacing hashset with trie for prefix search.
Previously the Authorizer uses character-by-character comparisons to search for prefix patterns to deny within a set of allowed literals. This prefix search is suboptimal when the set contains many strings that share the same prefix -- the complexity is linear in number of strings sharing a prefix. By constrast, a trie will reduce the complexity to constant.
In particular, this PR uses Apache Commons Collection's PATRICIA Trie, which is an efficient algorithm that offers worst-case O(k)-time operations, where k is the largest key bit-length.
The resulting code appears also clearer in logic.
Hi seems we make parallel changes :) on the same day https://github.com/apache/kafka/pull/20912
Maybe we should keep just one branch, and I'll merge the benchmarks into it and fix the bug with them?
Also take a look that this PR connected with big problem with benchmarks
- They are unstable and should be run and fixed also
- Error should be(oh.. it's the most dangerous zone) no more than 10 % percent in jmh.
I am stuck at this more than a month
Hi seems we make parallel changes :) on the same day #20912
Maybe we should keep just one branch, and I'll merge the benchmarks into it and fix the bug with them?
Small world :) Sure, happy to collaborate. I'll take a look at your benchmark changes. Once ready, are you able to push those benchmark changes to this branch? Otherwise I am also happy to merge your benchmark changes. Let me know!
@lucasbru Hi Lucas, could you help us out?
We accidentally created two PRs for the same task, and we’ll actually be working together on it going forward. Right now we have a question about which approach is better.
Could you coordinate us a bit and take a look at both PRs?
One PR introduces a PatriciaTrie-based Set on top of PatriciaTrie, - https://github.com/apache/kafka/pull/20911 (this pr) and the other one uses a PatriciaTrie Map. - https://github.com/apache/kafka/pull/20912
In your opinion, which option is better to use? Both approaches work.
By the way, I’ve already written to a colleague from the library Apache common collections and suggested adding a PatriciaTrie Set, but he’s very busy and hasn’t responded yet - https://github.com/apache/commons-collections/pull/589
A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.
A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.
A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.