kafka icon indicating copy to clipboard operation
kafka copied to clipboard

KAFKA-19782: improve Authorizer by using (Patricia) trie

Open zheguang opened this issue 1 month ago • 7 comments

This PR address KAFKA-19782 by replacing hashset with trie for prefix search.

Previously the Authorizer uses character-by-character comparisons to search for prefix patterns to deny within a set of allowed literals. This prefix search is suboptimal when the set contains many strings that share the same prefix -- the complexity is linear in number of strings sharing a prefix. By constrast, a trie will reduce the complexity to constant.

In particular, this PR uses Apache Commons Collection's PATRICIA Trie, which is an efficient algorithm that offers worst-case O(k)-time operations, where k is the largest key bit-length.

The resulting code appears also clearer in logic.

zheguang avatar Nov 18 '25 09:11 zheguang

Hi seems we make parallel changes :) on the same day https://github.com/apache/kafka/pull/20912

Maybe we should keep just one branch, and I'll merge the benchmarks into it and fix the bug with them?

ekuvardin avatar Nov 18 '25 10:11 ekuvardin

Also take a look that this PR connected with big problem with benchmarks

  1. They are unstable and should be run and fixed also
  2. Error should be(oh.. it's the most dangerous zone) no more than 10 % percent in jmh.

I am stuck at this more than a month

ekuvardin avatar Nov 18 '25 11:11 ekuvardin

Hi seems we make parallel changes :) on the same day #20912

Maybe we should keep just one branch, and I'll merge the benchmarks into it and fix the bug with them?

Small world :) Sure, happy to collaborate. I'll take a look at your benchmark changes. Once ready, are you able to push those benchmark changes to this branch? Otherwise I am also happy to merge your benchmark changes. Let me know!

zheguang avatar Nov 18 '25 20:11 zheguang

@lucasbru Hi Lucas, could you help us out?

We accidentally created two PRs for the same task, and we’ll actually be working together on it going forward. Right now we have a question about which approach is better.

Could you coordinate us a bit and take a look at both PRs?

One PR introduces a PatriciaTrie-based Set on top of PatriciaTrie, - https://github.com/apache/kafka/pull/20911 (this pr) and the other one uses a PatriciaTrie Map. - https://github.com/apache/kafka/pull/20912

In your opinion, which option is better to use? Both approaches work.

By the way, I’ve already written to a colleague from the library Apache common collections and suggested adding a PatriciaTrie Set, but he’s very busy and hasn’t responded yet - https://github.com/apache/commons-collections/pull/589

ekuvardin avatar Nov 19 '25 10:11 ekuvardin

A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers. Once this issue has been triaged, the triage label should be removed to prevent this automation from happening again.

github-actions[bot] avatar Nov 26 '25 03:11 github-actions[bot]

A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers. Once this issue has been triaged, the triage label should be removed to prevent this automation from happening again.

github-actions[bot] avatar Nov 29 '25 03:11 github-actions[bot]

A label of 'needs-attention' was automatically added to this PR in order to raise the attention of the committers. Once this issue has been triaged, the triage label should be removed to prevent this automation from happening again.

github-actions[bot] avatar Dec 09 '25 03:12 github-actions[bot]