Reduce Heap Usage of OnHeapStringDictionary

Open vvivekiyer opened this issue 1 year ago • 14 comments

Our OnHeapStringDictionary implementation can waste a lot of heap if a column contains enough duplicate strings.

Below is a JXray analysis of the heap dump for one use case at LinkedIn, where the OnHeapStringDictionary uses about 13GB of heap: [image]

String interning, as described in https://www.baeldung.com/string/intern, solves this problem. However, for certain high-cardinality columns (even ones with enough duplicates), interning can be counterproductive. We can address this with a fixed-size interner, as described in the following article: https://dzone.com/articles/duplicate-strings-how-to-get-rid-of-them-and-save.
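To make the idea concrete, here is a minimal sketch of a fixed-size interner. It is not the PoC code and not an existing Pinot class: the name `FixedSizeInterner`, the array-backed design, and the overwrite-on-collision policy are assumptions chosen for illustration. Memory is bounded by the capacity, so a high-cardinality column only loses some deduplication instead of growing the intern table without limit.

```java
// Hypothetical sketch of a fixed-size string interner (not Pinot's implementation).
final class FixedSizeInterner {
    // One slot per hash bucket; the array never grows, so memory stays bounded.
    private final String[] cache;

    FixedSizeInterner(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.cache = new String[capacity];
    }

    // Returns a canonical instance equal to s if one is cached in its slot;
    // otherwise stores s in that slot (evicting any previous occupant) and returns s.
    String intern(String s) {
        if (s == null) {
            return null;
        }
        // floorMod keeps the index non-negative even for negative hash codes.
        int slot = Math.floorMod(s.hashCode(), cache.length);
        String cached = cache[slot];
        if (s.equals(cached)) {
            return cached; // duplicate: reuse the cached instance
        }
        cache[slot] = s; // miss or collision: overwrite the slot
        return s;
    }
}
```

This sketch is unsynchronized. A race between threads can only cause a missed deduplication, never an incorrect string, but a production version would need to decide whether that trade-off is acceptable.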

I built a PoC of this change for one of our use cases and saw large savings in heap usage. Below is the heap dump analysis with my PoC change. Note that I used a size of 32M for the fixed-size interner. [image]

I'm planning to expose a new tableIndexConfig called onHeapDictionaryConfig that will allow us to enable interning and control the size of the fixed-size interner.

vvivekiyer · Dec 01 '23 00:12