Reduce Heap Usage of OnHeapStringDictionary
Our OnHeapStringDictionary implementation can waste a lot of heap if there are enough duplicates in a column.
Below is the JXray analysis of the heap dump for one use case at LinkedIn, where OnHeapStringDictionary uses about 13 GB of heap.
String interning, described in https://www.baeldung.com/string/intern, solves this problem. However, for certain high-cardinality columns (even ones with enough duplicates), interning can be counterproductive. We can address this with a fixed-size interner, as described in the following article: https://dzone.com/articles/duplicate-strings-how-to-get-rid-of-them-and-save.
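To make the idea concrete, here is a minimal sketch of a fixed-size interner. It is not the code from the PoC; the class name FixedSizeInterner, the power-of-two capacity requirement, and the hash-spreading step are assumptions made for illustration. Each string hashes to one slot in a bounded array: an equal string already in that slot is reused, otherwise the slot is overwritten. The memory overhead is therefore capped by the array size regardless of column cardinality, at the cost of missing some duplicates when slots collide.

```java
// Hypothetical sketch of a fixed-size string interner (not Pinot's actual class).
final class FixedSizeInterner {
    private final String[] cache; // bounded cache; never grows
    private final int mask;       // capacity - 1, valid because capacity is a power of two

    FixedSizeInterner(int capacity) {
        // Assumption for this sketch: capacity must be a positive power of two
        // so that a slot can be selected with a bit mask.
        if (capacity <= 0 || (capacity & (capacity - 1)) != 0) {
            throw new IllegalArgumentException("capacity must be a positive power of two");
        }
        this.cache = new String[capacity];
        this.mask = capacity - 1;
    }

    String intern(String s) {
        if (s == null) {
            return null;
        }
        int h = s.hashCode();
        // Fold the high bits into the low bits before masking.
        int slot = (h ^ (h >>> 16)) & mask;
        String cached = cache[slot];
        if (s.equals(cached)) {
            // Duplicate found: return the cached instance so the caller's copy can be collected.
            return cached;
        }
        // Empty slot or a different string: overwrite. The previous occupant is simply
        // evicted, so a later duplicate of it will not be deduplicated.
        cache[slot] = s;
        return s;
    }
}
```

A caller would pass every value through the interner while building the dictionary, for example `values[i] = interner.intern(values[i])`. This sketch does no synchronization; concurrent callers can race on a slot, which at worst costs a missed deduplication because the cached strings are immutable.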
I built a PoC of this change for one of our use cases and saw huge savings in heap usage. Below is the heap dump analysis with my PoC change. Note that I used a size of 32M for the fixed-size interner.
I'm planning to expose a new tableIndexConfig option called onHeapDictionaryConfig that will allow us to enable interning and control the size of the fixed-size interner.