[SUPPORT] Introduce in-memory-cache for ExternalSpillableMap
Describe the problem you faced
Should we introduce a new ExternalSpillableMap that can cache key-value pairs in memory?
I find that the current ExternalSpillableMap's in-memory map keeps whichever keys arrive earliest. Only those earliest keys can be hit in memory when get is called; every other "unlucky" key triggers a random read on disk.
So can we introduce an in-memory map that caches key-value pairs with different eviction strategies, e.g. LRU, LFU...
In my implementation, I would not break the existing behavior, but would add a new type that inherits from ExternalSpillableMap to implement the functionality above.
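As a rough illustration of the idea (hypothetical names, not Hudi's actual API), the in-memory tier could be a size-bounded LRU map built on `java.util.LinkedHashMap` with access order; entries evicted from it would fall back to the on-disk store:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a size-bounded LRU map that could serve as the
// in-memory tier of a spillable map. In a real implementation, entries
// evicted here would be spilled to the on-disk store (not shown).
public class LruInMemoryMap<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  public LruInMemoryMap(int maxEntries) {
    // accessOrder = true: iteration order follows access recency,
    // so the eldest entry is the least recently used one.
    super(16, 0.75f, true);
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // Evict once the in-memory budget is exceeded.
    return size() > maxEntries;
  }
}
```

With this policy, a recently read key stays in memory even if it was inserted late, which is exactly the behavior the current first-come map lacks.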
Looking forward to your thoughts!
> So can we introduce an in-memory map used for caching key-value pairs with different eviction strategies, e.g. LRU, LFU... In my implementation, I
+1
Thanks for your responses! If this solution turns out to be genuinely useful, I am willing to try to contribute the code.
@TheR1sing3un hi, this is a great capability, but before starting, it is necessary to confirm in which scenarios it can bring benefits. Could you provide a few scenarios?
Assume the following scenario:
- some high-frequency keys are updated frequently
- the first appearance of these keys in the log files comes relatively late
- MaxMemoryForCompaction is relatively small

When a compaction/log-compaction happens and these keys were not initially loaded into the in-memory map, the subsequent frequent `get` calls in the compaction logic result in a large number of random reads on disk (both Bitcask and RocksDB use log-structured, append-only file formats: high performance for writes but low performance for random reads), which causes severe performance degradation. If the in-memory map had an eviction/update policy such as LRU or LFU, the memory hit ratio would improve in these scenarios, yielding better compaction performance.
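The scenario above can be shown with a toy simulation (hypothetical code, not Hudi's): an insertion-capped map that only keeps the earliest keys misses a hot key that arrives late, while an LRU map retains it.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class CachePolicyDemo {
  // Replays a write sequence against either an insertion-capped map
  // (mimicking the current "earliest keys win" behavior) or an LRU map,
  // then reports whether the given key survived in memory.
  static boolean survivesInMemory(String key, String[] writes, int capacity, boolean useLru) {
    Map<String, Integer> map = useLru
        ? new LinkedHashMap<String, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Integer> e) {
              return size() > capacity; // evict least recently used
            }
          }
        : new HashMap<>();
    for (String k : writes) {
      // The insertion-capped map silently drops keys once it is full.
      if (useLru || map.size() < capacity) {
        map.put(k, 0);
      }
    }
    return map.containsKey(key);
  }

  public static void main(String[] args) {
    // Cold keys arrive first; the hot key arrives late.
    String[] writes = {"cold1", "cold2", "hot"};
    System.out.println(survivesInMemory("hot", writes, 2, false)); // first-come: false
    System.out.println(survivesInMemory("hot", writes, 2, true));  // LRU: true
  }
}
```

Under the first-come policy, every lookup of `hot` would fall through to a random disk read; under LRU it is served from memory after the first access.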
@TheR1sing3un Oh I see, the MOR log file merge will benefit from it. Thanks for your answer.
@TheR1sing3un Were we able to work on this? I was not able to locate a JIRA or PR, so I created one for tracking: https://issues.apache.org/jira/browse/HUDI-8114
Thank you! I am already working on it and will submit a PR later. Please assign it to me in JIRA! (I have left a comment on the JIRA.)