
[SUPPORT] Introduce in-memory-cache for ExternalSpillableMap

Open TheR1sing3un opened this issue 1 year ago • 7 comments

Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at [email protected].

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

Should we add a new ExternalSpillableMap that can cache key-value pairs in memory? I find that the current ExternalSpillableMap's in-memory map simply keeps whichever keys arrive first. Only those earliest keys can be cache hits when get is called, and the 'unlucky' keys cause random read operations on disk. So, could we introduce an in-memory map that caches key-value pairs with different eviction strategies (e.g., LRU, LFU)? In my implementation I would not break the existing behavior, but add a new type that extends ExternalSpillableMap to provide this functionality.
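A rough sketch of the idea (the names LruCachedSpillableMap and DiskBackedMap below are hypothetical and do not reflect the real ExternalSpillableMap API; this only illustrates the LRU-eviction behavior, not a final implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch: a disk-backed map fronted by a bounded LRU cache.
 * DiskBackedMap stands in for the spill-to-disk layer (e.g. BITCASK or ROCKS_DB).
 */
public class LruCachedSpillableMap<K, V> {

  /** Stand-in interface for the on-disk map. */
  public interface DiskBackedMap<K, V> {
    V get(K key);
    void put(K key, V value);
  }

  private final DiskBackedMap<K, V> diskMap;
  private final LinkedHashMap<K, V> lruCache;

  public LruCachedSpillableMap(DiskBackedMap<K, V> diskMap, int maxInMemoryEntries) {
    this.diskMap = diskMap;
    // accessOrder = true turns LinkedHashMap into an LRU structure;
    // removeEldestEntry spills the coldest entry back to disk when the cache is full.
    this.lruCache = new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        if (size() > maxInMemoryEntries) {
          diskMap.put(eldest.getKey(), eldest.getValue());
          return true;
        }
        return false;
      }
    };
  }

  public V get(K key) {
    V value = lruCache.get(key);   // hot keys are served from memory
    if (value == null) {
      value = diskMap.get(key);    // cold keys fall back to a disk read
      if (value != null) {
        lruCache.put(key, value);  // promote recently read keys into the cache
      }
    }
    return value;
  }

  public void put(K key, V value) {
    lruCache.put(key, value);      // eviction to disk happens via removeEldestEntry
  }
}
```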

Looking forward to your thoughts!

To Reproduce

Steps to reproduce the behavior:

Expected behavior

A clear and concise description of what you expected to happen.

Environment Description

  • Hudi version :

  • Spark version :

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) :

  • Running on Docker? (yes/no) :

Additional context

Add any other context about the problem here.

Stacktrace

Add the stacktrace of the error.

TheR1sing3un avatar Aug 01 '24 07:08 TheR1sing3un

So, could we introduce an in-memory map that caches key-value pairs with different eviction strategies (e.g., LRU, LFU)? In my implementation I

+1

danny0405 avatar Aug 01 '24 08:08 danny0405

So, could we introduce an in-memory map that caches key-value pairs with different eviction strategies (e.g., LRU, LFU)? In my implementation I

+1

Thanks for your response! If this solution turns out to be useful, I am willing to contribute the code.

TheR1sing3un avatar Aug 01 '24 08:08 TheR1sing3un

@TheR1sing3un hi, this is a great capability, but before starting, it is necessary to confirm in which scenarios it can bring benefits. Could you provide a few scenarios?

KnightChess avatar Aug 02 '24 02:08 KnightChess

@TheR1sing3un hi, this is a great capability, but before starting, it is necessary to confirm in which scenarios it can bring benefits. Could you provide a few scenarios?

Assume the following scenario:

  • a few high-frequency keys are updated very often
  • these keys first appear relatively late in the log files
  • MaxMemoryForCompaction is relatively small

When a compaction/log-compaction runs and these keys were not loaded into the in-memory map up front, the frequent get calls in the compaction logic result in a large number of random reads on disk (both the bitcask and rocksdb disk maps are log-structured, append-only data structures: high performance for writes but low performance for random reads), which causes severe performance degradation. If the in-memory map has an eviction and update mechanism such as LRU or LFU, the memory hit ratio will improve in these scenarios and compaction performance will be better.
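For the compaction read path, one possible shape of such a cache is sketched below. This is only an illustration under assumptions: diskMap is a hypothetical stand-in for the spilled bitcask/rocksdb map (not Hudi's actual API), and the Caffeine library with its frequency-aware W-TinyLFU eviction is just one candidate policy, not something the current code uses:

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CompactionReadCacheSketch {

  public static void main(String[] args) {
    // Hypothetical stand-in for the spilled disk map.
    Map<String, String> diskMap = new ConcurrentHashMap<>();
    diskMap.put("hot-key", "payload-v3");

    // Size-bounded cache with frequency-aware eviction: hot keys stay in memory
    // even if they first appeared late in the log files.
    LoadingCache<String, String> readCache = Caffeine.newBuilder()
        .maximumSize(10_000)          // bounded by the compaction memory budget
        .recordStats()
        .build(diskMap::get);         // a cache miss falls back to a disk read

    // Frequent gets during merging hit memory after the first disk read.
    for (int i = 0; i < 1_000; i++) {
      readCache.get("hot-key");
    }
    System.out.println("hit rate: " + readCache.stats().hitRate());
  }
}
```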

TheR1sing3un avatar Aug 02 '24 03:08 TheR1sing3un

@TheR1sing3un oh I see, the MOR log file merge will benefit from it, thanks for the answer.

KnightChess avatar Aug 02 '24 03:08 KnightChess

@TheR1sing3un Were we able to work on this? I was not able to locate a JIRA or PR, so I created one JIRA for tracking - https://issues.apache.org/jira/browse/HUDI-8114

ad1happy2go avatar Aug 22 '24 09:08 ad1happy2go

@TheR1sing3un Were we able to work on this? I was not able to locate a JIRA or PR, so I created one JIRA for tracking - https://issues.apache.org/jira/browse/HUDI-8114

Thank you! I am already working on it and will submit a PR later; please assign it to me in JIRA! (I have left a comment on the JIRA.)

TheR1sing3un avatar Aug 22 '24 09:08 TheR1sing3un