Feature request: Improve block management for uncompressed blocks to save memory and enhance deduplication

Open wychen opened this issue 1 year ago • 1 comments

I would like to propose optimizing block management for uncompressed blocks in DwarFS. Currently, uncompressed blocks are treated the same way as compressed blocks: they are loaded into memory and read from disk sequentially, starting at the beginning of the block. This is inefficient, especially when uncompressed blocks are accessed frequently. Allowing random access to a block, without reading everything that precedes the segment we need, or not loading the block into memory at all, could save a significant amount of private memory.

mmap() could enable efficient random access to uncompressed blocks and might remove the need to load them into memory manually at all.
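To make the idea concrete, here is a minimal sketch of reading one segment out of an uncompressed block through a memory mapping. It is only an illustration, not DwarFS code: the function `read_segment`, its parameters, and the demo file layout (a 16-byte filler followed by one block) are all made up for this example.

```python
import mmap
import os
import tempfile


def read_segment(path, block_offset, segment_offset, length):
    """Return `length` bytes located `segment_offset` bytes into an
    uncompressed block that starts at `block_offset` in the file.

    All names here are hypothetical; this is not the DwarFS API.
    """
    with open(path, "rb") as f:
        # Map the file read-only. Pages are faulted in on demand and are
        # backed by the page cache, so nothing before the requested
        # segment has to be read explicitly.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            start = block_offset + segment_offset
            # Slicing a mapping copies out just the requested bytes.
            return m[start:start + length]


# Demo: a made-up image with 16 filler bytes followed by a 1024-byte block.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"H" * 16 + bytes(range(256)) * 4)
    demo_path = tmp.name

segment = read_segment(demo_path, block_offset=16, segment_offset=300, length=4)
os.remove(demo_path)
```

Only the pages that contain the requested range need to be touched, which is what makes byte-granular access to an uncompressed block cheap compared with reading the block from its start.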

This feature would also benefit mkdwarfs. If uncompressed blocks do not occupy private memory, they would not need to be counted toward the --max-lookback-blocks (-B) quota. This could effectively enlarge the deduplication lookup window without increasing the memory footprint. The idea is orthogonal to the proposal in https://github.com/mhx/dwarfs/issues/138, and the two methods can be combined to further optimize deduplication. Uncompressed blocks could still be extended at byte granularity, since mmap() allows for cheap random access.
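A rough sketch of the quota idea, under the assumption that the lookback window tracks blocks by id and that memory-mapped uncompressed blocks are simply exempt from the limit. The class `LookbackWindow` and its methods are hypothetical and do not reflect how mkdwarfs is actually implemented.

```python
from collections import deque


class LookbackWindow:
    """Hypothetical lookback window: only in-memory blocks count
    toward the limit, memory-mapped blocks are kept without limit."""

    def __init__(self, max_in_memory_blocks):
        self.max_in_memory_blocks = max_in_memory_blocks
        self.in_memory = deque()  # blocks that occupy private memory
        self.mapped = []          # uncompressed blocks accessed via mmap

    def add(self, block_id, mapped):
        if mapped:
            # Mapped blocks do not use private memory, so they are
            # never evicted by the quota.
            self.mapped.append(block_id)
            return
        self.in_memory.append(block_id)
        if len(self.in_memory) > self.max_in_memory_blocks:
            # Drop the oldest in-memory block to stay within the quota.
            self.in_memory.popleft()

    def searchable(self):
        """Block ids that deduplication could still match against."""
        return sorted(self.mapped + list(self.in_memory))


window = LookbackWindow(max_in_memory_blocks=2)
for block_id, mapped in [(0, False), (1, True), (2, False), (3, True), (4, False)]:
    window.add(block_id, mapped)

searchable = window.searchable()
```

With a limit of two in-memory blocks, block 0 is evicted while the mapped blocks 1 and 3 stay searchable, so the window covers four blocks instead of two.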

I hope this proposal makes sense and I look forward to hearing your thoughts on its feasibility.

wychen • Apr 27 '23 04:04

This is a great observation, and the first case is trivial to implement. I've got it working in a branch and will push the code once I've got a proper internet connection.

mhx • May 25 '23 13:05