[Improvement] Introduce local allocation buffer to store blocks in memory
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Search before asking
- [X] I have searched in the issues and found no similar issues.
What would you like to be improved?
Currently we put the shuffle data into off-heap memory in the shuffle server. But I found it still occupies a lot of heap memory.
The following is the output of `jmap -histo`:

```
1: 189601376 16684921088 io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeDirectByteBuf
2: 189860728 15188858240 java.nio.DirectByteBuffer
3: 189605871 13651622712 jdk.internal.ref.Cleaner
4: 189018520 10585037120 org.apache.uniffle.common.ShufflePartitionedBlock
5: 189605871  7584234840 java.nio.DirectByteBuffer$Deallocator
```
From the above results, we can see that the main cause of the high heap usage is the sheer number of blocks, and there are so many blocks because each block is very small.
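A back-of-envelope calculation over the histogram counts above makes the per-block cost concrete (the numbers are copied from the `jmap -histo` output; the ~336-byte figure covers only the five classes listed, so the true overhead is at least this much):

```java
public class PerBlockOverhead {
    public static void main(String[] args) {
        // Instance count of ShufflePartitionedBlock from the histogram.
        long blocks = 189_018_520L;
        // Sum of the heap bytes of the top five classes in the histogram.
        long heapBytes = 16_684_921_088L   // InstrumentedUnpooledUnsafeDirectByteBuf
                       + 15_188_858_240L   // java.nio.DirectByteBuffer
                       + 13_651_622_712L   // jdk.internal.ref.Cleaner
                       + 10_585_037_120L   // ShufflePartitionedBlock
                       +  7_584_234_840L;  // DirectByteBuffer$Deallocator
        System.out.println(heapBytes / blocks + " bytes of heap per block"); // 336 bytes of heap per block
    }
}
```

So every block, no matter how little shuffle data it holds, pays roughly 336 bytes of heap bookkeeping; with ~189 million blocks that alone is ~60 GB.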
How should we improve?
Introduce a local allocation buffer, similar to MSLAB in HBase.
Refer: https://hbase.apache.org/book.html#gcpause
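A minimal sketch of the idea, in the spirit of HBase's MSLAB: carve many small blocks out of one large direct chunk via `ByteBuffer.slice()`, so the JVM tracks one `DirectByteBuffer`/`Cleaner` pair per chunk instead of one per block. The class name, `CHUNK_SIZE`, and the oversized-request fallback are illustrative assumptions, not Uniffle's actual API:

```java
import java.nio.ByteBuffer;

public class LocalAllocationBuffer {
    // Chunk size is an assumption; HBase's MSLAB default is 2 MiB.
    private static final int CHUNK_SIZE = 2 * 1024 * 1024;
    private ByteBuffer currentChunk;

    /** Returns a buffer of exactly `size` bytes, sliced from the current chunk. */
    public synchronized ByteBuffer allocate(int size) {
        if (size > CHUNK_SIZE) {
            // Oversized requests bypass the buffer, as MSLAB does in HBase.
            return ByteBuffer.allocateDirect(size);
        }
        if (currentChunk == null || currentChunk.remaining() < size) {
            // Start a new chunk; the old one is kept alive by its slices.
            currentChunk = ByteBuffer.allocateDirect(CHUNK_SIZE);
        }
        int oldLimit = currentChunk.limit();
        currentChunk.limit(currentChunk.position() + size);
        ByteBuffer slice = currentChunk.slice(); // shares memory with the chunk
        currentChunk.position(currentChunk.limit());
        currentChunk.limit(oldLimit);
        return slice;
    }

    public static void main(String[] args) {
        LocalAllocationBuffer lab = new LocalAllocationBuffer();
        ByteBuffer a = lab.allocate(100);
        ByteBuffer b = lab.allocate(200);
        System.out.println(a.capacity() + " " + b.capacity()); // 100 200
    }
}
```

One caveat this sketch shares with MSLAB: a chunk can only be freed once all blocks sliced from it are released, so blocks with very different lifetimes should not share a chunk.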
Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
@jerqi @zuston @advancedxy @rickyma PTAL. I'm quite busy recently. If anyone is interested in it, feel free to pick it up.
This issue seems feasible. I'll take a look first. We need this too.
Currently, there are a few things that we can do to make blocks larger:
- Set `spark.rss.writer.buffer.spill.size` to a higher value to make blocks larger, e.g. `1g` or `2g`.
- Set `rss.client.memory.spill.ratio` to less than `0.5`, e.g. `0.3`, to let larger blocks spill first.
- Set `spark.rss.writer.buffer.size` to a larger value, e.g. `10m`; refer to https://github.com/apache/incubator-uniffle/issues/1594#issuecomment-2081378887.
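Putting the three settings above together, a client-side configuration fragment might look like the following (property names are copied from the list above; the values are only the examples given there and should be tuned per workload):

```properties
# Spill the writer buffer less often, so spilled blocks are larger.
spark.rss.writer.buffer.spill.size 1g
# Spill when 30% of memory is used, letting the larger blocks go first.
rss.client.memory.spill.ratio 0.3
# Larger per-partition writer buffer, so individual blocks grow bigger.
spark.rss.writer.buffer.size 10m
```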