incubator-uniffle [Performance Optimization] The huge performance drop due to the method of getBlockIdsByPartitionId

Background

I found that when getting the shuffle result, the flame graph shows that the getBlockIdsByPartitionId method takes up too much time. [image: flame graph]

However, I found that the default bitmapNum is 1, and the spark.rss.client.bitmap.splitNum conf looks like it will be abandoned in the future.

Jul 27 '22 11:07 zuston

@colinmjj

Jul 28 '22 02:07 zuston

The performance problem of getBlockIdsByPartitionId is a known issue. With the current design, blockIds have to be stored in the shuffle server to support features like block filtering, data validation, etc. The challenge is how to store blockIds in the shuffle server: we need to consider how to reduce memory cost and still get good performance. The current implementation aims to reduce memory cost in the shuffle server, at the price of poor performance when getting blockIds; otherwise, GC would become another problem. I think improving this will take a lot of work, and may need a new design for how to do data validation and how to store blockIds.
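To make the trade-off above concrete, here is a minimal sketch, not Uniffle's actual code. It assumes the partition ID is encoded inside each blockId (the bit widths below are hypothetical) and uses plain Python sets as stand-ins for the server-side bitmaps. `SingleSetStore` and `PerPartitionStore` are illustrative names only. The first keeps one compact structure but must scan every blockId on each lookup; the second answers lookups directly but keeps one structure per partition, which costs more memory.

```python
from collections import defaultdict

# Hypothetical blockId layout, for illustration only:
# [ sequence number | partition ID | task attempt ID ]
PARTITION_BITS = 24
TASK_ATTEMPT_BITS = 21
PARTITION_MASK = (1 << PARTITION_BITS) - 1


def make_block_id(seq_no, partition_id, task_attempt_id):
    """Pack the three fields into a single integer blockId."""
    return (
        (seq_no << (PARTITION_BITS + TASK_ATTEMPT_BITS))
        | (partition_id << TASK_ATTEMPT_BITS)
        | task_attempt_id
    )


def partition_of(block_id):
    """Decode the partition ID from a blockId."""
    return (block_id >> TASK_ATTEMPT_BITS) & PARTITION_MASK


class SingleSetStore:
    """One set for the whole shuffle: low overhead, but lookups scan everything."""

    def __init__(self):
        self.block_ids = set()

    def add(self, block_id):
        self.block_ids.add(block_id)

    def get_block_ids_by_partition_id(self, partition_id):
        # Cost grows with the total number of blocks, not the partition's blocks.
        return {b for b in self.block_ids if partition_of(b) == partition_id}


class PerPartitionStore:
    """One set per partition: direct lookups, but more objects held in memory."""

    def __init__(self):
        self.by_partition = defaultdict(set)

    def add(self, block_id):
        self.by_partition[partition_of(block_id)].add(block_id)

    def get_block_ids_by_partition_id(self, partition_id):
        # .get() avoids creating an empty entry for an unknown partition.
        return set(self.by_partition.get(partition_id, ()))
```

Both stores return the same blockIds for a given partition; they differ only in how much work a lookup does and how many structures stay resident, which is where the memory and GC concern comes in.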

Jul 28 '22 02:07 colinmjj

Got it. If we come up with a better design for this, I think it will achieve better performance.

Jul 28 '22 03:07 zuston