incubator-uniffle

[Performance Optimization] The huge performance drop due to the method of getBlockIdsByPartitionId

Open zuston opened this issue 3 years ago • 3 comments

Background

When fetching the shuffle result, the flame graph shows that the getBlockIdsByPartitionId method takes up too much of the time. [Image: flame graph, reliao_img_1658922962790]

However, the default bitmapNum is 1, and the spark.rss.client.bitmap.splitNum config looks like it will be abandoned in the future.

zuston avatar Jul 27 '22 11:07 zuston

@colinmjj

zuston avatar Jul 28 '22 02:07 zuston

The performance problem of getBlockIdsByPartitionId is a known issue. With the current design, blockIds have to be stored in the shuffle server to support features such as block filtering and data validation. The challenge is how to store blockIds in the shuffle server while both reducing memory cost and getting better performance. The current implementation is aimed at reducing memory cost in the shuffle server, but it performs poorly when getting blockIds; otherwise, GC would become another problem. I think improving this will take a lot of work, possibly a new design for how to do data validation and how to store blockIds.
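To make the trade-off concrete, here is a minimal sketch and not the actual Uniffle code. It assumes a hypothetical blockId layout of [sequence | partitionId | taskAttemptId] with made-up bit widths, and plain lists stand in for the server-side bitmaps. The class name `BlockIdStoreSketch` and its helpers are illustrative only. The idea is that when all blockIds share one container (bitmapNum = 1), a per-partition lookup has to decode and filter every stored ID, while sharding by partitionId cuts the scan at the price of more containers in memory.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and bit widths are assumptions, not Uniffle's API.
final class BlockIdStoreSketch {
    // Hypothetical blockId layout: [sequence | partitionId | taskAttemptId].
    static final int PARTITION_BITS = 24;
    static final int TASK_BITS = 21;

    // One list per "bitmap"; bitmapNum = 1 means a single shared container.
    private final List<List<Long>> shards = new ArrayList<>();
    private long scanned = 0; // IDs inspected by the most recent lookup

    BlockIdStoreSketch(int bitmapNum) {
        for (int i = 0; i < bitmapNum; i++) {
            shards.add(new ArrayList<>());
        }
    }

    static long encode(long sequence, int partitionId, int taskAttemptId) {
        return (sequence << (PARTITION_BITS + TASK_BITS))
            | ((long) partitionId << TASK_BITS)
            | taskAttemptId;
    }

    static int partitionOf(long blockId) {
        return (int) ((blockId >>> TASK_BITS) & ((1L << PARTITION_BITS) - 1));
    }

    void add(long blockId) {
        // Route each ID to a shard chosen by its partition.
        shards.get(partitionOf(blockId) % shards.size()).add(blockId);
    }

    List<Long> getBlockIdsByPartitionId(int partitionId) {
        List<Long> shard = shards.get(partitionId % shards.size());
        List<Long> result = new ArrayList<>();
        scanned = 0;
        // Every ID in the shard must be decoded and compared.
        for (long blockId : shard) {
            scanned++;
            if (partitionOf(blockId) == partitionId) {
                result.add(blockId);
            }
        }
        return result;
    }

    long lastScanned() {
        return scanned;
    }
}
```

In this sketch, with 8 partitions of 10 blocks each, a lookup for one partition inspects all 80 IDs when bitmapNum is 1 and 20 IDs when bitmapNum is 4. A fully per-partition index would inspect only the matching IDs, but it creates many more long-lived objects, which is the memory and GC concern raised above.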

colinmjj avatar Jul 28 '22 02:07 colinmjj

Got it. If we come up with a better design for this, I think it will achieve better performance.

zuston avatar Jul 28 '22 03:07 zuston