
[BUG] mars shuffle function not well-distributed

Open chaokunyang opened this issue 3 years ago • 0 comments

Describe the bug Groupby shuffle keys are not well distributed across groups. In an online case with 100_000_000 lines and a chunk size of 200_000, some groups have about 24000 keys, but most groups have fewer than 5000 keys. The overall process is dominated by the groups with many keys, and execution is 5 times slower than expected.

To Reproduce To help us reproduce this bug, please provide the information below:

  1. Your Python version: 3.7
  2. The version of Mars you use: master
  3. Versions of crucial packages, such as numpy, scipy and pandas
  4. Full stack of the error.
  5. Minimized code to reproduce the error.

Expected behavior The keys should be well distributed. This is not data skew. With data skew, some key groups have much more data than other groups; the issue here is that some chunks have many more keys than other chunks.
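To make the distinction concrete, here is a minimal sketch in plain Python that counts distinct keys per shuffle partition. It is not Mars's shuffle implementation: `partition_stats`, `modulo_partition`, and `digest_partition` are hypothetical names used only for illustration, and the key pattern (multiples of 4) is an assumed example of keys that interact badly with a weak partition function. Every key appears exactly once, so there is no data skew, yet the number of keys per partition can still be very uneven.

```python
import hashlib
from collections import defaultdict


def partition_stats(keys, n_partitions, partition_fn):
    """Return the number of distinct keys routed to each partition."""
    distinct = defaultdict(set)
    for key in keys:
        distinct[partition_fn(key, n_partitions)].add(key)
    return [len(distinct[p]) for p in range(n_partitions)]


def modulo_partition(key, n):
    # Weak partitioner (illustrative): keeps any regular pattern in the keys.
    return key % n


def digest_partition(key, n):
    # Stronger partitioner (illustrative): mix the key bits before the modulo.
    digest = hashlib.blake2b(str(key).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % n


# Assumed input: 10,000 unique keys, each a multiple of 4, one row per key.
keys = [k * 4 for k in range(10_000)]

keys_mod = partition_stats(keys, 8, modulo_partition)
keys_dig = partition_stats(keys, 8, digest_partition)

print("modulo :", keys_mod)
print("digest :", keys_dig)
```

With the modulo partitioner, all keys land in two of the eight partitions even though no single key is heavy. Running the same check against the real shuffle output (distinct keys per reducer chunk) would be one way to confirm the imbalance described above.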

chaokunyang • Apr 19 '22 07:04