[BUG] mars shuffle function not well-distributed
Describe the bug
Groupby shuffle keys are not well-distributed across groups. In an online case with 10000_0000 (100 million) rows and a chunk size of 20_0000 (200,000), some groups have about 24000 keys, but most groups have fewer than 5000 keys. The overall process is dominated by the groups with many keys, and execution is about 5 times slower than expected.

To Reproduce
To help us reproduce this bug, please provide the information below:
- Your Python version: 3.7
- The version of Mars you use: master
- Versions of crucial packages, such as numpy, scipy and pandas
- Full stack of the error.
- Minimized code to reproduce the error.
Expected behavior
The keys should be well-distributed. This is not data skew. With data skew, some key groups would have much more data than other groups; the issue here is that some chunks have many more keys than other chunks.
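
To make the distinction above concrete, here is a minimal sketch of how the number of distinct keys per shuffle bucket could be measured. It is not Mars code and does not claim to show the cause of this bug: `bucket_of`, `keys_per_bucket`, and `imbalance` are hypothetical helpers, the MD5-based partitioner is only a stand-in for a well-mixing hash, and the `k % n` partitioner is an assumed example of a poorly mixing one.

```python
import hashlib
from collections import Counter


def bucket_of(key, n_buckets):
    """Map a key to a bucket with a stable hash (hypothetical stand-in for a shuffle partitioner)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % n_buckets


def keys_per_bucket(keys, n_buckets, partitioner=bucket_of):
    """Count how many distinct keys are routed to each bucket."""
    counts = Counter(partitioner(k, n_buckets) for k in set(keys))
    return [counts.get(i, 0) for i in range(n_buckets)]


def imbalance(counts):
    """Ratio of the largest bucket to the mean bucket size; 1.0 means perfectly even."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean


# Assumed test data: 100,000 distinct keys that are all multiples of 8.
keys = range(0, 100_000 * 8, 8)
even = keys_per_bucket(keys, 16)
skewed = keys_per_bucket(keys, 16, partitioner=lambda k, n: k % n)

print("md5-based partitioner imbalance:", imbalance(even))
print("k % n partitioner imbalance:   ", imbalance(skewed))
```

With these inputs the `k % n` partitioner sends every key to bucket 0 or bucket 8, so its imbalance is 8.0, while the MD5-based one stays close to 1. Every key here carries the same amount of data, so the slowdown in such a case would come from the key-to-bucket mapping rather than from data skew.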