incubator-gluten icon indicating copy to clipboard operation
incubator-gluten copied to clipboard

[VL] Rss use row-based sort

Open marin-ma opened this issue 1 year ago • 13 comments

marin-ma avatar Aug 01 '24 06:08 marin-ma

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

github-actions[bot] avatar Aug 01 '24 06:08 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Aug 01 '24 06:08 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Aug 01 '24 09:08 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Aug 06 '24 07:08 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Aug 28 '24 07:08 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Aug 28 '24 07:08 github-actions[bot]

Run Gluten Clickhouse CI

github-actions[bot] avatar Aug 28 '24 07:08 github-actions[bot]

@kerwin-zk Could you help to check whether this patch could work? Besides, I noticed it requires some extra configurations to be set for celeborn. Here's what I found and please help to check if there are anything else needed to fully enable sort-based shuffle with celeborn. Thanks!

--conf spark.shuffle.manager=celeborn \
--conf spark.celeborn.client.spark.shuffle.writer=sort \
--conf spark.celeborn.client.shuffle.compression.codec=zstd

cc: @FelixYBW

marin-ma avatar Aug 28 '24 08:08 marin-ma

@kerwin-zk Could you help to check whether this patch could work? Besides, I noticed it requires some extra configurations to be set for celeborn. Here's what I found and please help to check if there are anything else needed to fully enable sort-based shuffle with celeborn. Thanks!

--conf spark.shuffle.manager=celeborn \
--conf spark.celeborn.client.spark.shuffle.writer=sort \
--conf spark.celeborn.client.shuffle.compression.codec=zstd

cc: @FelixYBW

@marin-ma I'll try to test it first.

kerwin-zk avatar Aug 29 '24 02:08 kerwin-zk

@marin-ma I'll try to test it first.

Is there any more config? I'm testing it in a customer case.

FelixYBW avatar Aug 29 '24 05:08 FelixYBW

--conf spark.shuffle.manager=celeborn
--conf spark.celeborn.client.spark.shuffle.writer=sort

@FelixYBW @marin-ma I haven't tested row-based sort + Celeborn yet. For columnar-based sort + Celeborn, the following settings are needed:

spark.shuffle.manager: org.apache.spark.shuffle.gluten.celeborn.CelebornShuffleManager
spark.celeborn.master.endpoints: master-1-1:9097
spark.shuffle.service.enabled: false
spark.celeborn.client.spark.push.sort.memory.threshold: 128m
spark.celeborn.client.spark.shuffle.writer: sort
spark.celeborn.client.shuffle.compression.codec: none
spark.sql.adaptive.localShuffleReader.enabled: false

kerwin-zk avatar Aug 29 '24 05:08 kerwin-zk

@floesing_pins you may take a look of this. You may set spark.gluten.sql.columnar.shuffle.sort.partitions.threshold=4000

@jingyuanzhang_pins FYI.

FelixYBW avatar Aug 29 '24 19:08 FelixYBW

fiele isSort of VeloxCelebornColumnarShuffleWriter need update

clay4megtr avatar Oct 11 '24 17:10 clay4megtr

Run Gluten Clickhouse CI on x86

github-actions[bot] avatar Nov 13 '24 03:11 github-actions[bot]

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Dec 29 '24 02:12 github-actions[bot]

This PR was auto-closed because it has been stalled for 10 days with no activity. Please feel free to reopen if it is still valid. Thanks.

github-actions[bot] avatar Jan 09 '25 02:01 github-actions[bot]

Run Gluten Clickhouse CI on x86

github-actions[bot] avatar Mar 18 '25 09:03 github-actions[bot]

Run Gluten Clickhouse CI on x86

github-actions[bot] avatar Mar 18 '25 14:03 github-actions[bot]

Run Gluten Clickhouse CI on x86

github-actions[bot] avatar Mar 18 '25 17:03 github-actions[bot]

cc @kerwin-zk

marin-ma avatar Mar 18 '25 20:03 marin-ma

@marin-ma Since our internal version depends on GLUTEN_RSS_SORT_SHUFFLE_WRITER, I suggest adding a configuration to control whether to use GLUTEN_SORT_SHUFFLE_WRITER or GLUTEN_RSS_SORT_SHUFFLE_WRITER.

kerwin-zk avatar Mar 19 '25 09:03 kerwin-zk

@marin-ma Since our internal version depends on GLUTEN_RSS_SORT_SHUFFLE_WRITER, I suggest adding a configuration to control whether to use GLUTEN_SORT_SHUFFLE_WRITER or GLUTEN_RSS_SORT_SHUFFLE_WRITER.

Is your internal version the same as upstream? We picked this up because one customer met OOM issue using upstream. Did you try GLUTEN_SORT_SHUFFLE_WRITER? Is there performance gap?

FelixYBW avatar Mar 19 '25 15:03 FelixYBW

@marin-ma Since our internal version depends on GLUTEN_RSS_SORT_SHUFFLE_WRITER, I suggest adding a configuration to control whether to use GLUTEN_SORT_SHUFFLE_WRITER or GLUTEN_RSS_SORT_SHUFFLE_WRITER.

Is your internal version the same as upstream? We picked this up because one customer met OOM issue using upstream. Did you try GLUTEN_SORT_SHUFFLE_WRITER? Is there performance gap?

@FelixYBW The internal version is somewhat different from the upstream version, and ShuffleRead may have memory issues(#9069). I previously tested GLUTEN_SORT_SHUFFLE_WRITER, and there were cases where it failed to process very large datasets. Therefore, I think it would be more appropriate to control whether to use GLUTEN_SORT_SHUFFLE_WRITER or GLUTEN_RSS_SORT_SHUFFLE_WRITER based on a parameter.

kerwin-zk avatar Mar 20 '25 02:03 kerwin-zk

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar May 05 '25 02:05 github-actions[bot]

@marin-ma Let's add a config to do the switch.

FelixYBW avatar May 05 '25 16:05 FelixYBW

@kerwin-zk The OOM issue is fixed https://github.com/apache/incubator-gluten/pull/9221

FelixYBW avatar May 05 '25 16:05 FelixYBW

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Jun 20 '25 02:06 github-actions[bot]

This PR was auto-closed because it has been stalled for 10 days with no activity. Please feel free to reopen if it is still valid. Thanks.

github-actions[bot] avatar Jun 30 '25 02:06 github-actions[bot]