incubator-gluten Gluten shuffle data size is twice that of vanilla Spark shuffle data size, with Celeborn as remote shuffle service

Description

vanilla spark

gluten

The shuffle comes from an aggregate after a data union.

Gluten version

None

Nov 03 '25 06:11 lifulong

Did you try spark.gluten.sql.columnar.shuffle.celeborn.useRssSort=false?

Nov 03 '25 20:11 FelixYBW

> Did you try spark.gluten.sql.columnar.shuffle.celeborn.useRssSort=false?

Our Gluten version is 1.4, which does not have this conf yet.

Nov 05 '25 09:11 lifulong

Could we try this case with version 1.5.0? It looks like there’s a fix for issue https://github.com/apache/incubator-gluten/issues/9163. Could you check if it works for you?

Nov 05 '25 10:11 jackylee-ch

> Could we try this case with version 1.5.0? It looks like there’s a fix for issue #9163. Could you check if it works for you?

Our shuffle partitions conf is 2000, and spark.celeborn.client.spark.shuffle.writer uses the default, hash shuffle, so this does not look like the same issue: the shuffle in https://github.com/apache/incubator-gluten/issues/9163 is sort-based.

Anyway, we will upgrade our Gluten version later, and I will try this job on the new version.

Nov 05 '25 11:11 lifulong

I will first test with spark.celeborn.client.spark.shuffle.writer set to rss_sort.

Nov 05 '25 11:11 lifulong

> I will first test with spark.celeborn.client.spark.shuffle.writer set to rss_sort.

With spark.celeborn.client.spark.shuffle.writer set to sort, the shuffle data is much smaller and performance improves by 10%, but it is still slower than vanilla Spark.
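For anyone who wants to repeat this comparison, here is a minimal sketch that renders the conf overrides discussed in this thread as spark-submit flags. The helper `conf_flags` and the run names are hypothetical and only for illustration; the conf keys and values are the ones quoted above, and the useRssSort key is assumed to exist only on Gluten builds that include it (it was reported missing in 1.4).

```python
# Hypothetical helper (not part of Spark, Gluten, or Celeborn):
# turns a dict of conf overrides into spark-submit arguments.
def conf_flags(overrides):
    """Return spark-submit arguments, one `--conf key=value` pair per setting."""
    flags = []
    for key, value in sorted(overrides.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


WRITER_KEY = "spark.celeborn.client.spark.shuffle.writer"
USE_RSS_SORT_KEY = "spark.gluten.sql.columnar.shuffle.celeborn.useRssSort"

# The two runs compared in this thread: default hash writer vs. sort writer.
hash_run = {WRITER_KEY: "hash"}
sort_run = {WRITER_KEY: "sort"}

# The extra setting suggested earlier in the thread; assumption: it is only
# recognized by Gluten builds that ship this conf.
suggested_run = dict(sort_run)
suggested_run[USE_RSS_SORT_KEY] = "false"

for name, run in [("hash", hash_run), ("sort", sort_run), ("suggested", suggested_run)]:
    print(name + ":", " ".join(conf_flags(run)))
```

Running the same job once per flag set and comparing the shuffle write size in the Spark UI keeps the comparison to one changed setting at a time.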

Nov 07 '25 09:11 lifulong

> With spark.celeborn.client.spark.shuffle.writer set to sort, the shuffle data is much smaller and performance improves by 10%, but it is still slower than vanilla Spark.

This has also been observed by another customer. Can you port the PR and test spark.gluten.sql.columnar.shuffle.celeborn.useRssSort=false?

Nov 07 '25 09:11 FelixYBW

cc @kerwin-zk

Nov 10 '25 03:11 Yohahaha