Gluten shuffle data size is twice that of vanilla Spark, with Celeborn as the remote shuffle service

Open lifulong opened this issue 5 months ago • 8 comments

Description

vanilla spark

Image Image

gluten

Image Image

shuffle from aggregate after data union

Image

Gluten version

None

lifulong avatar Nov 03 '25 06:11 lifulong

Did you try spark.gluten.sql.columnar.shuffle.celeborn.useRssSort=false?

FelixYBW avatar Nov 03 '25 20:11 FelixYBW

Our Gluten version is 1.4, which does not have this conf yet.

lifulong avatar Nov 05 '25 09:11 lifulong

Could we try this case with version 1.5.0? It looks like there’s a fix for issue https://github.com/apache/incubator-gluten/issues/9163 — could you check if it works for you?

jackylee-ch avatar Nov 05 '25 10:11 jackylee-ch

Image

Our shuffle partitions conf is 2000, and spark.celeborn.client.spark.shuffle.writer uses the default, hash shuffle. This does not look like the same issue: the shuffle in https://github.com/apache/incubator-gluten/issues/9163 is sort-based.

Anyway, we will upgrade our Gluten version later, and I will try this job on the new version.

lifulong avatar Nov 05 '25 11:11 lifulong

I will first test with spark.celeborn.client.spark.shuffle.writer set to rss_sort.

lifulong avatar Nov 05 '25 11:11 lifulong

Image

With spark.celeborn.client.spark.shuffle.writer set to sort, the shuffle data is much smaller and performance improves by 10%, but the job is still slower than vanilla Spark.

lifulong avatar Nov 07 '25 09:11 lifulong

This was also observed by another customer. Can you port the PR and test spark.gluten.sql.columnar.shuffle.celeborn.useRssSort=false?

FelixYBW avatar Nov 07 '25 09:11 FelixYBW

cc @kerwin-zk

Yohahaha avatar Nov 10 '25 03:11 Yohahaha
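
For anyone reproducing this, the two settings discussed in the thread could be passed to a job roughly as sketched below. This is only an illustration, not a confirmed fix: the keys and values are the ones quoted in the comments above, the helper name `to_submit_args` is made up for this example, and per the discussion `spark.gluten.sql.columnar.shuffle.celeborn.useRssSort` is not available in Gluten 1.4 unless the corresponding PR is ported.

```python
import shlex

# Settings quoted in this thread. The values are the ones under test,
# not recommendations; useRssSort is reported above as missing in Gluten 1.4.
CONFS = {
    "spark.celeborn.client.spark.shuffle.writer": "sort",
    "spark.gluten.sql.columnar.shuffle.celeborn.useRssSort": "false",
}


def to_submit_args(confs):
    """Turn a dict of Spark confs into a flat list of --conf arguments.

    Hypothetical helper for this example only; keys are sorted so the
    resulting command line is stable between runs.
    """
    args = []
    for key, value in sorted(confs.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Print the spark-submit prefix; append the job's own jar/class and arguments.
print(shlex.join(["spark-submit", *to_submit_args(CONFS)]))
```

Comparing the shuffle write size in the Spark UI with each setting toggled separately should show which of the two accounts for the difference.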