[BUG] Spill occurs in GpuAggregate when GPU batch size is reduced

Open sperlingxx opened this issue 1 month ago • 4 comments

Describe the bug

When running a heavy GpuAggregate with over 400 aggregate functions (including hundreds of calls to the complex function stddev_pop), a significant amount of spill is observed in the map stage when using a relatively small GPU batch size (spark.rapids.sql.batchSizeBytes=512MB).
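For anyone trying to reproduce this, here is a minimal sketch of how the batch size setting mentioned above could be turned into spark-submit arguments. The helper names `size_to_bytes` and `batch_size_conf` are hypothetical and exist only for illustration, and the sketch assumes that "MB" in this report means 1024 * 1024 bytes and that the value is passed as a plain byte count.

```python
# Hypothetical helpers for building the batch size setting used in this report.
# Assumption: "MB" here means 1024 * 1024 bytes.

def size_to_bytes(size_mb):
    """Convert a size given in MB to a byte count."""
    return size_mb * 1024 * 1024


def batch_size_conf(size_mb):
    """Return spark-submit arguments that set spark.rapids.sql.batchSizeBytes."""
    return ["--conf", f"spark.rapids.sql.batchSizeBytes={size_to_bytes(size_mb)}"]


if __name__ == "__main__":
    # Prints the arguments for the 512MB case that showed the spill.
    print(" ".join(batch_size_conf(512)))
```

The printed arguments can be appended to an existing spark-submit command line; passing a different size to `batch_size_conf` gives the setting for a comparison run.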

However, spill does not occur with a larger GPU batch size (spark.rapids.sql.batchSizeBytes=2048MB). Accordingly, the execution time is much shorter:

Nov 26 '25 02:11 sperlingxx