Consider adding coalesce logic to GPU writes
Currently the GPU writes do not coalesce incoming batches in most cases before they are sent down to the chunked writers. Each chunked write creates a separate row group or stripe, so many tiny batches end up translating into many tiny row groups, which are inefficient to store and read.
It may be beneficial to coalesce incoming batches before sending them down to the chunked writers. Note that the writers often have significant GPU memory usage of their own, so we don't want to get too carried away with building large batches to write.
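As a rough illustration of the intended behavior, here is a minimal, size-bounded coalescing sketch. All names here (`coalesce_batches`, `target_bytes`, and bytes standing in for column batches) are hypothetical and not the plugin's actual API; the point is simply that small batches are buffered up to a byte target and then emitted as one combined batch, so each chunked write stays reasonably large without unbounded memory growth.

```python
def coalesce_batches(batches, target_bytes):
    """Yield combined batches whose sizes approach target_bytes.

    `batches` is any iterable of byte strings, standing in for
    incoming batches; `len(batch)` stands in for a batch's
    device-memory footprint. Hypothetical sketch, not plugin code.
    """
    pending = []        # small batches buffered so far
    pending_bytes = 0
    for batch in batches:
        size = len(batch)
        # If adding this batch would exceed the target, flush what we
        # have as one combined batch (one chunked write / row group).
        if pending and pending_bytes + size > target_bytes:
            yield b"".join(pending)
            pending, pending_bytes = [], 0
        pending.append(batch)
        pending_bytes += size
    if pending:
        yield b"".join(pending)  # flush the remainder
```

With a cap like this, 20,000 one-byte batches and a 4 KiB target would produce a handful of writes instead of 20,000 tiny row groups, while the buffer never holds much more than `target_bytes` at a time.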
I ran into some very bad performance due to the lack of coalescing before a write. Granted, this is a contrived case I was using to test something else, but it shows that the problem can occur. Example to reproduce:
```
spark.range(100000).repartition(20000).coalesce(1).write.mode("overwrite").parquet("/tmp/output.parquet")
```