Ken, Wang
> > Can we make the `GroupState` and the Accumulator states serializable? With this approach, we do not need to do any sort when spilling data to disk. And...
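To illustrate the suggestion above, here is a minimal sketch of what a serializable group state could look like, so it can be written to a spill file and read back without any sorting. All names here (`GroupState`, `to_bytes`, `from_bytes`, the `count`/`sum` fields) are hypothetical and not the actual DataFusion API:

```rust
// Hypothetical sketch: a group state with a fixed byte layout, so it
// can be spilled to disk and restored later without sorting.
#[derive(Debug, PartialEq)]
struct GroupState {
    group_key: Vec<u8>, // encoded group-by key
    count: u64,         // accumulator state for a COUNT aggregate
    sum: i64,           // accumulator state for a SUM aggregate
}

impl GroupState {
    /// Encode the state as: key length (u32) | key bytes | count | sum.
    fn to_bytes(&self) -> Vec<u8> {
        let mut buf = Vec::new();
        buf.extend_from_slice(&(self.group_key.len() as u32).to_le_bytes());
        buf.extend_from_slice(&self.group_key);
        buf.extend_from_slice(&self.count.to_le_bytes());
        buf.extend_from_slice(&self.sum.to_le_bytes());
        buf
    }

    /// Decode a state previously produced by `to_bytes`.
    fn from_bytes(buf: &[u8]) -> GroupState {
        let key_len = u32::from_le_bytes(buf[0..4].try_into().unwrap()) as usize;
        let group_key = buf[4..4 + key_len].to_vec();
        let rest = &buf[4 + key_len..];
        let count = u64::from_le_bytes(rest[0..8].try_into().unwrap());
        let sum = i64::from_le_bytes(rest[8..16].try_into().unwrap());
        GroupState { group_key, count, sum }
    }
}

fn main() {
    // Round-trip a state through its byte representation.
    let state = GroupState { group_key: b"us-east".to_vec(), count: 42, sum: -7 };
    let restored = GroupState::from_bytes(&state.to_bytes());
    assert_eq!(state, restored);
    println!("round-trip ok");
}
```

The point of the fixed layout is that spilled states can be merged back group-by-group on read, rather than requiring a sort pass before spilling.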
@liukun4515
I think there are pros and cons. The good part is that each partition/task will have a relatively small plan to deserialize, especially if the SQL includes lots of UNIONs...
If we just want to cancel tasks early and protect the system from heavy queries/heavy scans, I think we do not need to introduce the Accumulator. Spark's Accumulator and Metrics system is very...
There is one exception: the final query results are also a kind of shuffle file, so we cannot delete those shuffle files after the job finishes. I think we need...
> It might also be better to remove files after the next stage finishes instead of waiting on the job to finish? Should help with disk consumption for very large jobs....
Ah, this is a known issue; actually I added the failed-job check in the pop_next_task() loop. Without such a check, the scheduler loop will keep trying to schedule the pending tasks...
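The failed-job guard described above can be sketched roughly as follows. This is an illustrative toy, not the actual Ballista scheduler code; `Task`, `JobStatus`, and the struct layout are all assumptions, with only the `pop_next_task()` name taken from the comment:

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Clone, Copy, PartialEq)]
enum JobStatus { Running, Failed }

struct Task { job_id: u64, task_id: u64 }

struct Scheduler {
    pending: VecDeque<Task>,          // pending task queue
    jobs: HashMap<u64, JobStatus>,    // job id -> current status
}

impl Scheduler {
    /// Pop the next schedulable task, dropping tasks that belong to
    /// failed jobs instead of retrying them forever. Without this
    /// check the loop would keep rescheduling tasks of a dead job.
    fn pop_next_task(&mut self) -> Option<Task> {
        while let Some(task) = self.pending.pop_front() {
            match self.jobs.get(&task.job_id) {
                Some(JobStatus::Failed) => continue, // discard the task
                _ => return Some(task),
            }
        }
        None
    }
}

fn main() {
    let mut s = Scheduler {
        pending: VecDeque::from(vec![
            Task { job_id: 1, task_id: 10 },
            Task { job_id: 2, task_id: 20 },
        ]),
        jobs: HashMap::from([(1, JobStatus::Failed), (2, JobStatus::Running)]),
    };
    // Job 1 has failed, so its task is skipped and job 2's task comes out.
    let next = s.pop_next_task().unwrap();
    assert_eq!(next.job_id, 2);
    println!("scheduled task {} of job {}", next.task_id, next.job_id);
}
```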
@yahoNanJing
Could you please share the SQL to reproduce the issue?
OK, if it is not a bug, I think you can close the issue.