Ken, Wang
> > Can we make the `GroupState` and the Accumulator states serializable? With this approach, we do not need to do any sort when spilling data to disk. And...
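To illustrate the suggestion above, here is a minimal sketch of what a serializable group state could look like, so it can be written to a spill file and read back without any sorting. All names here (`GroupState`, `to_bytes`, `from_bytes`, the `count`/`sum` fields) are hypothetical and not the actual DataFusion API:

```rust
// Hypothetical sketch: a group state with a fixed byte layout, so it
// can be spilled to disk and restored later without sorting.
#[derive(Debug, PartialEq)]
struct GroupState {
    group_key: Vec<u8>, // encoded group-by key
    count: u64,         // accumulator state for a COUNT aggregate
    sum: i64,           // accumulator state for a SUM aggregate
}

impl GroupState {
    /// Encode the state as: key length (u32) | key bytes | count | sum.
    fn to_bytes(&self) -> Vec<u8> {
        let mut buf = Vec::new();
        buf.extend_from_slice(&(self.group_key.len() as u32).to_le_bytes());
        buf.extend_from_slice(&self.group_key);
        buf.extend_from_slice(&self.count.to_le_bytes());
        buf.extend_from_slice(&self.sum.to_le_bytes());
        buf
    }

    /// Decode a state previously produced by `to_bytes`.
    fn from_bytes(buf: &[u8]) -> GroupState {
        let key_len = u32::from_le_bytes(buf[0..4].try_into().unwrap()) as usize;
        let group_key = buf[4..4 + key_len].to_vec();
        let rest = &buf[4 + key_len..];
        let count = u64::from_le_bytes(rest[0..8].try_into().unwrap());
        let sum = i64::from_le_bytes(rest[8..16].try_into().unwrap());
        GroupState { group_key, count, sum }
    }
}

fn main() {
    // Round-trip a state through its byte representation.
    let state = GroupState { group_key: b"us-east".to_vec(), count: 42, sum: -7 };
    let restored = GroupState::from_bytes(&state.to_bytes());
    assert_eq!(state, restored);
    println!("round-trip ok");
}
```

The point of the fixed layout is that spilled states can be merged back group-by-group on read, rather than requiring a sort pass before spilling.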
@liukun4515
I think there are pros and cons. The good part is that each partition/task will have a relatively small plan to deserialize, especially if the SQL includes lots of UNIONs...
If we just want to cancel tasks early and protect the system from heavy queries/heavy scans, I think we do not need to introduce the Accumulator. Spark's Accumulator and Metrics system is very...
There is one exception: the final query results are also a kind of shuffle file, so we cannot delete those shuffle files after the job finishes. I think we need...
> It might also be better to remove files after the next stage finishes instead of waiting on the job to finish? Should help with disk consumption for very large jobs....
Ah, this is a known issue; actually I added the failed-job check in the pop_next_task() loop. Without such a check, the scheduler loop will keep trying to schedule the pending tasks...
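The failed-job guard described above can be sketched roughly as follows. This is an illustrative toy, not the actual Ballista scheduler code; `Task`, `JobStatus`, and the struct layout are all assumptions, with only the `pop_next_task()` name taken from the comment:

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Clone, Copy, PartialEq)]
enum JobStatus { Running, Failed }

struct Task { job_id: u64, task_id: u64 }

struct Scheduler {
    pending: VecDeque<Task>,          // pending task queue
    jobs: HashMap<u64, JobStatus>,    // job id -> current status
}

impl Scheduler {
    /// Pop the next schedulable task, dropping tasks that belong to
    /// failed jobs instead of retrying them forever. Without this
    /// check the loop would keep rescheduling tasks of a dead job.
    fn pop_next_task(&mut self) -> Option<Task> {
        while let Some(task) = self.pending.pop_front() {
            match self.jobs.get(&task.job_id) {
                Some(JobStatus::Failed) => continue, // discard the task
                _ => return Some(task),
            }
        }
        None
    }
}

fn main() {
    let mut s = Scheduler {
        pending: VecDeque::from(vec![
            Task { job_id: 1, task_id: 10 },
            Task { job_id: 2, task_id: 20 },
        ]),
        jobs: HashMap::from([(1, JobStatus::Failed), (2, JobStatus::Running)]),
    };
    // Job 1 has failed, so its task is skipped and job 2's task comes out.
    let next = s.pop_next_task().unwrap();
    assert_eq!(next.job_id, 2);
    println!("scheduled task {} of job {}", next.task_id, next.job_id);
}
```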
@yahoNanJing
Could you please share the SQL to reproduce the issue?
OK, if it is not a bug, I think you can close the issue.