OPTIMIZE jobs aren't fully parallelized until the end of the execution
When running the OPTIMIZE command I noticed that the execution starts with a number of concurrent jobs equal to optimize.maxThreads, but this number decreases until only a single job is running at a time.
I don't know whether there is something I can change on my side to keep all cores busy and run the expected number of jobs until the end.
I tried this with various Spark configurations (spark.cores.max = 30/100/400 and driver cores = 10/20), and it happens every time. When allocating 100/400 cores the job count doesn't drop all the way to 1, but it still drops to around ~10.
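For context, a rough sketch of how I was invoking it (app name, master URL, and table path are placeholders, and the config values are just one of the combinations above; driver cores were passed via spark-submit):

```scala
import org.apache.spark.sql.SparkSession

// Placeholder app name and table path; spark.cores.max = 100 is one of the
// combinations tried above. Driver cores were set via spark-submit.
val spark = SparkSession.builder()
  .appName("delta-optimize-repro")
  .config("spark.cores.max", "100")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// Trigger file compaction; the number of concurrently running Spark jobs
// starts near optimize.maxThreads and then tails off as described above.
spark.sql("OPTIMIZE delta.`/path/to/table`")
```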
- Delta Lake version: 1.2.1
- Spark version: 3.2.1
- Scala version: 2.12.15
I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
I've noticed this as well. I think it is due to using ParVector to parallelize the jobs. It seems to split the jobs into n/maxThreads groups up front and then execute them, so when one of those groups finishes, its thread doesn't pick up not-yet-started jobs from the other groups. I tried to fix this as part of https://github.com/delta-io/delta/pull/979 but couldn't get deterministic test behavior to work. It probably needs to be a queue-based approach instead.
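To illustrate the difference, here is a minimal sketch (not the actual OPTIMIZE code; runBin, the bin list, and the pool sizes are made up) contrasting the ParVector pattern with a queue-based pool where idle threads keep pulling pending jobs:

```scala
import java.util.concurrent.{Executors, ForkJoinPool}
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object OptimizeSchedulingSketch {

  // Hypothetical stand-in for running one OPTIMIZE compaction bin.
  def runBin(bin: Int): Unit = {
    Thread.sleep(100L * (bin % 7)) // simulate uneven job durations
  }

  // Pattern described above (as I understand it): a ParVector backed by a
  // bounded ForkJoinPool. Work is pre-split into chunks, so a thread that
  // finishes its chunk early doesn't pick up not-yet-started bins from other
  // chunks, and concurrency tails off toward the end of the run.
  def parVectorStyle(bins: Vector[Int], maxThreads: Int): Unit = {
    val par = bins.par
    par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(maxThreads))
    par.foreach(runBin)
  }

  // Queue-based alternative: submit each bin as its own task to a fixed pool.
  // Idle threads keep pulling pending bins from the shared queue, so all
  // maxThreads workers stay busy until no bins are left.
  def queueStyle(bins: Vector[Int], maxThreads: Int): Unit = {
    val pool = Executors.newFixedThreadPool(maxThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      Await.result(Future.sequence(bins.map(b => Future(runBin(b)))), Duration.Inf)
    } finally {
      pool.shutdown()
    }
  }
}
```

With the queue-based variant, every worker stays busy until the shared queue is drained, which is the behavior the OPTIMIZE run seems to be missing toward the end.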
As a workaround I just set maxThreads to be greater than spark.cores.max (2x, for example), which leads to a wider distribution of the workload. It's not great, but it has improved things a bit.
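Concretely, something like this before running OPTIMIZE (values are illustrative for a cluster with spark.cores.max = 100; I'm assuming the full config key is the "spark.databricks.delta."-prefixed form of optimize.maxThreads):

```scala
// Workaround sketch: oversubscribe the OPTIMIZE thread pool relative to the
// available cores (illustrative: 100 cores -> 200 threads). Full key name
// assumed from the usual "spark.databricks.delta." prefix.
spark.conf.set("spark.databricks.delta.optimize.maxThreads", "200")
spark.sql("OPTIMIZE delta.`/path/to/table`")  // placeholder table path
```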