
OPTIMIZE jobs aren't fully parallelized until the end of the execution

Open alexandrutofan opened this issue 3 years ago • 2 comments

When running the OPTIMIZE command I noticed that the execution starts with a number of concurrent jobs equal to optimize.maxThreads, but this number decreases over time until only a single job is running at a time.

I don't know whether there is something I can change on my side to keep all cores busy and keep the expected number of jobs running until the end.

I tried this with various Spark configurations (cores.max = 30/100/400 and driver-cores = 10/20), and it happens every time. When allocating 100/400 cores the concurrency doesn't drop all the way to 1, but it still drops to around ~10.
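For context, the runs were set up roughly like this (a minimal sketch; the table path is a placeholder, and I'm assuming the standard Delta SQL conf key for the OPTIMIZE thread count):

```scala
// Sketch of the setup (table path is a placeholder).
// optimize.maxThreads controls how many file-compaction jobs the
// driver submits in parallel during OPTIMIZE.
spark.conf.set("spark.databricks.delta.optimize.maxThreads", "15")

spark.sql("OPTIMIZE delta.`/path/to/table`")
```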

(Screenshot attached: 2022-06-21 at 16:44)
  • Delta Lake version: 1.2.1
  • Spark version: 3.2.1
  • Scala version: 2.12.15

I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.

alexandrutofan · Jun 21 '22

I've noticed this as well. I think it's due to using ParVector to parallelize the jobs: it seems to split the jobs into n/maxThreads groups up front and then execute each group, so when one group finishes, its thread won't pick up the not-yet-started jobs from other groups. I tried to fix this as part of https://github.com/delta-io/delta/pull/979 but couldn't get deterministic testing behavior to work. It probably needs a queue-based approach instead (see the sketch below).
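A minimal sketch of what I mean by queue-based, using plain Scala Futures over a fixed thread pool (simulated shape only, not the actual Delta code; `runJobs` and `work` are hypothetical names):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Queue-based alternative to ParVector chunking: one Future per job on a
// fixed pool, so each thread that finishes immediately pulls the next
// pending job from the pool's shared queue instead of idling once its
// pre-assigned chunk is done.
def runJobs[A, B](jobs: Seq[A], maxThreads: Int)(work: A => B): Seq[B] = {
  val pool = Executors.newFixedThreadPool(maxThreads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val futures = jobs.map(job => Future(work(job)))
    Await.result(Future.sequence(futures), Duration.Inf)
  } finally pool.shutdown()
}
```

With this shape, one long-running job only occupies a single thread while the remaining threads keep draining the queue, which should avoid the tail-off in concurrency described above.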

Kimahriman · Jun 21 '22

As a workaround I set maxThreads to be greater than spark.cores.max (2x, for example), which spreads the workload more evenly. It's not ideal, but it has improved things a bit.
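Concretely, something like this (the values are just examples for a 30-core allocation, and the conf key is the one I assume from the Delta docs):

```scala
// Workaround sketch: with spark.cores.max = 30, oversubscribe the
// OPTIMIZE thread pool (~2x) so each ParVector chunk is smaller and
// more jobs stay in flight during the tail of the run.
spark.conf.set("spark.databricks.delta.optimize.maxThreads", "60")
```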

alexandrutofan · Jul 28 '22