
`spark.comet.memory.overhead.min` not respected when submitting jobs with Comet with Spark on Kubernetes

Open lmouhib opened this issue 1 year ago • 9 comments

Describe the bug

Currently, when submitting a job on Kubernetes, the total memory of the driver or executor pod is the sum of the memory defined in the Spark configuration and the overhead (`spark.{executor|driver}.memoryOverhead` + `spark.{executor|driver}.memory`). This does not include the additional overhead implied by the default value of `spark.comet.memory.overhead.factor`.
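For concreteness, here is a minimal Scala sketch of the arithmetic being described. The memory values are made-up examples, and the 0.2 factor / 384 MiB minimum are assumed Comet defaults, hard-coded here rather than read from a real SparkConf:

```scala
// Sketch (not Spark/Comet source): how the executor pod's memory request
// is assembled today vs. what it would be with Comet overhead included.
object MemorySketch {
  // Example Spark settings, expressed in MiB
  val executorMemoryMb = 4096L // spark.executor.memory
  val sparkOverheadMb  = 409L  // spark.executor.memoryOverhead (~10% default)

  // Assumed Comet defaults
  val cometOverheadFactor = 0.2  // spark.comet.memory.overhead.factor
  val cometOverheadMinMb  = 384L // spark.comet.memory.overhead.min

  // Comet overhead: factor * executor memory, floored at the configured minimum
  val cometOverheadMb: Long =
    math.max((executorMemoryMb * cometOverheadFactor).toLong, cometOverheadMinMb)

  // What Kubernetes currently requests for the pod (Comet overhead missing)
  val podMemoryWithoutComet: Long = executorMemoryMb + sparkOverheadMb

  // What it would request once the Comet overhead is folded in
  val podMemoryWithComet: Long = podMemoryWithoutComet + cometOverheadMb
}
```

With these example values the pod is sized at 4505 MiB, while the Comet-aware total would be 5324 MiB; the 819 MiB gap is the overhead the issue is about.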

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

lmouhib avatar Jun 28 '24 15:06 lmouhib

That's true, thanks for raising this. Spark Kubernetes pods respect the Spark params but have no idea about the Comet params for now, and once a pod's memory is allocated it's not possible to change it. Since Comet has no control over the Spark params, I'm inclined to fold `spark.comet.memory.overhead.factor` into Spark's `memoryOverhead`, but I'm open to other suggestions.

comphead avatar Jun 28 '24 16:06 comphead

I am also inclined to include it in the Spark overhead; however, I am not sure I understand why it's not taken into consideration here. That line of code should produce the new value of the memory overhead, which is then passed to Spark and the resource manager to build the pod template used to create the pod.

lmouhib avatar Jun 28 '24 17:06 lmouhib

I think what might happen is that the executor pod has already started by the time Comet tries to tweak the memory, and once the pod is up it's not possible to change the allocated memory size. What Comet does is correct, but I have a feeling it should happen earlier.

comphead avatar Jun 28 '24 20:06 comphead

If CometDriverPlugin is loaded, it should overwrite the executor overhead with the existing executor overhead plus the Comet overhead.
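The overwrite described above could look roughly like this. This is a sketch, not the actual CometDriverPlugin code: the Map-based conf and the `.mb` key names are simplifications (real Spark configs use size strings like `4g`), and the factor/min defaults are assumptions:

```scala
// Sketch of the proposed overwrite: bump the executor memoryOverhead by
// the computed Comet overhead before executors are requested.
object OverheadOverwrite {
  // Assumed Comet defaults: overhead = max(factor * executor memory, minimum)
  def cometOverheadMb(executorMemoryMb: Long,
                      factor: Double = 0.2,
                      minMb: Long = 384L): Long =
    math.max((executorMemoryMb * factor).toLong, minMb)

  // Given a simplified conf (MiB values as strings under hypothetical keys),
  // return the overhead the driver plugin should write back.
  def adjustedOverheadMb(conf: Map[String, String]): Long = {
    val executorMemoryMb = conf.getOrElse("spark.executor.memory.mb", "1024").toLong
    val existingOverhead = conf.getOrElse("spark.executor.memoryOverhead.mb", "384").toLong
    existingOverhead + cometOverheadMb(executorMemoryMb)
  }
}
```

The key point is ordering: this adjustment has to run on the driver before the Kubernetes resource builder renders the pod spec, since the pod's memory cannot be changed afterwards.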

viirya avatar Jun 29 '24 00:06 viirya

We need to document CometDriverPlugin; it's not mentioned on the configurations page or anywhere else in the documentation.

I did set `spark.plugins` to `org.apache.spark.CometPlugin`, and still did not observe any change in the executor container memory configuration.

lmouhib avatar Jun 30 '24 19:06 lmouhib

I'll run some tests on this today

comphead avatar Jul 01 '24 15:07 comphead

How is the documentation for the configuration generated? The only mention of the plugin is in the code base itself. I can open a PR to add the conf, if it's maintained in the documentation itself and not auto-generated.

lmouhib avatar Jul 01 '24 17:07 lmouhib

Depends on https://github.com/apache/datafusion-comet/issues/643

comphead avatar Jul 08 '24 18:07 comphead

Depends on #689

comphead avatar Jul 19 '24 18:07 comphead