How can Comet be enabled by default without needing to configure memory?
What is the problem the feature request solves?
One of the challenges in getting users to switch to Comet is that it is typically necessary to specify how much memory Comet can use. Memory tuning will be explained in more detail in the Tuning Guide once https://github.com/apache/datafusion-comet/pull/1525 is merged.
It would be nice if there were a way that Comet could be enabled by default and choose sensible memory defaults.
Currently, when running in on-heap mode, we allocate an additional 20% of executor memory to Comet, but Comet may need the same amount of memory as Spark in some cases. We could change the default to 100%, but then we are doubling the amount of memory, and this may not be ideal.
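To make the trade-off concrete, here is a minimal sketch (not Comet's actual implementation; the method and parameter names are illustrative) of how an overhead default like the 20% described above translates into executor memory:

```java
// Hedged sketch: how an on-heap memory overhead default might be computed.
// The 0.2 factor reflects the current 20% default discussed in this issue;
// everything else (names, signatures) is illustrative.
public class CometMemorySketch {
    // Current default discussed above: 20% of executor memory.
    public static final double DEFAULT_OVERHEAD_FACTOR = 0.2;

    // Returns the extra memory (in MB) allocated to Comet for a given
    // executor size and overhead factor.
    public static long cometOverheadMb(long executorMemoryMb, double factor) {
        return (long) (executorMemoryMb * factor);
    }

    public static void main(String[] args) {
        // 20% of an 8 GiB executor.
        System.out.println(cometOverheadMb(8192, DEFAULT_OVERHEAD_FACTOR));
        // A hypothetical 100% default would double the total memory footprint.
        System.out.println(cometOverheadMb(8192, 1.0));
    }
}
```

With a 100% factor, an 8 GiB executor would request another 8 GiB for Comet, which illustrates why raising the default is not obviously the right fix.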
To summarize: if we allocate too little memory, jobs may fail with OOM; if we allocate too much, it is wasteful and may prevent jobs from running due to lack of resources in a cluster.
NVIDIA faces the same challenge with Spark RAPIDS and just announced Project Aether to solve this problem for their customers.
Let's use this issue to discuss ideas for an approach for Comet.
Describe the potential solution
No response
Additional context
No response
> We could change the default to 100%, but then we are doubling the amount of memory, and this may not be ideal.
When off-heap memory is enabled, we will use the unified memory manager. Does that mean the amount of memory will not be doubled?
> When off-heap memory is enabled, we will use the unified memory manager. Does that mean the amount of memory will not be doubled?
Our current implementation automatically allocates extra memory when running Spark in on-heap mode if the user does not explicitly specify spark.comet.memoryOverhead. We do nothing like this for off-heap mode: the user must explicitly set the amount of memory to be shared between Spark and Comet via spark.memory.offHeap.size.
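For reference, an off-heap configuration currently looks something like the following (the jar path and the 8g size are illustrative; spark.memory.offHeap.enabled/size are standard Spark settings, and spark.plugins=org.apache.spark.CometPlugin is how Comet is enabled):

```
spark-submit \
  --jars comet-spark.jar \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=8g \
  ...
```

Under this mode, Spark and Comet draw from the single off-heap pool, which is the setting a "sensible default" would need to choose automatically.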
My opinion is that auto-tuning Spark configuration (with or without Comet) requires running the jobs with different configs and learning from the impact. I don't think that we can simply pick a magic multiplier that works in every case.
Here is Microsoft's solution for reference:
https://learn.microsoft.com/en-us/fabric/data-engineering/autotune?tabs=sparksql