datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

Implement native version of ColumnarToRow

Open andygrove opened this issue 1 year ago • 6 comments

What is the problem the feature request solves?

Spark's ColumnarToRowExec can be very slow in some cases. This plan shows that C2R took 8 minutes even though the underlying scan only took 20 seconds. This example is selecting decimal columns, so this is related to https://github.com/apache/datafusion-comet/issues/670 and https://github.com/apache/datafusion-comet/issues/679, but implementing a native version of ColumnarToRowExec may help with this.

image (3)

Describe the potential solution

No response

Additional context

No response

andygrove avatar Jul 23 '24 18:07 andygrove

I created a Google document to discuss the design.

https://docs.google.com/document/d/1zNuavf_WT3IcpeTVAEC8IjMGloi1MeAeQSg4F0eEivs/edit?usp=sharing

andygrove avatar Aug 15 '24 23:08 andygrove

I created https://github.com/apache/datafusion-comet/issues/837 for the very first step in this process

andygrove avatar Aug 16 '24 20:08 andygrove

Here is a like for like comparison between Spark and Comet for Scan+C2R for the optimized version of q72.

Spark

Spark C2R takes 1.1 minutes

Screenshot from 2024-08-22 05-15-34

Comet

Comet C2R takes 1.3 minutes

Screenshot from 2024-08-22 05-30-51

andygrove avatar Aug 22 '24 11:08 andygrove

There is some spill happening, even though I am allocating 20g memory overhead:

Screenshot from 2024-08-22 07-52-14

andygrove avatar Aug 22 '24 13:08 andygrove

@parthchandra I know that you are working on this so I tried assigning the issue to you, but your name does not show up for me for some reason. I wondered if pinging you here would change that.

edit: No, that did not help

andygrove avatar Sep 25 '24 17:09 andygrove

Posting a reply in case it helps associate the issue somehow. Anyhow, confirming that I am indeed working on this. In that context, I am initially planning to only add support for fixed width primitive types. Then add decimals and varchar. Estimating the sizes of complex types is somewhat complicated, so I will defer that until we see actual gains from this effort.

parthchandra avatar Sep 25 '24 17:09 parthchandra

I am closing this issue for now because I believe that we determined that this is no longer a priority. We can reopen the issue if this changes.

andygrove avatar Feb 21 '25 19:02 andygrove