datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

Support complex datatypes in Comet Scan

Open mattwparas opened this issue 1 year ago • 1 comments

What is the problem the feature request solves?

As of right now, only primitives are supported for parquet scan, and if any non primitives are detected in a sink node, comet will bail out of performing any transformations.

It would be great if Comet were able to handle relatively simple complex data types, like those supported for shuffle found here. Nested structs or maps from primitives to structs would also be helpful, but I'm not sure on the relative complexity past flat complex types.

Even more complex data types past this would also be helpful, but at a minimum supporting these would enable comet to perform optimizations on the current set of spark jobs that I'm working with.

Describe the potential solution

Comet is able to lower spark operations to native operations when the schema contains complex data types. As a start, relatively complex data types such as those supported for shuffle would be great. This includes arrays of primitives, maps with primitives, and structs with primitives.

Additional context

To help guide the implementation, knowing what the difference is between a type being supported in parquet scan versus within shuffle would be helpful - at least understanding why certain types can be used in different operations at a high level.

mattwparas avatar May 15 '24 17:05 mattwparas

Actually Comet columnar shuffle already supports some complex data types. You can find some tests using complex types in Comet shuffle test suites.

But Comet scan operator doesn't support complex types now. So you cannot read data of complex types from Parquet and do native operations on it. I think currently we also don't add any native expression which can produce output of complex types.

viirya avatar May 15 '24 17:05 viirya

@mattwparas We are now actively working on supporting reading complex types from Parquet. We have this partially working in main behind feature flags, and the epic to track this work is https://github.com/apache/datafusion-comet/issues/1043

andygrove avatar Jan 31 '25 16:01 andygrove

awesome, thank you for the update!

mattwparas avatar Feb 01 '25 17:02 mattwparas