datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Consolidate datafusion, arrow-cpp, and substrait's handling of non-substrait arrow types

Open westonpace opened this issue 1 year ago • 2 comments

Is your feature request related to a problem or challenge?

I am trying to take a filter expression created by pyarrow and convert it into a filter expression for Datafusion to satisfy. I am using Substrait to do this. Everything works fine when I use the standard Substrait types. However, when I use normal Arrow types that are not Substrait types (e.g. unsigned integers, large containers) I run into problems.

It seems that arrow-cpp (admittedly, me, in this case) and datafusion have taken different approaches to handling these limitations.

In arrow-cpp the types that expand or change the valid range of values (e.g. unsigned integers, large containers) are converted to extension types. This process is documented in https://github.com/apache/arrow/blob/main/format/substrait/extension_types.yaml

In datafusion it appears these types are expected to use the nearest substrait match (e.g. signed integer, small container) with a type variation.

Describe the solution you'd like

I am admittedly biased (given I implemented one of the two disagreeing components) but I favor the extension types approach. Type variations are defined in Substrait as this:

Type variations may be used to represent differences in representation between different consumers. For example, an engine might support dictionary encoding for a string, or could be using either a row-wise or columnar representation of a struct. All variations of a type are expected to have the same semantics when operated on by functions or other expressions.

Given that definition, I do not think it is valid to say that an unsigned integer is a variation of a signed integer (they do not have the same outputs for all functions). I do believe things like the view types and dictionary encoding are valid type variations.

Describe alternatives you've considered

The alternative would be to change arrow-cpp to also use type variations. Though I'd like some consensus from the Substrait community that this is a valid use of type variations before taking that approach.

At the moment I am working around this issue by simply removing any non-standard types from the input schema (this works as long as the filter isn't referencing those types).

Additional context

No response

westonpace avatar Aug 26 '24 19:08 westonpace

cc @Blizzara and @waynexia

alamb avatar Aug 27 '24 13:08 alamb

I haven't thought much about these, as they're not relevant for our project. That said the variations have always felt a bit off to me, and given the cited paragraph from Substrait, moving from variations into extension types sounds like it could be right approach.

If we go that way, I wonder if it'd make sense to define those extension types somewhere centrally (in substrait or arrow repo?) rather than in DF, to help with interop for at least Arrow-based systems.

Blizzara avatar Aug 27 '24 15:08 Blizzara