datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Support substrait serialization for `ScalarValue::Utf8View` and `ScalarValue::BinaryView`

Open alamb opened this issue 1 year ago • 5 comments

Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/11752

We are trying to enable DataFusion to use StringViewArray by default. If we do that it means ScalarValue::Utf8View and ScalarValue::BinaryView will be more likely to be used in plans

Describe the solution you'd like

Thus we need to ensure ScalarValue::Utf8View and ScalarValue::BinaryView can be serialized using datafusion substrait

Describe alternatives you've considered

I recommend adding coverage for ScalarValue::Utf8View and `ScalarValue::BinaryView to the tests here

https://github.com/apache/datafusion/blob/b4069a65a9bb207370d382bdde93f1c98d69b9eb/datafusion/substrait/src/logical_plan/producer.rs#L2283-L2284

And then update the code to get the tests to pass

Additional context

No response

alamb avatar Aug 22 '24 18:08 alamb

@XiangpengHao has a draft pr here https://github.com/apache/datafusion/pull/11898

alamb avatar Aug 22 '24 18:08 alamb

I think @wiedld is planning to look at this soon -- maybe @Blizzara will have some ideas if we run into how to map the types to/from the substrait types

alamb avatar Aug 23 '24 17:08 alamb

take

wiedld avatar Aug 23 '24 18:08 wiedld

There is a long discussion over here about the type system in substrait. The summary outcome (also reflected in the spec here) is that we have logical type "string" and different variations for the physical type (e.g. uft8 vs largeutf8 vs utf8view).

I've passing tests using this approach; want to check code coverage on a few things before putting up the PR (a.k.a. it looks like the largeutf8 variation was not fully implementation everywhere -- so I want to make sure we have full test coverage).

wiedld avatar Aug 27 '24 01:08 wiedld

There is a long discussion over here about the type system in substrait. The summary outcome (also reflected in the spec here) is that we have logical type "string" and different variations for the physical type (e.g. uft8 vs largeutf8 vs utf8view).

Yup, this aligns with my understanding - the PR looks good from cursory review!

Blizzara avatar Aug 28 '24 13:08 Blizzara