datafusion
datafusion copied to clipboard
Support substrait serialization for `ScalarValue::Utf8View` and `ScalarValue::BinaryView`
Is your feature request related to a problem or challenge?
Part of https://github.com/apache/datafusion/issues/11752
We are trying to enable DataFusion to use StringViewArray by default. If we do that it means ScalarValue::Utf8View and ScalarValue::BinaryView will be more likely to be used in plans
Describe the solution you'd like
Thus we need to ensure ScalarValue::Utf8View and ScalarValue::BinaryView can be serialized using datafusion substrait
Describe alternatives you've considered
I recommend adding coverage for ScalarValue::Utf8View and `ScalarValue::BinaryView to the tests here
https://github.com/apache/datafusion/blob/b4069a65a9bb207370d382bdde93f1c98d69b9eb/datafusion/substrait/src/logical_plan/producer.rs#L2283-L2284
And then update the code to get the tests to pass
Additional context
No response
@XiangpengHao has a draft pr here https://github.com/apache/datafusion/pull/11898
I think @wiedld is planning to look at this soon -- maybe @Blizzara will have some ideas if we run into how to map the types to/from the substrait types
take
There is a long discussion over here about the type system in substrait. The summary outcome (also reflected in the spec here) is that we have logical type "string" and different variations for the physical type (e.g. uft8 vs largeutf8 vs utf8view).
I've passing tests using this approach; want to check code coverage on a few things before putting up the PR (a.k.a. it looks like the largeutf8 variation was not fully implementation everywhere -- so I want to make sure we have full test coverage).
There is a long discussion over here about the type system in substrait. The summary outcome (also reflected in the spec here) is that we have logical type "string" and different variations for the physical type (e.g. uft8 vs largeutf8 vs utf8view).
Yup, this aligns with my understanding - the PR looks good from cursory review!