Convert `Utf8View`/`BinaryView` --> `Utf8` / `Binary` at output
Is your feature request related to a problem or challenge?
Part of https://github.com/apache/datafusion/issues/11752
We are trying to change DataFusion to use StringViewArray by default when reading parquet (and in other places where it makes sense, such as the substr function). StringView enables many interesting optimization opportunities. However, StringView is still being adopted across the rest of the Arrow ecosystem, so if DataFusion begins to emit StringViewArray in some places, it may cause issues with other parts of the ecosystem (e.g. Flight clients may not be able to interpret data sent by a server using DataFusion).
Describe the solution you'd like
I would like DataFusion to retain maximum compatibility at the interfaces, but be able to use StringViewArray internally when it improves performance
Describe alternatives you've considered
I recommend a config flag that makes it possible to convert Utf8View/BinaryView --> Utf8 / Binary at the query output and I think this conversion should be done by default.
For example we might add this configuration flag:
datafusion.optimizer.expand_views_at_output=true
If this flag is true:
- add code in the Analyzer (maybe in the `TypeCoercion` code)
- check the output columns of a plan, and if any are `DataType::Utf8View` or `DataType::BinaryView`, add a `ProjectionExec` that converts them to Utf8/Binary (by adding a cast to `DataType::Utf8` or `DataType::Binary` respectively)
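The rule described in the list above can be sketched in plain Rust. This is only an illustration of the idea, not DataFusion code: the `DataType` and `OutputExpr` enums, the `expanded_type` and `output_projection` functions, and the boolean parameter standing in for the proposed flag are all hypothetical stand-ins that use no `arrow` or `datafusion` APIs.

```rust
// Simplified stand-ins for Arrow data types; NOT the real `arrow` crate API.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Utf8,
    Utf8View,
    Binary,
    BinaryView,
    Int64,
}

// Hypothetical output expression: a plain column, or a cast wrapping a column.
#[derive(Debug, Clone, PartialEq)]
pub enum OutputExpr {
    Column(String),
    Cast { column: String, to: DataType },
}

/// Map a view type to its non-view equivalent; `None` means no cast is needed.
pub fn expanded_type(dt: &DataType) -> Option<DataType> {
    match dt {
        DataType::Utf8View => Some(DataType::Utf8),
        DataType::BinaryView => Some(DataType::Binary),
        _ => None,
    }
}

/// Build the final projection for a plan's output schema.
/// `expand_views_at_output` models the proposed config flag (an assumption);
/// when it is false, every column passes through unchanged.
pub fn output_projection(
    schema: &[(String, DataType)],
    expand_views_at_output: bool,
) -> Vec<OutputExpr> {
    schema
        .iter()
        .map(|(name, dt)| match expanded_type(dt) {
            Some(to) if expand_views_at_output => OutputExpr::Cast {
                column: name.clone(),
                to,
            },
            _ => OutputExpr::Column(name.clone()),
        })
        .collect()
}
```

With the flag on, a schema of `[("s", Utf8View), ("n", Int64)]` yields a cast of `s` to `Utf8` and leaves `n` alone. With the flag off, both columns pass through untouched.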
Additional context
We already have to do something similar in flight with dictionary arrays
> I recommend a config flag that makes it possible to convert `Utf8View`/`BinaryView` --> `Utf8`/`Binary` at the query output and I think this conversion should be done by default.
Sounds reasonable.
One question -- do we need a flag though?
> One question -- do we need a flag though?
My rationale is that converting from Utf8View --> Utf8 is not free. Thus users should have the option of not paying the cost if they want performance over compatibility.
I will admit I don't have a specific use case in mind
BTW with this conversion in place, I think we could contemplate more interesting changes, like substr always outputting Utf8View even when the input was Utf8
> My rationale is that converting from Utf8View --> Utf8 is not free.

It is not, but transmitting non-compacted string views isn't free either.
Also, we're transitioning from a state where DF didn't use string views (eager compaction) to a state where DF uses string views (deferred compaction). Thus, if we convert to non-view types on output, we do not risk regressing anything. And we can revisit later whether returning view types could be an improvement (and introduce a flag, if needed & worth it).
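To make the trade-off concrete, here is a toy model in Rust of why sending non-compacted views has a cost of its own. It is a deliberately simplified sketch: real Arrow views are 16-byte structs that inline short strings and can reference several buffers, and `ViewColumn` and its methods are hypothetical names, not part of any crate.

```rust
/// Toy model of a string-view column: each view points into one shared buffer.
/// This ignores Arrow's inlining of short strings and multi-buffer layout.
pub struct ViewColumn {
    pub buffer: Vec<u8>,
    pub views: Vec<(usize, usize)>, // (offset, length) into `buffer`
}

impl ViewColumn {
    /// Data bytes that travel if the column is sent as-is: the whole buffer,
    /// even if a filter left only a few views pointing into it.
    pub fn bytes_if_sent_uncompacted(&self) -> usize {
        self.buffer.len()
    }

    /// Copy only the referenced bytes into a contiguous (offsets, data)
    /// layout, similar in spirit to `Utf8`. This copy is the conversion cost.
    pub fn to_contiguous(&self) -> (Vec<usize>, Vec<u8>) {
        let mut offsets = vec![0];
        let mut data = Vec::new();
        for &(start, len) in &self.views {
            data.extend_from_slice(&self.buffer[start..start + len]);
            offsets.push(data.len());
        }
        (offsets, data)
    }
}
```

In this model, a column whose buffer holds 14 bytes but whose single surviving view covers 4 of them would send 14 data bytes uncompacted and 4 after conversion, at the price of one copy.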
> Thus, if we convert to non-view types on output, we do not risk regressing anything.

Right -- I agree, which is why in my mind having the default be "Utf8" is important
> Thus, if we convert to non-view types on output, we do not risk regressing anything. And we can revisit later whether returning view types could be an improvement (and introduce a flag, if needed & worth it).
I would not be opposed to a PR that just hard codes the existing behavior (convert to Utf8 on output). I changed the title of this PR to reflect this.
I still think a config option offers the most flexibility but I do agree we could add it later if needed
@findepi -- are you already working on this ticket? If not, I would like to pick it up.
@wiedld awesome, go for it
take