
Convert `Utf8View`/`BinaryView` --> `Utf8` / `Binary` at output

Open alamb opened this issue 1 year ago • 8 comments

Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/11752

We are trying to change DataFusion to use StringViewArray by default when reading parquet (and in other places where it makes sense, such as the substr function), because StringView enables many interesting optimization opportunities. However, as StringView is still being adopted across the rest of the arrow ecosystem, if DataFusion begins to emit StringViewArray in some places, it may cause issues with other parts of the ecosystem (e.g. flight clients may not be able to interpret data sent by a server using DataFusion).

Describe the solution you'd like

I would like DataFusion to retain maximum compatibility at the interfaces, but be able to use StringViewArray internally when it improves performance

Describe alternatives you've considered

I recommend a config flag that makes it possible to convert Utf8View/BinaryView --> Utf8 / Binary at the query output and I think this conversion should be done by default.

For example we might add this configuration flag:

datafusion.optimizer.expand_views_at_output=true

If this flag is true,

  1. add code in the Analyzer (maybe in the TypeCoercion code)
  2. check the output columns of a plan, and if any are `DataType::Utf8View` or `DataType::BinaryView`, add a `ProjectionExec` that converts them to Utf8/Binary (by adding a cast to `DataType::Utf8` or `DataType::Binary` respectively)
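
The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `DataType` enum is a simplified stand-in for arrow's type enum, and `expanded_type`, `output_schema`, and the `expand_views_at_output` parameter are hypothetical names that model the proposed flag. None of this is DataFusion's actual API, and it computes only the resulting output types rather than building a plan node.

```rust
// Hypothetical, simplified stand-in for arrow's DataType (not the real enum).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
    Utf8View,
    Binary,
    BinaryView,
}

/// Returns the type a view column would be cast to at the output,
/// or `None` if the column needs no cast.
fn expanded_type(dt: &DataType) -> Option<DataType> {
    match dt {
        DataType::Utf8View => Some(DataType::Utf8),
        DataType::BinaryView => Some(DataType::Binary),
        _ => None,
    }
}

/// Computes the output column types after the proposed rewrite.
/// `expand_views_at_output` models the proposed config flag (an assumption).
fn output_schema(
    schema: &[(String, DataType)],
    expand_views_at_output: bool,
) -> Vec<(String, DataType)> {
    schema
        .iter()
        .map(|(name, dt)| {
            let out = if expand_views_at_output {
                // Cast view types; leave every other type untouched.
                expanded_type(dt).unwrap_or_else(|| dt.clone())
            } else {
                dt.clone()
            };
            (name.clone(), out)
        })
        .collect()
}
```

With the flag off, the schema passes through unchanged, which corresponds to users opting out of the conversion.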

Additional context

We already have to do something similar in flight with dictionary arrays

alamb avatar Aug 22 '24 18:08 alamb

> I recommend a config flag that makes it possible to convert Utf8View/BinaryView --> Utf8 / Binary at the query output and I think this conversion should be done by default.

Sounds reasonable.

One question -- do we need a flag though?

findepi avatar Aug 23 '24 20:08 findepi

> One question -- do we need a flag though?

My rationale is that converting from Utf8View --> Utf8 is not free. Thus users should have the option of not paying the cost if they want performance over compatibility.

I will admit I don't have a specific use case in mind.

alamb avatar Aug 24 '24 09:08 alamb

BTW with this conversion in place, I think we could contemplate more interesting changes, like substr always outputting Utf8View even when the input was Utf8

alamb avatar Aug 24 '24 09:08 alamb

> My rationale is that converting from Utf8View --> Utf8 is not free.

it is not but transmitting non-compacted string views isn't free either.

also, we're transitioning from a state where DF didn't use string views (eager compaction) to a state where DF uses string views (deferred compaction). Thus, if we convert to non-view types on output, we do not risk regressing anything. And we can revisit later whether returning view types could be an improvement (and introduce a flag, if needed & worth it).
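
A rough sketch of the trade-off being discussed, assuming a heavily simplified model: `View` here is a hypothetical `(start, len)` slice into a shared buffer (real Arrow views also carry an inline prefix and a buffer index), and `compact` is an illustrative name, not an arrow or DataFusion function. Converting to a Utf8-style layout copies every referenced byte, while skipping the conversion means carrying the whole backing buffer along, including bytes that no view references.

```rust
/// Hypothetical, simplified view: a (start, len) slice into a shared buffer.
#[derive(Debug, Clone, Copy)]
struct View {
    start: usize,
    len: usize,
}

/// Copies every viewed value into one contiguous buffer plus offsets,
/// i.e. a Utf8-style layout. The cost is proportional to the bytes referenced.
fn compact(buffer: &[u8], views: &[View]) -> (Vec<u8>, Vec<usize>) {
    let mut data = Vec::new();
    let mut offsets = vec![0];
    for v in views {
        data.extend_from_slice(&buffer[v.start..v.start + v.len]);
        // Each offset marks the end of the value just appended.
        offsets.push(data.len());
    }
    (data, offsets)
}
```

For a buffer such as `b"hello wonderful world"` with views over only `hello` and `world`, the compacted data is 10 bytes, whereas the uncompacted form still carries all 21 bytes of the backing buffer.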

findepi avatar Aug 24 '24 10:08 findepi

> Thus, if we convert to non-view types on output, we do not risk regressing anything.

Right -- I agree so this is why in my mind having the default be "Utf8" is important

> Thus, if we convert to non-view types on output, we do not risk regressing anything. And we can revisit later whether returning view types could be an improvement (and introduce a flag, if needed & worth it).

I would not be opposed to a PR that just hard codes the existing behavior (convert to Utf8 on output). I changed the title of this PR to reflect this.

I still think a config option offers the most flexibility but I do agree we could add it later if needed

alamb avatar Aug 26 '24 19:08 alamb

@findepi -- are you already working on this ticket? If not, I would like to pick it up.

wiedld avatar Aug 27 '24 18:08 wiedld

@wiedld awesome, go for it

findepi avatar Aug 27 '24 19:08 findepi

take

wiedld avatar Aug 28 '24 01:08 wiedld