datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

[Epic] Implement support for `StringView` in DataFusion

Open alamb opened this issue 1 year ago • 5 comments

Is your feature request related to a problem or challenge?

StringView / BinaryView were added to the Arrow format that make it more suitable for certain types of operations on strings. Specifically when filtering with string data, creating the output StringArray requires copying the strings to a new, packed binary buffer.

GenericByteViewArrary was designed to solve this limitation and the arrow-rs implementation, tracked by https://github.com/apache/arrow-rs/issues/5374, is now complete enough to start adding support into DataFusion

I think we can improve performance in certain cases by using StringView (this is also described in more details in the Pola.rs blog post)

  1. Reading strings / binary from Parquet files as StringViewArray/BinaryViewArray rather than StringArray / BinaryArray saves a copy (and @ariesdevil is quite close to having it integrated into the parquet reader https://github.com/apache/arrow-rs/issues/5530)
  2. Evaluating predicates on string expressions (for example substr(url, 4) = 'http') as the intermediate result of substr can be called without copying string values

Describe the solution you'd like

I would like to support StringView / BinaryView support in DataFusion.

While my primary usecase is for reading data from parquet, I think teaching DataFusion to use StringView (at least as intermediate values when evaluating expressions may help significantly

Development branch: string-view

Since this feature requires upstream arrow-rs support https://github.com/apache/arrow-rs/issues/5374 that is not yet released we plan to do development on a string-view feature branch:

https://github.com/apache/datafusion/tree/string-view

Task List

Here are some high level tasks (I can help flesh these out for anyone who is interested in helping)

  • [x] https://github.com/apache/datafusion/issues/10920
  • [x] https://github.com/apache/datafusion/issues/10919
  • [x] https://github.com/apache/datafusion/issues/10998
  • [ ] https://github.com/apache/datafusion/issues/11024
  • [x] #10996
  • [ ] https://github.com/apache/datafusion/issues/10921
  • [x] https://github.com/apache/datafusion/issues/10961
  • [ ] https://github.com/apache/datafusion/issues/11024
  • [ ] https://github.com/apache/datafusion/issues/11025
  • [ ] #11023
  • [ ] https://github.com/apache/arrow-rs/pull/5936
  • [ ] Add generic coercion rule from Utf8View -> Utf8 / BinaryView -> Binary for compatibility
  • [ ] File tickets for adding native Utf8View / BinaryView support to other functions which would benefit
  • [ ] Establish pattern for implementing optimized string functions for StringViewArray
  • [ ] Find queries where StringView is likely to help performance significantly
  • [ ] ...

Describe alternatives you've considered

No response

Additional context

Polars implemented it recently in rust so that can serve as a motivation Blog Post https://pola.rs/posts/polars-string-type/ https://twitter.com/RitchieVink/status/1749466861069115790

Facebook/Velox's take: https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/

Related PRs: https://github.com/pola-rs/polars/pull/13748 https://github.com/pola-rs/polars/pull/13839 https://github.com/pola-rs/polars/pull/13489

alamb avatar Jun 14 '24 18:06 alamb

I think we should aim for a first "milestone" of showing improvements for some clickbench queries

alamb avatar Jun 14 '24 19:06 alamb

Will we completely change StringArray to StringViewArray in Datafusion? While I try to utilize StringViewArray in #10976 , I found there is schema mismatched issue UTF8 vs UTF8View. To avoid converting StringViewArray to StringArray, we might need to change the schema to UTF8View overall from logical plan to physical plan. If we need to keep both String and StringView, then we need to think about how to deal with the conversion between these two types.

A more concrete example is

statement ok
create table t(a int, b varchar, c int) as values (1, 'a', 3), (2, 'c', 1), (1, 'c', 2), (1, 'a', 4);

We have the string column b as StringArray and DataType::Utf8 now. Should we convert it to StringViewArray and DataType::Utf8View?

If not, if we somehow want to utilize StringViewArray, how do we minize the cost of conversion between String and StringView?

It seems Polars completely refactor their String to StringView 🤔

jayzhan211 avatar Jun 19 '24 03:06 jayzhan211

Will we completely change StringArray to StringViewArray in Datafusion?

I think since they are two separate types in Arrow we couldn't fully switch to StringView the way polars could as it controls the whole stack. Users could still feed DataFusion StringViewArray from custom TableProviders and would expect StringView at the output.

However what I think we could do is internally to DataFusion (e.g. within the plan, before the final output) is use StringView in the batches that flow through intermediate nodes in the plan.

I found there is schema mismatched issue UTF8 vs UTF8View. To avoid converting StringViewArray to StringArray, we might need to change the schema to UTF8View

Indeed, As you point out, I don't think we can transparently switch to using StringView -- instead we would have to start encoding information in the plans about the new types.

I wonder if we could have a new logical optimzier pass that tried to annotate operations that support it to use StringView in their schema rather than String. Then the ExecutionPlans would know if they were supposed to generate StringView as output or the more traditional StringArray 🤔

Here is an idea of one place to start: https://github.com/apache/datafusion/issues/9403#issuecomment-2178347730

alamb avatar Jun 19 '24 10:06 alamb

I think @XiangpengHao is looking into another place to use StringView which is https://github.com/apache/datafusion/issues/10921 -- where we have a similar idea to use StringView in some sub portion of the plan. Here is more info about the optimizer pass idea: https://github.com/apache/datafusion/issues/10921#issuecomment-2178364966

alamb avatar Jun 19 '24 10:06 alamb

I think this project is going pretty well :bowtie:

We are at the point of starting to implement some basic functions using StringView.

  • [ ] https://github.com/apache/datafusion/issues/11024
  • [ ] https://github.com/apache/datafusion/issues/11025

alamb avatar Jun 20 '24 10:06 alamb

Now that we have upgraded to arrow 52.1.0, I think we could merge the string-view branch to main. I'll try and make a PR if no one beats me to it

alamb avatar Jul 09 '24 20:07 alamb

An update here on the plan from @XiangpengHao:

  • He has local changes (that also require some additional features that will be released in arrow 52.2.0) that show significant performance improvements for TPCH and ClickBench queries
  • The hope is that these proposals are all up for review by the end of this week
  • So by the time arrow 52.2.0 is released (early August 2024) we'll be able to add an option that makes DataFusion use StringView when reading from Parquet / filtering
  • We also plan to write a blog post about this work / adventure

I for one am very excited

alamb avatar Jul 15 '24 15:07 alamb

Update: we merged the basic support for StringView in https://github.com/apache/datafusion/pull/11402

I have created a branch to collect any changes that rely on the pre-release parquet/arrow 52.2.0 version here: https://github.com/apache/datafusion/tree/string-view2

alamb avatar Jul 16 '24 19:07 alamb

Update here is that we hope to have enough code merged / implemented to report benchmark results by the end of the week. Once arrow-rs 52.2.0 is released https://github.com/apache/arrow-rs/issues/5998 then we'll merge the contents of https://github.com/apache/datafusion/tree/string-view2 to main

Once we do that we'll evaluate what, if anything, is left before we can close this ticket

alamb avatar Jul 22 '24 15:07 alamb

Update: the next major PR in this sequence is ready to merge: https://github.com/apache/datafusion/pull/11667

alamb avatar Jul 29 '24 12:07 alamb

Update here is that we have all the basic support in now but it is behind a feature flag. Let's declare this project a success for now and track the remaining work to enable this feature by default in a separate issue: https://github.com/apache/datafusion/issues/11752

Thanks to everyone who has helped so far. We are close ™️

alamb avatar Jul 31 '24 21:07 alamb

In case anyone wants an overview of adding StringView to DataFusion, here is a presentation and slides from @XiangpengHao Video: https://www.youtube.com/watch?v=RVLshX6fbds Slides: https://drive.google.com/file/d/1Qqd8V6cfS9rSQ_-JrinasQJwI79qlUEV/view?usp=drive_link

alamb avatar Aug 16 '24 11:08 alamb