datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

[Epic] Complete Initial `StringView` in DataFusion

Open alamb opened this issue 1 year ago • 5 comments

Is your feature request related to a problem or challenge?

This ticket is a follow on to https://github.com/apache/datafusion/issues/10918 where we implemented enough initial support for StringView / BinaryView that we can show some pretty sweet ClickBench results

Describe the solution you'd like

This epic tracks remaining work to complete the "initial" work which I would like to define as "enable using StringView when reading Strings from Parquet by default"

I am sure there will be additional work / support to add StringView to various other features of DataFusion that we can maybe track with another follow on ticket

Required for enabling StringView by default:

  • [ ] https://github.com/apache/datafusion/issues/11682
  • [x] https://github.com/apache/datafusion/issues/11766
  • [x] https://github.com/apache/datafusion/issues/11767
  • [ ] https://github.com/apache/datafusion/issues/12117
  • [ ] https://github.com/apache/datafusion/issues/12118
  • [ ] https://github.com/apache/datafusion/issues/12119
  • [ ] https://github.com/apache/datafusion/issues/12123
  • [ ] https://github.com/apache/datafusion/issues/12180

Could work around but really should be fixed upstream

  • [x] https://github.com/apache/arrow-rs/issues/6162
  • [x] https://github.com/apache/arrow-rs/issues/6164

Additional "Nice to have" Features

  • [ ] https://github.com/apache/datafusion/issues/11790
  • [ ] https://github.com/apache/datafusion/issues/11628
  • [x] https://github.com/apache/datafusion/issues/11881
  • [ ] https://github.com/apache/datafusion/issues/12031
  • [x] https://github.com/apache/datafusion/pull/12015

alamb avatar Jul 31 '24 19:07 alamb

An update here is that @XiangpengHao has a PR with various changes in https://github.com/apache/datafusion/pull/11862

We still need to check that PR and figure out what else is in that PR is needed to be enabled "for real" (with tests, etc)

alamb avatar Aug 22 '24 17:08 alamb

My ideal resolution here is that we end up in the state where the only change we need to enable string view by default is switch the config setting. I will do some more ticket triage later today to outline other items I know of

alamb avatar Aug 22 '24 17:08 alamb

Do we have tickets for regexp binary operators? (like ~, !~...) https://datafusion.apache.org/user-guide/sql/operators.html#op-re-match

I noticed stringview is not supported on them yet and they have separate implementation than regexp functions

Details

/*DML*/CREATE TABLE t0(v0 DOUBLE, v1 DOUBLE, v2 BOOLEAN, v3 BOOLEAN, v4 BOOLEAN, v5 STRING);
/*DML*/INSERT INTO t0(v1, v5, v2) VALUES (0.7183242196192607, 'Tn', true);
/*DML*/CREATE TABLE t0_stringview AS SELECT v0, v1, v2, v3, v4, arrow_cast(v5, 'Utf8View') as v5 FROM t0;

> select v5 ~ 'foo' from t0_stringview;
Internal error: Data type Utf8View not supported for binary_string_array_flag_op_scalar operation 'regexp_is_match' on string array.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

> select regexp_match(v5, 'foo') from t0_stringview;
+--------------------------------------------+
| regexp_match(t0_stringview.v5,Utf8("foo")) |
+--------------------------------------------+
|                                            |
+--------------------------------------------+
1 row(s) fetched.
Elapsed 0.034 seconds.

2010YOUY01 avatar Aug 24 '24 07:08 2010YOUY01

Do we have tickets for regexp binary operators? (like ~, !~...)

Not that I know of -- it would be great to add them

alamb avatar Aug 24 '24 09:08 alamb

Do we have tickets for regexp binary operators? (like , !...)

Filed https://github.com/apache/datafusion/issues/12180

alamb avatar Aug 26 '24 19:08 alamb

I am going to try and polish up PR to enable string view by default PR (with the arrow upgrade and various recent improvements) and see how close we are https://github.com/apache/datafusion/pull/12092

alamb avatar Oct 05 '24 10:10 alamb

StringView by default is finally merged into DataFusion: https://github.com/apache/datafusion/pull/13101 so I am claiming success and completion of this issue

alamb avatar Oct 30 '24 18:10 alamb