datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Improve StringView support for SUBSTR

Open Kev1n8 opened this issue 1 year ago • 6 comments

Which issue does this PR close?

Closes #12031 Closes https://github.com/apache/datafusion/pull/12033

Rationale for this change

What changes are included in this PR?

  1. When it comes to StringVIew in SUBSTR, operate directly on the views instead of generating new Strings.
  2. I took the liberty of making the sqllogictest treat Utf8View column as Text.
  3. Added a bench file to record differences in the following conditions: One is both the input and output of SUBSTR is larger or smaller than 12B, for another the input is larger while the output is smaller than 12B. Also, compare Utf8View, Utf8, and LargeUtf8 with each other.

Are these changes tested?

yes

Are there any user-facing changes?

no

Kev1n8 avatar Aug 17 '24 16:08 Kev1n8

The following is my latest bench report of SUBSTR:


SHORTER THAN 12/substr_string_view [size=1024, strlen=12]
                        time:   [29.675 µs 29.964 µs 30.402 µs]
                        change: [-0.1382% +0.9754% +2.5969%] (p = 0.20 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
SHORTER THAN 12/substr_string [size=1024, strlen=12]
                        time:   [59.403 µs 59.823 µs 60.292 µs]
                        change: [+0.2182% +0.9242% +1.7185%] (p = 0.04 < 0.05)
                        Change within noise threshold.
SHORTER THAN 12/substr_large_string [size=1024, strlen=12]
                        time:   [41.462 µs 41.775 µs 42.064 µs]
                        change: [-0.7169% +0.3030% +1.2513%] (p = 0.57 > 0.05)
                        No change in performance detected.

LONGER THAN 12/substr_string_view [size=1024, count=64, strlen=128]
                        time:   [68.409 µs 68.896 µs 69.437 µs]
                        change: [-22.017% -10.028% +1.0113%] (p = 0.23 > 0.05)
                        No change in performance detected.
LONGER THAN 12/substr_string [size=1024, count=64, strlen=128]
                        time:   [101.98 µs 102.20 µs 102.55 µs]
                        change: [-0.1908% +0.0183% +0.3687%] (p = 0.92 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
LONGER THAN 12/substr_large_string [size=1024, count=64, strlen=128]
                        time:   [101.90 µs 102.32 µs 102.80 µs]
                        change: [+0.1753% +0.5640% +1.0536%] (p = 0.01 < 0.05)
                        Change within noise threshold.

SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=1024, count=6, strlen=128]
                        time:   [15.031 µs 15.180 µs 15.331 µs]
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=1024, count=6, strlen=128]
                        time:   [33.738 µs 34.057 µs 34.370 µs]
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=1024, count=6, strlen=128]
                        time:   [33.509 µs 34.122 µs 35.075 µs]
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe

SHORTER THAN 12/substr_string_view [size=4096, strlen=12]
                        time:   [120.22 µs 121.43 µs 122.70 µs]
                        change: [+1.3982% +2.3499% +3.4501%] (p = 0.00 < 0.05)
                        Performance has regressed.
SHORTER THAN 12/substr_string [size=4096, strlen=12]
                        time:   [237.06 µs 251.39 µs 268.81 µs]
                        change: [+1.6838% +7.2415% +14.130%] (p = 0.04 < 0.05)
                        Performance has regressed.
SHORTER THAN 12/substr_large_string [size=4096, strlen=12]
                        time:   [163.27 µs 164.24 µs 165.40 µs]
                        change: [-0.2464% +0.3706% +1.0851%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild

LONGER THAN 12/substr_string_view [size=4096, count=64, strlen=128]
                        time:   [275.69 µs 277.81 µs 280.16 µs]
                        change: [+0.9658% +1.8029% +2.6242%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
LONGER THAN 12/substr_string [size=4096, count=64, strlen=128]
                        time:   [416.84 µs 419.95 µs 423.71 µs]
                        change: [-1.6174% -0.0917% +1.2731%] (p = 0.91 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
LONGER THAN 12/substr_large_string [size=4096, count=64, strlen=128]
                        time:   [419.60 µs 421.94 µs 424.48 µs]
                        change: [+1.5505% +2.0977% +2.7392%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=4096, count=6, strlen=128]
                        time:   [59.933 µs 60.122 µs 60.355 µs]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=4096, count=6, strlen=128]
                        time:   [133.30 µs 134.66 µs 135.97 µs]
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=4096, count=6, strlen=128]
                        time:   [131.37 µs 141.42 µs 158.42 µs]
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

substr_index_array_array_1000
                        time:   [64.130 µs 68.632 µs 73.376 µs]
                        change: [-16.971% -10.021% -2.4160%] (p = 0.01 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

Kev1n8 avatar Aug 17 '24 16:08 Kev1n8

I think the PR is looking good, left some comments

XiangpengHao avatar Aug 18 '24 00:08 XiangpengHao

BTW I hope to help this PR along tomorrow

alamb avatar Aug 20 '24 20:08 alamb

  • Get the str of the view by value_unchecked, then get the [start, end)
  • sub_view = if end-start>12 substr_large_view() else substr_small_view()
  • call appned_view_u128_unchecked(sub_view) on th

I can open a PR for that append_u128 function in arrow if it's considered a good idea, and then adjust this PR.


Refactor the code again into the following behavior: Collect the views using Vec<u128>, record the nulls using NullBufferBuilder, and construct the StringViewArray with new_unchecked.

Kev1n8 avatar Aug 22 '24 16:08 Kev1n8

Thanks @Kev1n8 ad @XiangpengHao -- I am running some benchmarks on this now.

BTW I was thinking that once we have completed https://github.com/apache/datafusion/issues/12119 we could potentially make substr always return Utf8View -- as it should always be better to use the views rather than Utf8Array

alamb avatar Aug 25 '24 15:08 alamb

BTW I was thinking that once we have completed https://github.com/apache/datafusion/issues/12119 we could potentially make substr always return Utf8View -- as it should always be better to use the views rather than Utf8Array

Sounds great.

And I think we can also improve the TRIM function like this PR too, which is quite similar to SUBSTR.

Kev1n8 avatar Aug 28 '24 19:08 Kev1n8

Sorry -- here are the results of my benchmark run -- TLDR is this looks great to me

++ critcmp main stringview-output-for-substr
group                                                                              main                                   stringview-output-for-substr
-----                                                                              ----                                   ----------------------------
LONGER THAN 12/substr_large_string [size=1024, count=64, strlen=128]               1.01    147.7±0.16µs        ? ?/sec    1.00    146.4±0.10µs        ? ?/sec
LONGER THAN 12/substr_large_string [size=4096, count=64, strlen=128]               1.00    590.8±0.14µs        ? ?/sec    1.00    589.1±1.54µs        ? ?/sec
LONGER THAN 12/substr_string [size=1024, count=64, strlen=128]                     1.05    154.9±0.25µs        ? ?/sec    1.00    148.0±0.11µs        ? ?/sec
LONGER THAN 12/substr_string [size=4096, count=64, strlen=128]                     1.08    636.7±0.81µs        ? ?/sec    1.00    591.5±0.27µs        ? ?/sec
LONGER THAN 12/substr_string_view [size=1024, count=64, strlen=128]                1.50    158.2±0.20µs        ? ?/sec    1.00    105.4±0.04µs        ? ?/sec
LONGER THAN 12/substr_string_view [size=4096, count=64, strlen=128]                1.50    629.0±0.18µs        ? ?/sec    1.00    420.6±0.12µs        ? ?/sec
SHORTER THAN 12/substr_large_string [size=1024, strlen=12]                         1.00     56.8±0.02µs        ? ?/sec    1.03     58.3±0.93µs        ? ?/sec
SHORTER THAN 12/substr_large_string [size=4096, strlen=12]                         1.00    223.8±0.30µs        ? ?/sec    1.02    227.3±0.05µs        ? ?/sec
SHORTER THAN 12/substr_string [size=1024, strlen=12]                               1.05     89.3±0.05µs        ? ?/sec    1.00     84.9±0.04µs        ? ?/sec
SHORTER THAN 12/substr_string [size=4096, strlen=12]                               1.05    351.7±0.32µs        ? ?/sec    1.00    335.2±0.10µs        ? ?/sec
SHORTER THAN 12/substr_string_view [size=1024, strlen=12]                          2.82     58.4±0.04µs        ? ?/sec    1.00     20.7±0.01µs        ? ?/sec
SHORTER THAN 12/substr_string_view [size=4096, strlen=12]                          2.79    227.8±0.05µs        ? ?/sec    1.00     81.6±0.04µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=1024, count=6, strlen=128]    1.00     40.9±0.01µs        ? ?/sec    1.00     41.1±0.03µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=4096, count=6, strlen=128]    1.01    157.6±0.13µs        ? ?/sec    1.00    156.3±0.13µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=1024, count=6, strlen=128]          1.06     43.9±0.11µs        ? ?/sec    1.00     41.5±0.08µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=4096, count=6, strlen=128]          1.07    169.7±0.03µs        ? ?/sec    1.00    158.4±0.09µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=1024, count=6, strlen=128]     1.74     43.7±0.01µs        ? ?/sec    1.00     25.1±0.04µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=4096, count=6, strlen=128]     1.69    166.4±0.07µs        ? ?/sec    1.00     98.6±0.06µs        ? ?/sec

alamb avatar Sep 05 '24 12:09 alamb

I also merged up from main for this PR to get the arrow 53 release (and thus the changes to add arrow-data as a dependency were removed)

alamb avatar Sep 05 '24 12:09 alamb

Filed https://github.com/apache/datafusion/issues/12338 to track the idea of using StringViewArray always

alamb avatar Sep 05 '24 12:09 alamb

🚀

alamb avatar Sep 06 '24 20:09 alamb