datafusion
datafusion copied to clipboard
Improve StringView support for SUBSTR
Which issue does this PR close?
Closes #12031 Closes https://github.com/apache/datafusion/pull/12033
Rationale for this change
What changes are included in this PR?
- When it comes to
StringVIewinSUBSTR, operate directly on the views instead of generating newStrings. - I took the liberty of making the sqllogictest treat
Utf8Viewcolumn asText. - Added a bench file to record differences in the following conditions:
One is both the input and output of
SUBSTRis larger or smaller than 12B, for another the input is larger while the output is smaller than 12B. Also, compareUtf8View,Utf8, andLargeUtf8with each other.
Are these changes tested?
yes
Are there any user-facing changes?
no
The following is my latest bench report of SUBSTR:
SHORTER THAN 12/substr_string_view [size=1024, strlen=12]
time: [29.675 µs 29.964 µs 30.402 µs]
change: [-0.1382% +0.9754% +2.5969%] (p = 0.20 > 0.05)
No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high severe
SHORTER THAN 12/substr_string [size=1024, strlen=12]
time: [59.403 µs 59.823 µs 60.292 µs]
change: [+0.2182% +0.9242% +1.7185%] (p = 0.04 < 0.05)
Change within noise threshold.
SHORTER THAN 12/substr_large_string [size=1024, strlen=12]
time: [41.462 µs 41.775 µs 42.064 µs]
change: [-0.7169% +0.3030% +1.2513%] (p = 0.57 > 0.05)
No change in performance detected.
LONGER THAN 12/substr_string_view [size=1024, count=64, strlen=128]
time: [68.409 µs 68.896 µs 69.437 µs]
change: [-22.017% -10.028% +1.0113%] (p = 0.23 > 0.05)
No change in performance detected.
LONGER THAN 12/substr_string [size=1024, count=64, strlen=128]
time: [101.98 µs 102.20 µs 102.55 µs]
change: [-0.1908% +0.0183% +0.3687%] (p = 0.92 > 0.05)
No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high severe
LONGER THAN 12/substr_large_string [size=1024, count=64, strlen=128]
time: [101.90 µs 102.32 µs 102.80 µs]
change: [+0.1753% +0.5640% +1.0536%] (p = 0.01 < 0.05)
Change within noise threshold.
SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=1024, count=6, strlen=128]
time: [15.031 µs 15.180 µs 15.331 µs]
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=1024, count=6, strlen=128]
time: [33.738 µs 34.057 µs 34.370 µs]
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=1024, count=6, strlen=128]
time: [33.509 µs 34.122 µs 35.075 µs]
Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) high mild
1 (10.00%) high severe
SHORTER THAN 12/substr_string_view [size=4096, strlen=12]
time: [120.22 µs 121.43 µs 122.70 µs]
change: [+1.3982% +2.3499% +3.4501%] (p = 0.00 < 0.05)
Performance has regressed.
SHORTER THAN 12/substr_string [size=4096, strlen=12]
time: [237.06 µs 251.39 µs 268.81 µs]
change: [+1.6838% +7.2415% +14.130%] (p = 0.04 < 0.05)
Performance has regressed.
SHORTER THAN 12/substr_large_string [size=4096, strlen=12]
time: [163.27 µs 164.24 µs 165.40 µs]
change: [-0.2464% +0.3706% +1.0851%] (p = 0.34 > 0.05)
No change in performance detected.
Found 2 outliers among 10 measurements (20.00%)
2 (20.00%) high mild
LONGER THAN 12/substr_string_view [size=4096, count=64, strlen=128]
time: [275.69 µs 277.81 µs 280.16 µs]
change: [+0.9658% +1.8029% +2.6242%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
LONGER THAN 12/substr_string [size=4096, count=64, strlen=128]
time: [416.84 µs 419.95 µs 423.71 µs]
change: [-1.6174% -0.0917% +1.2731%] (p = 0.91 > 0.05)
No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
LONGER THAN 12/substr_large_string [size=4096, count=64, strlen=128]
time: [419.60 µs 421.94 µs 424.48 µs]
change: [+1.5505% +2.0977% +2.7392%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=4096, count=6, strlen=128]
time: [59.933 µs 60.122 µs 60.355 µs]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=4096, count=6, strlen=128]
time: [133.30 µs 134.66 µs 135.97 µs]
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=4096, count=6, strlen=128]
time: [131.37 µs 141.42 µs 158.42 µs]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high severe
substr_index_array_array_1000
time: [64.130 µs 68.632 µs 73.376 µs]
change: [-16.971% -10.021% -2.4160%] (p = 0.01 < 0.05)
Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
4 (4.00%) high mild
10 (10.00%) high severe
I think the PR is looking good, left some comments
BTW I hope to help this PR along tomorrow
- Get the str of the view by
value_unchecked, then get the [start, end)sub_view = if end-start>12 substr_large_view() else substr_small_view()- call
appned_view_u128_unchecked(sub_view)on th
I can open a PR for that append_u128 function in arrow if it's considered a good idea, and then adjust this PR.
Refactor the code again into the following behavior:
Collect the views using Vec<u128>, record the nulls using NullBufferBuilder, and construct the StringViewArray with new_unchecked.
Thanks @Kev1n8 ad @XiangpengHao -- I am running some benchmarks on this now.
BTW I was thinking that once we have completed https://github.com/apache/datafusion/issues/12119 we could potentially make substr always return Utf8View -- as it should always be better to use the views rather than Utf8Array
BTW I was thinking that once we have completed https://github.com/apache/datafusion/issues/12119 we could potentially make substr always return Utf8View -- as it should always be better to use the views rather than Utf8Array
Sounds great.
And I think we can also improve the TRIM function like this PR too, which is quite similar to SUBSTR.
Sorry -- here are the results of my benchmark run -- TLDR is this looks great to me
++ critcmp main stringview-output-for-substr
group main stringview-output-for-substr
----- ---- ----------------------------
LONGER THAN 12/substr_large_string [size=1024, count=64, strlen=128] 1.01 147.7±0.16µs ? ?/sec 1.00 146.4±0.10µs ? ?/sec
LONGER THAN 12/substr_large_string [size=4096, count=64, strlen=128] 1.00 590.8±0.14µs ? ?/sec 1.00 589.1±1.54µs ? ?/sec
LONGER THAN 12/substr_string [size=1024, count=64, strlen=128] 1.05 154.9±0.25µs ? ?/sec 1.00 148.0±0.11µs ? ?/sec
LONGER THAN 12/substr_string [size=4096, count=64, strlen=128] 1.08 636.7±0.81µs ? ?/sec 1.00 591.5±0.27µs ? ?/sec
LONGER THAN 12/substr_string_view [size=1024, count=64, strlen=128] 1.50 158.2±0.20µs ? ?/sec 1.00 105.4±0.04µs ? ?/sec
LONGER THAN 12/substr_string_view [size=4096, count=64, strlen=128] 1.50 629.0±0.18µs ? ?/sec 1.00 420.6±0.12µs ? ?/sec
SHORTER THAN 12/substr_large_string [size=1024, strlen=12] 1.00 56.8±0.02µs ? ?/sec 1.03 58.3±0.93µs ? ?/sec
SHORTER THAN 12/substr_large_string [size=4096, strlen=12] 1.00 223.8±0.30µs ? ?/sec 1.02 227.3±0.05µs ? ?/sec
SHORTER THAN 12/substr_string [size=1024, strlen=12] 1.05 89.3±0.05µs ? ?/sec 1.00 84.9±0.04µs ? ?/sec
SHORTER THAN 12/substr_string [size=4096, strlen=12] 1.05 351.7±0.32µs ? ?/sec 1.00 335.2±0.10µs ? ?/sec
SHORTER THAN 12/substr_string_view [size=1024, strlen=12] 2.82 58.4±0.04µs ? ?/sec 1.00 20.7±0.01µs ? ?/sec
SHORTER THAN 12/substr_string_view [size=4096, strlen=12] 2.79 227.8±0.05µs ? ?/sec 1.00 81.6±0.04µs ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=1024, count=6, strlen=128] 1.00 40.9±0.01µs ? ?/sec 1.00 41.1±0.03µs ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=4096, count=6, strlen=128] 1.01 157.6±0.13µs ? ?/sec 1.00 156.3±0.13µs ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=1024, count=6, strlen=128] 1.06 43.9±0.11µs ? ?/sec 1.00 41.5±0.08µs ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=4096, count=6, strlen=128] 1.07 169.7±0.03µs ? ?/sec 1.00 158.4±0.09µs ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=1024, count=6, strlen=128] 1.74 43.7±0.01µs ? ?/sec 1.00 25.1±0.04µs ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=4096, count=6, strlen=128] 1.69 166.4±0.07µs ? ?/sec 1.00 98.6±0.06µs ? ?/sec
I also merged up from main for this PR to get the arrow 53 release (and thus the changes to add arrow-data as a dependency were removed)
Filed https://github.com/apache/datafusion/issues/12338 to track the idea of using StringViewArray always
🚀