datafusion
datafusion copied to clipboard
Fix: Internal error in regexp_replace() for some StringView input
Which issue does this PR close?
Closes #12150 & closes #11912
Rationale for this change
What changes are included in this PR?
Currently the StringView data type cannot be cast toas_generic_string_array. The changes here resolve that: https://github.com/apache/datafusion/pull/12203/files#diff-0007996e12eb1b2e63363974424f13330f85a7fc48e0c55381f8a30a0f372931R206
Another point of issue was in the return_type match statement when using a function within the query on Utf8View that appears to be returning a ScalarArray of StringArray
For example:
SELECT
REGEXP_REPLACE(column1_utf8view, lower(column1_utf8view), 'bar') AS k
FROM test;
lower(column1_utf8view) would return a StringArray instead of a StringViewArray which would cause the result of the query to return a StringArray. For now I am implementing a conversion step https://github.com/apache/datafusion/pull/12203/files#diff-0007996e12eb1b2e63363974424f13330f85a7fc48e0c55381f8a30a0f372931R516 to coerce the return value in to being a StringView as required by the signatures.
I think this is okay. Would love to work on optimizing/making this code better if possible though. For now this appears to be the path forward for the error I was seeing.
Are these changes tested?
Are there any user-facing changes?
There are a few things I noted when I reviewed this pull request, some of which may be good to either fix here or in other tickets/PRs:
- The return type includes Utf8View, but the code doesn't do that
- The return type tests for Binary/LargeBinary/BinaryView/Dictionary but those are unsupported signature types
- The signature is missing
Exact(vec![Utf8View, Utf8, Utf8, Utf8]), - I don't see anywhere that the args are actually checked to have 3 or 4 elements anymore
- ~~There are a bunch of macros in the tests that as best as I can see don't seem to be used at all. At least when I remove them the tests in that file all run. Given that I don't think the tests cover anything but StringArray as arg[0]~~ Correction, the macros write out tests, my mistake.
There are a few things I noted when I reviewed this pull request, some of which may be good to either fix here or in other tickets/PRs:
- The return type includes Utf8View, but the code doesn't do that
- The return type tests for Binary/LargeBinary/BinaryView/Dictionary but those are unsupported signature types
- The signature is missing
Exact(vec![Utf8View, Utf8, Utf8, Utf8]),- I don't see anywhere that the args are actually checked to have 3 or 4 elements anymore
- ~There are a bunch of macros in the tests that as best as I can see don't seem to be used at all. At least when I remove them the tests in that file all run. Given that I don't think the tests cover anything but StringArray as arg[0]~ Correction, the macros write out tests, my mistake.
I've added some additional sqllogictests for flags & modified the type signature for
The signature is missing Exact(vec![Utf8View, Utf8, Utf8, Utf8]),
Could you please elaborate more on the return types in the first bullet?
Thanks!
Could you please elaborate more on the return types in the first bullet?
Sure. That was one I saw that wasn't related to anything in this PR but that I think might be improved.
fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
use DataType::*;
Ok(match &arg_types[0] {
LargeUtf8 | LargeBinary => LargeUtf8,
Utf8 | Binary => Utf8,
Utf8View | BinaryView => Utf8View,
Null => Null,
Dictionary(_, t) => match **t {
LargeUtf8 | LargeBinary => LargeUtf8,
Utf8 | Binary => Utf8,
Null => Null,
_ => {
return plan_err!(
"the regexp_replace can only accept strings but got {:?}",
**t
);
}
},
other => {
return plan_err!(
"The regexp_replace function can only accept strings. Got {other}"
);
}
})
}
Unless I'm mistaken that return type method's args are the same as what is passed to invoke (and what is declared in the signature) and that is just Utf8, LargeUtf8 and Utf8View (and the signature doesn't even cover the LargeUtf8 case). Basically, I think it needs to be cleaned up wrt types to be consistent.
Note that the above comment could (should?) be another PR - it was just noted by me when looking at the code for this.
@alamb not sure if you have eyes on this but its a StringView related bug 🫡
I ran the regexp benchmarks on this PR and verified that there are no changes for the existing types
++ critcmp main fix/12150-regexprep-err-2
group main
----- ----
regexp_like_1000 1.00 3.8±0.01ms ? ?/sec
regexp_match_1000 1.00 4.6±0.01ms ? ?/sec
regexp_replace_1000 1.00 3.5±0.01ms ? ?/sec
🚀