drill
drill copied to clipboard
DRILL-8136: Overload existing Math UDFs to allow for VARCHAR input
DRILL-8136: Overload existing Math UDFs to allow for VARCHAR input
Description
Wrote UDFs to overload existing Math functions in Drill so that Math UDFs accept input in type VARCHAR.
Documentation
All functions here that accept input, except for the bitwise functions, are overloaded.
Testing
Tested each function with unit tests in TestMathFunctionsVarcharInput.java
@cgivre, when you get a chance, would you please be able to look at this PR? I plan to add methods to overload the ones that have multiple inputs in order to account for all possible input combinations. Thank you!
Ok, thank you for the review. I used VarCharHolders
because that's what had been used for the Math UDFs that I am overloading.
Yes, I'll add some unit tests with null
values. Also, I'm going to include a check for null
in the functions so that null
values don't cause crashes.
@estherbuchwalter Please don't forget the license headers so that this PR passes the CI.
@cgivre Thanks for the reminder about the headers!
As of now the random() method that accepts a seed only accepts input of type int
. Does it make sense to allow the random(int seed) to accept type double
/decimal
as well?
This pull request introduces 3 alerts when merging 6df58ea6672aeedf79f73c2908b09068ba6df96f into ebd027bfd72f35267d5b15db3307079400383ca6 - view on LGTM.com
new alerts:
- 3 for Boxed variable is never null
Thanks for the contribution. Please explain the casting rules assumed here. Is the VARCHAR assumed to be an INT? BIGINT? DOUBLE? DECIMAL? That assumption affects type propagation, and is why normal SQL requires the user to specify the type via a CAST operation.
The functions are designed to accept a VARCHAR which can be an INT, BIGINT, DOUBLE, or DECIMAL.
@estherbuchwalter, thanks for the explanation. The function must, however, declare an output type. The other input must be cast to the correct type. For example, if the VARCHAR is to be cast to a DECIMAL, so must the other argument, and the output is DECIMAL.
On the other hand, if we infer the type from the numeric type, then 10 + "10.1" will produce 20, not 20.1.
SQL is a statically typed language: every row must produce the output type. Drill does defer the type decision to runtime, but the type decision becomes frozen on the first row.
To be clear, what is the output type for:
VARCHAR + INT --> ? VARCHAR + DOUBLE --> ? VARCHAR + DECIMAL -> ?
Since the VARCHAR can be anything, the safest result is if the output is always VARDECIMAL (unlimited precision) or DOUBLE (largest range of values, with some truncation). For any other output type, we can invent an input that will make the output type invalid.
Again, this is why SQL normally requires:
(CAST col AS INT) + 10 --> BIGINT
So that the user declares that the VARCHAR "col" column is converable to INT, and thus the normal INT addition type promotion rules apply.
Or, have you come up with some trick to work around this issue?
@paul-rogers, thank you for your question. Please let me know if this explanation satisfies your question.
In order to maintain precision, each input is parsed to Java type DOUBLE
(with a few exceptions). I am working on minimizing, or if possible, completely getting rid of the exceptions.
First there is a check to ensure that the input is a valid number (accepting INT
or DOUBLE
), and then the input is parsed to a DOUBLE
.
Then, the Math function is performed and the function's output is a Float8Holder
, except for the sign()
function where the output is an IntHolder
.
Are you suggesting that it would be better to return a VARDECIMAL
than a Float8Holder
?
To clarify, the purpose of this PR is to make Drill more resilient with different types of input to Math functions. The goal here is that Drill should not throw an error, and thereby put a halt on all of the query results, when one value within a large dataset is a different data type than what the Math function originally expected. Instead it should either know how to handle the different data type or return a value that conveys that the input was invalid (like null
, NaN
, etc).
For example, before this PR the following SQL query would send a Drill error: SELECT abs('-10')
. However, with this PR, the correct value would be returned (it would be returned as a Float8Holder).
Also, it seems that this functionality would be consistent with industry standards. I have tried different types of input to Math functions in MySQL and Postgres databases and each of them accepts VARCHAR input (although they do handle it a little differently from each other).
This pull request introduces 4 alerts when merging 12c1d08df784eb8147c28ecac2698f49ad46381e into bcabe9306b3a55d802e8e67d8ca798fe19ebfd46 - view on LGTM.com
new alerts:
- 3 for Boxed variable is never null
- 1 for Dereferenced variable may be null
This pull request introduces 3 alerts when merging 908971ba55be181523d55e7e1b35f260b03bc411 into bfc1b74d020908128c09ec7b37ac9a2ac1126860 - view on LGTM.com
new alerts:
- 3 for Boxed variable is never null
This pull request introduces 2 alerts when merging 71814839d7b5b9f71aa89859cad91207b09c3d0e into bfc1b74d020908128c09ec7b37ac9a2ac1126860 - view on LGTM.com
new alerts:
- 2 for Boxed variable is never null
I had a thought... one thing we might want to try is to change the null handling from NULL_IF_NULL
to INTERNAL
and then see if we can get the functions to return null
in the event of unparsable input.
This PR is not necessary once https://github.com/apache/drill/pull/2638 is merged.