DRILL-8136: Overload existing Math UDFs to allow for VARCHAR input

Description

Wrote UDFs to overload existing Math functions in Drill so that Math UDFs accept input in type VARCHAR.

Documentation

All functions here that accept input, except for the bitwise functions, are overloaded.

Testing

Tested each function with unit tests in TestMathFunctionsVarcharInput.java

Feb 15 '22 15:02 estherbuchwalter

@cgivre, when you get a chance, would you please be able to look at this PR? I plan to add methods to overload the ones that have multiple inputs in order to account for all possible input combinations. Thank you!

Feb 15 '22 15:02 estherbuchwalter

Ok, thank you for the review. I used VarCharHolders because that's what had been used for the Math UDFs that I am overloading. Yes, I'll add some unit tests with null values. Also, I'm going to include a check for null in the functions so that null values don't cause crashes.

Feb 16 '22 21:02 estherbuchwalter

@estherbuchwalter Please don't forget the license headers so that this PR passes the CI.

Feb 18 '22 14:02 cgivre

@cgivre Thanks for the reminder about the headers!

Feb 18 '22 14:02 estherbuchwalter

As of now the random() method that accepts a seed only accepts input of type int. Does it make sense to allow the random(int seed) to accept type double/decimal as well?

Feb 18 '22 15:02 estherbuchwalter

This pull request introduces 3 alerts when merging 6df58ea6672aeedf79f73c2908b09068ba6df96f into ebd027bfd72f35267d5b15db3307079400383ca6 - view on LGTM.com

new alerts:

3 for Boxed variable is never null

Feb 18 '22 16:02 lgtm-com[bot]

Thanks for the contribution. Please explain the casting rules assumed here. Is the VARCHAR assumed to be an INT? BIGINT? DOUBLE? DECIMAL? That assumption affects type propagation, and is why normal SQL requires the user to specify the type via a CAST operation.

Feb 21 '22 22:02 paul-rogers

The functions are designed to accept a VARCHAR which can be an INT, BIGINT, DOUBLE, or DECIMAL.

Feb 21 '22 23:02 estherbuchwalter

@estherbuchwalter, thanks for the explanation. The function must, however, declare an output type. The other input must be cast to the correct type. For example, if the VARCHAR is to be cast to a DECIMAL, so must the other argument, and the output is DECIMAL.

On the other hand, if we infer the type from the numeric type, then 10 + "10.1" will produce 20, not 20.1.

SQL is a statically typed language: every row must produce the output type. Drill does defer the type decision to runtime, but the type decision becomes frozen on the first row.

To be clear, what is the output type for:

VARCHAR + INT --> ? VARCHAR + DOUBLE --> ? VARCHAR + DECIMAL -> ?

Since the VARCHAR can be anything, the safest result is if the output is always VARDECIMAL (unlimited precision) or DOUBLE (largest range of values, with some truncation). For any other output type, we can invent an input that will make the output type invalid.

Again, this is why SQL normally requires:

(CAST col AS INT) + 10 --> BIGINT

So that the user declares that the VARCHAR "col" column is converable to INT, and thus the normal INT addition type promotion rules apply.

Or, have you come up with some trick to work around this issue?

Feb 21 '22 23:02 paul-rogers

@paul-rogers, thank you for your question. Please let me know if this explanation satisfies your question.

In order to maintain precision, each input is parsed to Java type DOUBLE (with a few exceptions). I am working on minimizing, or if possible, completely getting rid of the exceptions. First there is a check to ensure that the input is a valid number (accepting INT or DOUBLE), and then the input is parsed to a DOUBLE. Then, the Math function is performed and the function's output is a Float8Holder, except for the sign() function where the output is an IntHolder. Are you suggesting that it would be better to return a VARDECIMAL than a Float8Holder?

Feb 22 '22 14:02 estherbuchwalter

To clarify, the purpose of this PR is to make Drill more resilient with different types of input to Math functions. The goal here is that Drill should not throw an error, and thereby put a halt on all of the query results, when one value within a large dataset is a different data type than what the Math function originally expected. Instead it should either know how to handle the different data type or return a value that conveys that the input was invalid (like null, NaN, etc). For example, before this PR the following SQL query would send a Drill error: SELECT abs('-10'). However, with this PR, the correct value would be returned (it would be returned as a Float8Holder).

Also, it seems that this functionality would be consistent with industry standards. I have tried different types of input to Math functions in MySQL and Postgres databases and each of them accepts VARCHAR input (although they do handle it a little differently from each other).

Feb 22 '22 19:02 estherbuchwalter

This pull request introduces 4 alerts when merging 12c1d08df784eb8147c28ecac2698f49ad46381e into bcabe9306b3a55d802e8e67d8ca798fe19ebfd46 - view on LGTM.com

new alerts:

3 for Boxed variable is never null
1 for Dereferenced variable may be null

Feb 28 '22 20:02 lgtm-com[bot]

This pull request introduces 3 alerts when merging 908971ba55be181523d55e7e1b35f260b03bc411 into bfc1b74d020908128c09ec7b37ac9a2ac1126860 - view on LGTM.com

new alerts:

3 for Boxed variable is never null

Mar 01 '22 19:03 lgtm-com[bot]

This pull request introduces 2 alerts when merging 71814839d7b5b9f71aa89859cad91207b09c3d0e into bfc1b74d020908128c09ec7b37ac9a2ac1126860 - view on LGTM.com

new alerts:

2 for Boxed variable is never null

Mar 02 '22 21:03 lgtm-com[bot]

I had a thought... one thing we might want to try is to change the null handling from NULL_IF_NULL to INTERNAL and then see if we can get the functions to return null in the event of unparsable input.

Mar 02 '22 22:03 cgivre

This PR is not necessary once https://github.com/apache/drill/pull/2638 is merged.

Sep 12 '22 04:09 cgivre

drill drill copied to clipboard