[SPARK-49203][SQL] Add expression for `java.util.Arrays.binarySearch`
What changes were proposed in this pull request?
The pr aims to an expression array_binary_search for java.util.Arrays.binarySearch.
Why are the changes needed?
We can use it to implement histogram plot in the client side (no longer need to depend on mllib's Bucketizer.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Add new UT.
Was this patch authored or co-authored using generative AI tooling?
No.
Does any classic database, data warehouse, or competitor product have this function?
It seems strange to me to expand SQL APIs in this way
@panbingkun I have two questions:
- We already have
array_positionwhich does something similar. Granted it is 1 based, and it does not return the position of where you could insert your data. Why do we need this? - The relies on the data being sorted. A less savy user does not understand this. Either the function will be hard to use, or we have to add this to the type system. How should we tackle this?
We already have array_position which does something similar. Granted it is 1 based, and it does not return the position of where you could insert your data. Why do we need this?
hi @hvanhovell the array_position doesn't do the same thing if the value is not in the array:
array_position(array(1.0, 2.0, 3.0), 1.1) -> return 0
binary_search(array(1.0, 2.0, 3.0), 1.1) -> return -2
@LuciferYang @yaooqinn this is a dedicated expression for ML and PySpark Plotting, we want to implement the histogram without depending on mllib which is not compatible with Spark Connect.
We already have array_position which does something similar. Granted it is 1 based, and it does not return the position of where you could insert your data. Why do we need this?
hi @hvanhovell the array_position doesn't do the same thing if the value is not in the array:
array_position(array(1.0, 2.0, 3.0), 1.1) -> return 0 binary_search(array(1.0, 2.0, 3.0), 1.1) -> return -2@LuciferYang @yaooqinn this is a dedicated expression for ML and PySpark Plotting, we want to implement the histogram without depending on mllib which is not compatible with Spark Connect.
For this case, I have added the corresponding UT for data type byte, short, int, long, float and double in CollectionExpressionsSuite#ArrayBinarySearch
binary_search(array(1.0, 2.0, 3.0), 1.1) -> return -2
Is this expected behavior? In general it's more efficient to use expressions than UDF, but we should also care about the coherence of existing functions.
Is this expected behavior?
yes, this is expected. This is needed to figure out which interval the value fails in https://github.com/apache/spark/blob/8a4890ddb5541acacdde2b42fa0ff8781290c907/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala#L273-L301
we should also care about the coherence of existing functions.
since this is for internal purposes only, maybe we can relax this requirement.
is it possible to implement this expression with StaticInvoke? We can add a bunch of overloads that are specified for variant primitive types.
is it possible to implement this expression with
StaticInvoke? We can add a bunch of overloads that are specified for variant primitive types.
Let me investigate it.
is it possible to implement this expression with
StaticInvoke? We can add a bunch of overloads that are specified for variant primitive types.
Updated.
@cloud-fan
Thank you very much for your suggestions and tips!
Also I think the expression SortArray can also be simplified using StaticInvoke.
https://github.com/apache/spark/blob/b98ac058ab8800dfa1fa66aef67ea7e3e96677cd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1048-L1224
What do you think?
https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/java/util/Arrays.java#L2146-L2173
/**
* Searches the specified array for the specified object using the binary
* search algorithm. The array must be sorted into ascending order
* according to the
* {@linkplain Comparable natural ordering}
* of its elements (as by the
* {@link #sort(Object[])} method) prior to making this call.
* If it is not sorted, the results are undefined.
* (If the array contains elements that are not mutually comparable (for
* example, strings and integers), it <i>cannot</i> be sorted according
* to the natural ordering of its elements, hence results are undefined.)
* If the array contains multiple
* elements equal to the specified object, there is no guarantee which
* one will be found.
*
* @param a the array to be searched
* @param key the value to be searched for
* @return index of the search key, if it is contained in the array;
* otherwise, <code>(-(<i>insertion point</i>) - 1)</code>. The
* <i>insertion point</i> is defined as the point at which the
* key would be inserted into the array: the index of the first
* element greater than the key, or {@code a.length} if all
* elements in the array are less than the specified key. Note
* that this guarantees that the return value will be >= 0 if
* and only if the key is found.
* @throws ClassCastException if the search key is not comparable to the
* elements of the array.
*/
From the comments of the java.util.Arrays#binarySearch method:
- The input array must be in ascending order before calling this method; if the array is not sorted, the results are undefined. How we ensured this(As an internal function, perhaps we could add some emphatic comments? The rest would have to be left to code review (CR)?)?
- If the array contains multiple elements equal to the specified object, there is no guarantee which one will be found. Is this within our expectations?
https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/java/util/Arrays.java#L2146-L2173
/** * Searches the specified array for the specified object using the binary * search algorithm. The array must be sorted into ascending order * according to the * {@linkplain Comparable natural ordering} * of its elements (as by the * {@link #sort(Object[])} method) prior to making this call. * If it is not sorted, the results are undefined. * (If the array contains elements that are not mutually comparable (for * example, strings and integers), it <i>cannot</i> be sorted according * to the natural ordering of its elements, hence results are undefined.) * If the array contains multiple * elements equal to the specified object, there is no guarantee which * one will be found. * * @param a the array to be searched * @param key the value to be searched for * @return index of the search key, if it is contained in the array; * otherwise, <code>(-(<i>insertion point</i>) - 1)</code>. The * <i>insertion point</i> is defined as the point at which the * key would be inserted into the array: the index of the first * element greater than the key, or {@code a.length} if all * elements in the array are less than the specified key. Note * that this guarantees that the return value will be >= 0 if * and only if the key is found. * @throws ClassCastException if the search key is not comparable to the * elements of the array. */From the comments of the
java.util.Arrays#binarySearchmethod:
- The input array must be in ascending order before calling this method; if the array is not sorted, the results are undefined. How we ensured this?
- If the array contains multiple elements equal to the specified object, there is no guarantee which one will be found. Is this within our expectations?
https://issues.apache.org/jira/browse/SPARK-49203
The second one, I think, is also within our expectations and has no impact on implementing
connect plotting function, right? @zhengruifeng
https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/java/util/Arrays.java#L2146-L2173
/** * Searches the specified array for the specified object using the binary * search algorithm. The array must be sorted into ascending order * according to the * {@linkplain Comparable natural ordering} * of its elements (as by the * {@link #sort(Object[])} method) prior to making this call. * If it is not sorted, the results are undefined. * (If the array contains elements that are not mutually comparable (for * example, strings and integers), it <i>cannot</i> be sorted according * to the natural ordering of its elements, hence results are undefined.) * If the array contains multiple * elements equal to the specified object, there is no guarantee which * one will be found. * * @param a the array to be searched * @param key the value to be searched for * @return index of the search key, if it is contained in the array; * otherwise, <code>(-(<i>insertion point</i>) - 1)</code>. The * <i>insertion point</i> is defined as the point at which the * key would be inserted into the array: the index of the first * element greater than the key, or {@code a.length} if all * elements in the array are less than the specified key. Note * that this guarantees that the return value will be >= 0 if * and only if the key is found. * @throws ClassCastException if the search key is not comparable to the * elements of the array. */From the comments of the
java.util.Arrays#binarySearchmethod:
- The input array must be in ascending order before calling this method; if the array is not sorted, the results are undefined. How we ensured this?
- If the array contains multiple elements equal to the specified object, there is no guarantee which one will be found. Is this within our expectations?
https://issues.apache.org/jira/browse/SPARK-49203
The second one, I think, is also within our expectations and has no impact on implementing
connect plotting function, right? @zhengruifeng
I think this point should be added to the function comments and emphasize that the input must be sorted in ascending order.
https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/java/util/Arrays.java#L2146-L2173
/** * Searches the specified array for the specified object using the binary * search algorithm. The array must be sorted into ascending order * according to the * {@linkplain Comparable natural ordering} * of its elements (as by the * {@link #sort(Object[])} method) prior to making this call. * If it is not sorted, the results are undefined. * (If the array contains elements that are not mutually comparable (for * example, strings and integers), it <i>cannot</i> be sorted according * to the natural ordering of its elements, hence results are undefined.) * If the array contains multiple * elements equal to the specified object, there is no guarantee which * one will be found. * * @param a the array to be searched * @param key the value to be searched for * @return index of the search key, if it is contained in the array; * otherwise, <code>(-(<i>insertion point</i>) - 1)</code>. The * <i>insertion point</i> is defined as the point at which the * key would be inserted into the array: the index of the first * element greater than the key, or {@code a.length} if all * elements in the array are less than the specified key. Note * that this guarantees that the return value will be >= 0 if * and only if the key is found. * @throws ClassCastException if the search key is not comparable to the * elements of the array. */From the comments of the
java.util.Arrays#binarySearchmethod:
- The input array must be in ascending order before calling this method; if the array is not sorted, the results are undefined. How we ensured this?
- If the array contains multiple elements equal to the specified object, there is no guarantee which one will be found. Is this within our expectations?
https://issues.apache.org/jira/browse/SPARK-49203
The second one, I think, is also within our expectations and has no impact on implementing
connect plotting function, right? @zhengruifengI think this point should be added to the function comments and emphasize that the input must be sorted in ascending order.
Okay.
I am going to merge this one in 2 days, if no more comments. it is only for internal usage, so should be safe.
@panbingkun +1 to simplify SortArray.
@panbingkun +1 to simplify
SortArray.
Great, let me try to do it. Thanks!
thanks, merged to master
Thanks all!
The second one, I think, is also within our expectations and has no impact on implementing