spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-49203][SQL] Add expression for `java.util.Arrays.binarySearch`

Open panbingkun opened this issue 1 year ago • 15 comments

What changes were proposed in this pull request?

The pr aims to an expression array_binary_search for java.util.Arrays.binarySearch.

Why are the changes needed?

We can use it to implement histogram plot in the client side (no longer need to depend on mllib's Bucketizer.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add new UT.

Was this patch authored or co-authored using generative AI tooling?

No.

panbingkun avatar Aug 13 '24 11:08 panbingkun

Does any classic database, data warehouse, or competitor product have this function?

LuciferYang avatar Aug 13 '24 12:08 LuciferYang

It seems strange to me to expand SQL APIs in this way

yaooqinn avatar Aug 13 '24 15:08 yaooqinn

@panbingkun I have two questions:

  • We already have array_position which does something similar. Granted it is 1 based, and it does not return the position of where you could insert your data. Why do we need this?
  • The relies on the data being sorted. A less savy user does not understand this. Either the function will be hard to use, or we have to add this to the type system. How should we tackle this?

hvanhovell avatar Aug 13 '24 16:08 hvanhovell

We already have array_position which does something similar. Granted it is 1 based, and it does not return the position of where you could insert your data. Why do we need this?

hi @hvanhovell the array_position doesn't do the same thing if the value is not in the array:

array_position(array(1.0, 2.0, 3.0), 1.1) -> return 0
binary_search(array(1.0, 2.0, 3.0), 1.1) -> return -2

@LuciferYang @yaooqinn this is a dedicated expression for ML and PySpark Plotting, we want to implement the histogram without depending on mllib which is not compatible with Spark Connect.

zhengruifeng avatar Aug 14 '24 01:08 zhengruifeng

We already have array_position which does something similar. Granted it is 1 based, and it does not return the position of where you could insert your data. Why do we need this?

hi @hvanhovell the array_position doesn't do the same thing if the value is not in the array:

array_position(array(1.0, 2.0, 3.0), 1.1) -> return 0
binary_search(array(1.0, 2.0, 3.0), 1.1) -> return -2

@LuciferYang @yaooqinn this is a dedicated expression for ML and PySpark Plotting, we want to implement the histogram without depending on mllib which is not compatible with Spark Connect.

For this case, I have added the corresponding UT for data type byte, short, int, long, float and double in CollectionExpressionsSuite#ArrayBinarySearch

panbingkun avatar Aug 14 '24 01:08 panbingkun

binary_search(array(1.0, 2.0, 3.0), 1.1) -> return -2

Is this expected behavior? In general it's more efficient to use expressions than UDF, but we should also care about the coherence of existing functions.

cloud-fan avatar Aug 14 '24 05:08 cloud-fan

Is this expected behavior?

yes, this is expected. This is needed to figure out which interval the value fails in https://github.com/apache/spark/blob/8a4890ddb5541acacdde2b42fa0ff8781290c907/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala#L273-L301

we should also care about the coherence of existing functions.

since this is for internal purposes only, maybe we can relax this requirement.

zhengruifeng avatar Aug 14 '24 11:08 zhengruifeng

is it possible to implement this expression with StaticInvoke? We can add a bunch of overloads that are specified for variant primitive types.

cloud-fan avatar Aug 14 '24 17:08 cloud-fan

is it possible to implement this expression with StaticInvoke? We can add a bunch of overloads that are specified for variant primitive types.

Let me investigate it.

panbingkun avatar Aug 15 '24 03:08 panbingkun

is it possible to implement this expression with StaticInvoke? We can add a bunch of overloads that are specified for variant primitive types.

Updated.

panbingkun avatar Aug 15 '24 11:08 panbingkun

@cloud-fan Thank you very much for your suggestions and tips! Also I think the expression SortArray can also be simplified using StaticInvoke. https://github.com/apache/spark/blob/b98ac058ab8800dfa1fa66aef67ea7e3e96677cd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala#L1048-L1224 What do you think?

panbingkun avatar Aug 15 '24 11:08 panbingkun

https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/java/util/Arrays.java#L2146-L2173

    /**
     * Searches the specified array for the specified object using the binary
     * search algorithm. The array must be sorted into ascending order
     * according to the
     * {@linkplain Comparable natural ordering}
     * of its elements (as by the
     * {@link #sort(Object[])} method) prior to making this call.
     * If it is not sorted, the results are undefined.
     * (If the array contains elements that are not mutually comparable (for
     * example, strings and integers), it <i>cannot</i> be sorted according
     * to the natural ordering of its elements, hence results are undefined.)
     * If the array contains multiple
     * elements equal to the specified object, there is no guarantee which
     * one will be found.
     *
     * @param a the array to be searched
     * @param key the value to be searched for
     * @return index of the search key, if it is contained in the array;
     *         otherwise, <code>(-(<i>insertion point</i>) - 1)</code>.  The
     *         <i>insertion point</i> is defined as the point at which the
     *         key would be inserted into the array: the index of the first
     *         element greater than the key, or {@code a.length} if all
     *         elements in the array are less than the specified key.  Note
     *         that this guarantees that the return value will be &gt;= 0 if
     *         and only if the key is found.
     * @throws ClassCastException if the search key is not comparable to the
     *         elements of the array.
     */

From the comments of the java.util.Arrays#binarySearch method:

  1. The input array must be in ascending order before calling this method; if the array is not sorted, the results are undefined. How we ensured this(As an internal function, perhaps we could add some emphatic comments? The rest would have to be left to code review (CR)?)?
  2. If the array contains multiple elements equal to the specified object, there is no guarantee which one will be found. Is this within our expectations?

LuciferYang avatar Aug 16 '24 03:08 LuciferYang

https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/java/util/Arrays.java#L2146-L2173

    /**
     * Searches the specified array for the specified object using the binary
     * search algorithm. The array must be sorted into ascending order
     * according to the
     * {@linkplain Comparable natural ordering}
     * of its elements (as by the
     * {@link #sort(Object[])} method) prior to making this call.
     * If it is not sorted, the results are undefined.
     * (If the array contains elements that are not mutually comparable (for
     * example, strings and integers), it <i>cannot</i> be sorted according
     * to the natural ordering of its elements, hence results are undefined.)
     * If the array contains multiple
     * elements equal to the specified object, there is no guarantee which
     * one will be found.
     *
     * @param a the array to be searched
     * @param key the value to be searched for
     * @return index of the search key, if it is contained in the array;
     *         otherwise, <code>(-(<i>insertion point</i>) - 1)</code>.  The
     *         <i>insertion point</i> is defined as the point at which the
     *         key would be inserted into the array: the index of the first
     *         element greater than the key, or {@code a.length} if all
     *         elements in the array are less than the specified key.  Note
     *         that this guarantees that the return value will be &gt;= 0 if
     *         and only if the key is found.
     * @throws ClassCastException if the search key is not comparable to the
     *         elements of the array.
     */

From the comments of the java.util.Arrays#binarySearch method:

  1. The input array must be in ascending order before calling this method; if the array is not sorted, the results are undefined. How we ensured this?
  2. If the array contains multiple elements equal to the specified object, there is no guarantee which one will be found. Is this within our expectations?

https://issues.apache.org/jira/browse/SPARK-49203 image The second one, I think, is also within our expectations and has no impact on implementing connect plotting function, right? @zhengruifeng

panbingkun avatar Aug 16 '24 03:08 panbingkun

https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/java/util/Arrays.java#L2146-L2173

    /**
     * Searches the specified array for the specified object using the binary
     * search algorithm. The array must be sorted into ascending order
     * according to the
     * {@linkplain Comparable natural ordering}
     * of its elements (as by the
     * {@link #sort(Object[])} method) prior to making this call.
     * If it is not sorted, the results are undefined.
     * (If the array contains elements that are not mutually comparable (for
     * example, strings and integers), it <i>cannot</i> be sorted according
     * to the natural ordering of its elements, hence results are undefined.)
     * If the array contains multiple
     * elements equal to the specified object, there is no guarantee which
     * one will be found.
     *
     * @param a the array to be searched
     * @param key the value to be searched for
     * @return index of the search key, if it is contained in the array;
     *         otherwise, <code>(-(<i>insertion point</i>) - 1)</code>.  The
     *         <i>insertion point</i> is defined as the point at which the
     *         key would be inserted into the array: the index of the first
     *         element greater than the key, or {@code a.length} if all
     *         elements in the array are less than the specified key.  Note
     *         that this guarantees that the return value will be &gt;= 0 if
     *         and only if the key is found.
     * @throws ClassCastException if the search key is not comparable to the
     *         elements of the array.
     */

From the comments of the java.util.Arrays#binarySearch method:

  1. The input array must be in ascending order before calling this method; if the array is not sorted, the results are undefined. How we ensured this?
  2. If the array contains multiple elements equal to the specified object, there is no guarantee which one will be found. Is this within our expectations?

https://issues.apache.org/jira/browse/SPARK-49203 image The second one, I think, is also within our expectations and has no impact on implementing connect plotting function, right? @zhengruifeng

I think this point should be added to the function comments and emphasize that the input must be sorted in ascending order.

LuciferYang avatar Aug 16 '24 03:08 LuciferYang

https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/java/util/Arrays.java#L2146-L2173

    /**
     * Searches the specified array for the specified object using the binary
     * search algorithm. The array must be sorted into ascending order
     * according to the
     * {@linkplain Comparable natural ordering}
     * of its elements (as by the
     * {@link #sort(Object[])} method) prior to making this call.
     * If it is not sorted, the results are undefined.
     * (If the array contains elements that are not mutually comparable (for
     * example, strings and integers), it <i>cannot</i> be sorted according
     * to the natural ordering of its elements, hence results are undefined.)
     * If the array contains multiple
     * elements equal to the specified object, there is no guarantee which
     * one will be found.
     *
     * @param a the array to be searched
     * @param key the value to be searched for
     * @return index of the search key, if it is contained in the array;
     *         otherwise, <code>(-(<i>insertion point</i>) - 1)</code>.  The
     *         <i>insertion point</i> is defined as the point at which the
     *         key would be inserted into the array: the index of the first
     *         element greater than the key, or {@code a.length} if all
     *         elements in the array are less than the specified key.  Note
     *         that this guarantees that the return value will be &gt;= 0 if
     *         and only if the key is found.
     * @throws ClassCastException if the search key is not comparable to the
     *         elements of the array.
     */

From the comments of the java.util.Arrays#binarySearch method:

  1. The input array must be in ascending order before calling this method; if the array is not sorted, the results are undefined. How we ensured this?
  2. If the array contains multiple elements equal to the specified object, there is no guarantee which one will be found. Is this within our expectations?

https://issues.apache.org/jira/browse/SPARK-49203 image The second one, I think, is also within our expectations and has no impact on implementing connect plotting function, right? @zhengruifeng

I think this point should be added to the function comments and emphasize that the input must be sorted in ascending order.

Okay.

panbingkun avatar Aug 16 '24 05:08 panbingkun

I am going to merge this one in 2 days, if no more comments. it is only for internal usage, so should be safe.

zhengruifeng avatar Sep 02 '24 12:09 zhengruifeng

@panbingkun +1 to simplify SortArray.

cloud-fan avatar Sep 03 '24 01:09 cloud-fan

@panbingkun +1 to simplify SortArray.

Great, let me try to do it. Thanks!

panbingkun avatar Sep 03 '24 01:09 panbingkun

thanks, merged to master

zhengruifeng avatar Sep 03 '24 06:09 zhengruifeng

Thanks all!

panbingkun avatar Sep 03 '24 07:09 panbingkun