Zhen Li
**Problem** When there is a large number of rows with the same key on the build side, the `listJoinResults` function becomes very time-consuming. **Design** `appendNextRow`: create a next-row-vector if it...
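A minimal sketch of the "next-row-vector" idea described above, with illustrative names (not Velox's actual structures): all build rows sharing a key are collected into one vector, so the probe can hand back every match at once instead of walking a per-row chain inside `listJoinResults`.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical build-side table: key -> all build row ids with that key
// (the "next-row-vector"), replacing a linked chain of duplicate rows.
struct BuildTable {
  std::unordered_map<int64_t, std::vector<int32_t>> rowsByKey;

  void insert(int64_t key, int32_t rowId) {
    rowsByKey[key].push_back(rowId);
  }

  // Probe returns the whole vector of matches in one lookup; the caller
  // can copy it out in a batch rather than chasing next-row pointers.
  const std::vector<int32_t>* probe(int64_t key) const {
    auto it = rowsByKey.find(key);
    return it == rowsByKey.end() ? nullptr : &it->second;
  }
};
```

The win is that listing results for a heavily duplicated key becomes a single contiguous copy instead of many dependent pointer loads.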
**Problem** Memory leaks may occur when the split preloading feature is enabled and either the connector thread pool is busy or the task fails or is cancelled. We've observed instances of...
Add the `normalize_nan` Spark function. In Spark's optimizer, `NormalizeNaNAndZero` is added for aggregations to normalize -0.0 / 0.0 and the different NaN bit patterns. In Velox, we don't need to handle 0.0 & -0.0,...
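A minimal sketch of the NaN-normalization part, assuming the goal is to map every NaN bit pattern to the single canonical quiet NaN so grouping and comparison treat all NaNs as equal (function name illustrative, not Velox's actual code):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Map any NaN payload to the canonical quiet NaN; pass other values through.
double normalizeNaN(double v) {
  return std::isnan(v) ? std::numeric_limits<double>::quiet_NaN() : v;
}

// Helper to inspect raw bit patterns, used to verify that differently
// encoded NaNs normalize to the identical representation.
uint64_t bits(double v) {
  uint64_t u;
  std::memcpy(&u, &v, sizeof(u));
  return u;
}
```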
Doc: https://spark.apache.org/docs/latest/api/sql/#rint Code: https://github.com/apache/spark/blob/da92293f9ce0be1ac283c4a5d769af550abf7031/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L743
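Per the linked doc, Spark's `rint` returns the double value closest to the argument that is equal to a mathematical integer, with ties rounded to even. A minimal sketch, assuming the default floating-point rounding mode (round-to-nearest-even), under which `std::rint` already has these semantics:

```cpp
#include <cassert>
#include <cmath>

// Illustrative wrapper: round half to even ("banker's rounding"),
// returning a double, matching the Spark rint doc linked above.
double sparkRint(double x) {
  return std::rint(x);
}
```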
Doc: https://spark.apache.org/docs/latest/api/sql/index.html#levenshtein Code: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L2220C12-L2220C23 https://github.com/apache/spark/blob/d0385c4a99c172fa3e1ba2d72a65c8632b5c72a9/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L1694C5-L1694C77 There are two differences between the Spark implementation and Presto's: one is that Spark's return type is `int32_t`, and the other is that it accepts a...
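An illustrative Levenshtein distance using the `int32_t` return type noted above. This is a byte-based sketch for clarity; Spark's `UTF8String` version linked above operates on code points.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Classic two-row dynamic-programming Levenshtein distance.
// Returns int32_t to match Spark's return type (Presto returns bigint).
int32_t levenshtein(const std::string& a, const std::string& b) {
  std::vector<int32_t> prev(b.size() + 1), cur(b.size() + 1);
  for (size_t j = 0; j <= b.size(); ++j) {
    prev[j] = static_cast<int32_t>(j); // distance from empty prefix of a
  }
  for (size_t i = 1; i <= a.size(); ++i) {
    cur[0] = static_cast<int32_t>(i);
    for (size_t j = 1; j <= b.size(); ++j) {
      int32_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
      cur[j] = std::min({prev[j] + 1,        // deletion
                         cur[j - 1] + 1,     // insertion
                         prev[j - 1] + cost}); // substitution
    }
    std::swap(prev, cur);
  }
  return prev[b.size()];
}
```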
Apply the prefetching optimization for join probe to the `insertForJoin` function to improve its performance. Fixes: #9732
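A hypothetical illustration of the probe-side prefetching pattern (names and layout are illustrative, not Velox's actual code): compute the slot for a whole batch of hashes first and issue `__builtin_prefetch` for each, so the cache lines are likely resident when the second pass reads them.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// First pass: prefetch every slot the batch will touch.
// Second pass: do the actual reads, now (hopefully) cache hits.
int64_t probeWithPrefetch(const std::vector<int64_t>& table,
                          const std::vector<uint64_t>& hashes) {
  const uint64_t mask = table.size() - 1; // assumes power-of-two table size
  for (uint64_t h : hashes) {
    __builtin_prefetch(&table[h & mask]);
  }
  int64_t sum = 0;
  for (uint64_t h : hashes) {
    sum += table[h & mask];
  }
  return sum;
}
```

Splitting the loop hides the memory latency of random table accesses behind the remaining prefetch issues instead of stalling on each lookup in turn.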
## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) (Fixes: \#ISSUE-ID) ## How was this patch tested? (Please explain how this patch...
Add `__restrict` annotations on the inputs to aid auto-vectorization and speed up Spark comparison functions. Store the result in a `std::vector` and then convert it to the result vector using...
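A sketch of the `__restrict` idea (function name and layout illustrative, not Velox's actual code): promising the compiler that the input and output buffers never alias lets it auto-vectorize the comparison loop.

```cpp
#include <cassert>
#include <cstdint>

// With __restrict, the compiler may assume out[] never overlaps a[] or
// b[], so it can load and compare multiple lanes per iteration (SIMD)
// without re-reading memory after each store.
void lessThan(const int64_t* __restrict a,
              const int64_t* __restrict b,
              bool* __restrict out,
              int n) {
  for (int i = 0; i < n; ++i) {
    out[i] = a[i] < b[i];
  }
}
```

`__restrict` is a GCC/Clang/MSVC extension of C's `restrict` qualifier; without it, the possibility of aliasing forces conservative scalar code.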
Doc: https://docs.databricks.com/en/sql/language-manual/functions/collect_set.html Code: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L39C16-L39C23 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L147C12-L147C22 There are three semantic differences from `set_agg`: 1. Null values are excluded. ``` import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ val jsonStr = """{"txn":null}""" val jsonStr1 = """{"txn":null}"""...