Zhen Li
**Problem** When there is a large number of rows with the same key on the build side, the `listJoinResults` function becomes very time-consuming. **Design** `appendNextRow`: create a next-row-vector if it...
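A minimal sketch of the "next-row-vector" idea described above, with illustrative names (not Velox's actual structures): all build rows sharing a key are collected into one vector, so the probe can hand back every match at once instead of walking a per-row chain inside `listJoinResults`.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical build-side table: key -> all build row ids with that key
// (the "next-row-vector"), replacing a linked chain of duplicate rows.
struct BuildTable {
  std::unordered_map<int64_t, std::vector<int32_t>> rowsByKey;

  void insert(int64_t key, int32_t rowId) {
    rowsByKey[key].push_back(rowId);
  }

  // Probe returns the whole vector of matches in one lookup; the caller
  // can copy it out in a batch rather than chasing next-row pointers.
  const std::vector<int32_t>* probe(int64_t key) const {
    auto it = rowsByKey.find(key);
    return it == rowsByKey.end() ? nullptr : &it->second;
  }
};
```

The win is that listing results for a heavily duplicated key becomes a single contiguous copy instead of many dependent pointer loads.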
**Problem** Memory leaks may occur when the split preloading feature is enabled and either the connector thread pool is busy or the task fails or is cancelled. We've observed instances of...
Add the `normalize_nan` Spark function. In Spark's optimizer, `NormalizeNaNAndZero` is added for aggregations to normalize -0.0 / 0.0 and the different NaN bit patterns. In Velox, we don't need to handle 0.0 & -0.0,...
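A minimal sketch of the NaN-normalization part, assuming the goal is to map every NaN bit pattern to the single canonical quiet NaN so grouping and comparison treat all NaNs as equal (function name illustrative, not Velox's actual code):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Map any NaN payload to the canonical quiet NaN; pass other values through.
double normalizeNaN(double v) {
  return std::isnan(v) ? std::numeric_limits<double>::quiet_NaN() : v;
}

// Helper to inspect raw bit patterns, used to verify that differently
// encoded NaNs normalize to the identical representation.
uint64_t bits(double v) {
  uint64_t u;
  std::memcpy(&u, &v, sizeof(u));
  return u;
}
```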
Doc: https://spark.apache.org/docs/latest/api/sql/#rint Code: https://github.com/apache/spark/blob/da92293f9ce0be1ac283c4a5d769af550abf7031/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L743
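Per the linked doc, Spark's `rint` returns the double value closest to the argument that is equal to a mathematical integer, with ties rounded to even. A minimal sketch, assuming the default floating-point rounding mode (round-to-nearest-even), under which `std::rint` already has these semantics:

```cpp
#include <cassert>
#include <cmath>

// Illustrative wrapper: round half to even ("banker's rounding"),
// returning a double, matching the Spark rint doc linked above.
double sparkRint(double x) {
  return std::rint(x);
}
```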
Doc: https://spark.apache.org/docs/latest/api/sql/index.html#levenshtein Code: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala#L2220C12-L2220C23 https://github.com/apache/spark/blob/d0385c4a99c172fa3e1ba2d72a65c8632b5c72a9/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L1694C5-L1694C77 There are two differences between the Spark implementation and Presto's: one is that Spark's return type is `int32_t`, and the other is that it accepts a...
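An illustrative Levenshtein distance using the `int32_t` return type noted above. This is a byte-based sketch for clarity; Spark's `UTF8String` version linked above operates on code points.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Classic two-row dynamic-programming Levenshtein distance.
// Returns int32_t to match Spark's return type (Presto returns bigint).
int32_t levenshtein(const std::string& a, const std::string& b) {
  std::vector<int32_t> prev(b.size() + 1), cur(b.size() + 1);
  for (size_t j = 0; j <= b.size(); ++j) {
    prev[j] = static_cast<int32_t>(j); // distance from empty prefix of a
  }
  for (size_t i = 1; i <= a.size(); ++i) {
    cur[0] = static_cast<int32_t>(i);
    for (size_t j = 1; j <= b.size(); ++j) {
      int32_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
      cur[j] = std::min({prev[j] + 1,        // deletion
                         cur[j - 1] + 1,     // insertion
                         prev[j - 1] + cost}); // substitution
    }
    std::swap(prev, cur);
  }
  return prev[b.size()];
}
```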
Apply the prefetching optimization for join probe to the `insertForJoin` function to improve its performance. Fixes: #9732
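A hypothetical illustration of the probe-side prefetching pattern (names and layout are illustrative, not Velox's actual code): compute the slot for a whole batch of hashes first and issue `__builtin_prefetch` for each, so the cache lines are likely resident when the second pass reads them.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// First pass: prefetch every slot the batch will touch.
// Second pass: do the actual reads, now (hopefully) cache hits.
int64_t probeWithPrefetch(const std::vector<int64_t>& table,
                          const std::vector<uint64_t>& hashes) {
  const uint64_t mask = table.size() - 1; // assumes power-of-two table size
  for (uint64_t h : hashes) {
    __builtin_prefetch(&table[h & mask]);
  }
  int64_t sum = 0;
  for (uint64_t h : hashes) {
    sum += table[h & mask];
  }
  return sum;
}
```

Splitting the loop hides the memory latency of random table accesses behind the remaining prefetch issues instead of stalling on each lookup in turn.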
## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) (Fixes: \#ISSUE-ID) ## How was this patch tested? (Please explain how this patch...
Add `__restrict` annotations on the inputs to aid auto-vectorization and speed up Spark comparison functions. Store the result in a `std::vector` and then convert it to the result vector using...
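A sketch of the `__restrict` idea (function name and layout illustrative, not Velox's actual code): promising the compiler that the input and output buffers never alias lets it auto-vectorize the comparison loop.

```cpp
#include <cassert>
#include <cstdint>

// With __restrict, the compiler may assume out[] never overlaps a[] or
// b[], so it can load and compare multiple lanes per iteration (SIMD)
// without re-reading memory after each store.
void lessThan(const int64_t* __restrict a,
              const int64_t* __restrict b,
              bool* __restrict out,
              int n) {
  for (int i = 0; i < n; ++i) {
    out[i] = a[i] < b[i];
  }
}
```

`__restrict` is a GCC/Clang/MSVC extension of C's `restrict` qualifier; without it, the possibility of aliasing forces conservative scalar code.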
Doc: https://docs.databricks.com/en/sql/language-manual/functions/collect_set.html Code: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L39C16-L39C23 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala#L147C12-L147C22 There are three semantic differences from `set_agg`: 1. Null values are excluded. ``` import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ val jsonStr = """{"txn":null}""" val jsonStr1 = """{"txn":null}"""...