datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

feat: Support for StringSplit

Open Shekharrajak opened this issue 1 month ago • 6 comments

Fixes https://github.com/apache/datafusion-comet/issues/2707

We have scalar function: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#string-to-array

Shekharrajak avatar Nov 13 '25 19:11 Shekharrajak

In the past I think we've encountered differences in Java and Rust's regex engines wrt graphemes. Could we get some larger UTF-8 characters in the tests?

mbutrovich avatar Nov 13 '25 20:11 mbutrovich

In the past I think we've encountered differences in Java and Rust's regex engines wrt graphemes. Could we get some larger UTF-8 characters in the tests?

We probably need to fall back to Spark unless this config is enabled:

  val COMET_REGEXP_ALLOW_INCOMPATIBLE: ConfigEntry[Boolean] =
    conf("spark.comet.regexp.allowIncompatible")
      .category(CATEGORY_EXEC)
      .doc("Comet is not currently fully compatible with Spark for all regular expressions. " +
        s"Set this config to true to allow them anyway. $COMPAT_GUIDE.")
      .booleanConf
      .createWithDefault(false)

andygrove avatar Nov 13 '25 20:11 andygrove

Thanks @Shekharrajak for you contribution, please add a function to the fuzztesting kit, similar to #2755

Thanks! Added in commit https://github.com/apache/datafusion-comet/pull/2772/commits/8eddd29184e4b287136a940cab20c2894c57fc0e

Shekharrajak avatar Nov 13 '25 20:11 Shekharrajak

In the past I think we've encountered differences in Java and Rust's regex engines wrt graphemes. Could we get some larger UTF-8 characters in the tests?

Added tests 987b646

Shekharrajak avatar Nov 13 '25 20:11 Shekharrajak

In the past I think we've encountered differences in Java and Rust's regex engines wrt graphemes. Could we get some larger UTF-8 characters in the tests?

We probably need to fall back to Spark unless this config is enabled:

  val COMET_REGEXP_ALLOW_INCOMPATIBLE: ConfigEntry[Boolean] =
    conf("spark.comet.regexp.allowIncompatible")
      .category(CATEGORY_EXEC)
      .doc("Comet is not currently fully compatible with Spark for all regular expressions. " +
        s"Set this config to true to allow them anyway. $COMPAT_GUIDE.")
      .booleanConf
      .createWithDefault(false)

How can we check if it is not falling back to Spark's JVM execution? @andygrove

Shekharrajak avatar Nov 13 '25 20:11 Shekharrajak

Codecov Report

:x: Patch coverage is 15.38462% with 11 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 54.98%. Comparing base (f09f8af) to head (2199b5f). :warning: Report is 783 commits behind head on main.

Files with missing lines Patch % Lines
...rc/main/scala/org/apache/comet/serde/strings.scala 8.33% 11 Missing :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2772      +/-   ##
============================================
- Coverage     56.12%   54.98%   -1.15%     
- Complexity      976     1329     +353     
============================================
  Files           119      167      +48     
  Lines         11743    15472    +3729     
  Branches       2251     2559     +308     
============================================
+ Hits           6591     8507    +1916     
- Misses         4012     5739    +1727     
- Partials       1140     1226      +86     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov-commenter avatar Nov 17 '25 21:11 codecov-commenter

Thanks @Shekharrajak Looks like there are rust check failures https://github.com/apache/datafusion-comet/actions/runs/19441578149/job/55638326879?pr=2772

Perhaps you can try cargo fix?

kazuyukitanimura avatar Dec 01 '25 17:12 kazuyukitanimura