feat: Support for StringSplit
Fixes https://github.com/apache/datafusion-comet/issues/2707
We have scalar function: https://datafusion.apache.org/user-guide/sql/scalar_functions.html#string-to-array
In the past I think we've encountered differences in Java and Rust's regex engines wrt graphemes. Could we get some larger UTF-8 characters in the tests?
In the past I think we've encountered differences in Java and Rust's regex engines wrt graphemes. Could we get some larger UTF-8 characters in the tests?
We probably need to fall back to Spark unless this config is enabled:
val COMET_REGEXP_ALLOW_INCOMPATIBLE: ConfigEntry[Boolean] =
conf("spark.comet.regexp.allowIncompatible")
.category(CATEGORY_EXEC)
.doc("Comet is not currently fully compatible with Spark for all regular expressions. " +
s"Set this config to true to allow them anyway. $COMPAT_GUIDE.")
.booleanConf
.createWithDefault(false)
Thanks @Shekharrajak for you contribution, please add a function to the fuzztesting kit, similar to #2755
Thanks! Added in commit https://github.com/apache/datafusion-comet/pull/2772/commits/8eddd29184e4b287136a940cab20c2894c57fc0e
In the past I think we've encountered differences in Java and Rust's regex engines wrt graphemes. Could we get some larger UTF-8 characters in the tests?
Added tests 987b646
In the past I think we've encountered differences in Java and Rust's regex engines wrt graphemes. Could we get some larger UTF-8 characters in the tests?
We probably need to fall back to Spark unless this config is enabled:
val COMET_REGEXP_ALLOW_INCOMPATIBLE: ConfigEntry[Boolean] = conf("spark.comet.regexp.allowIncompatible") .category(CATEGORY_EXEC) .doc("Comet is not currently fully compatible with Spark for all regular expressions. " + s"Set this config to true to allow them anyway. $COMPAT_GUIDE.") .booleanConf .createWithDefault(false)
How can we check if it is not falling back to Spark's JVM execution? @andygrove
Codecov Report
:x: Patch coverage is 15.38462% with 11 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 54.98%. Comparing base (f09f8af) to head (2199b5f).
:warning: Report is 783 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rc/main/scala/org/apache/comet/serde/strings.scala | 8.33% | 11 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #2772 +/- ##
============================================
- Coverage 56.12% 54.98% -1.15%
- Complexity 976 1329 +353
============================================
Files 119 167 +48
Lines 11743 15472 +3729
Branches 2251 2559 +308
============================================
+ Hits 6591 8507 +1916
- Misses 4012 5739 +1727
- Partials 1140 1226 +86
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
Thanks @Shekharrajak Looks like there are rust check failures https://github.com/apache/datafusion-comet/actions/runs/19441578149/job/55638326879?pr=2772
Perhaps you can try cargo fix?