datafusion-comet
datafusion-comet copied to clipboard
fix: Throws an exception when struct type has duplicate keys
Which issue does this PR close?
Closes #2457.
Rationale for this change
In the failed test case of https://github.com/apache/datafusion-comet/pull/2444, I found that the struct type with duplicate keys will be deduplicated when converted to arrowType, which causes RowToColumanr to lose some columns.
What changes are included in this PR?
How are these changes tested?
This PR does not avoid the issue, but makes the error message more obvious. It was difficult to create a proper test case for this change until #2444 was merged.
Before this, the failing test case error in #2444 looked like:
[info] - Struct Star Expansion *** FAILED *** (2 seconds, 840 milliseconds)
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1230.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1230.0 (TID 1549) (96a252fb2e73 executor driver): java.lang.IndexOutOfBoundsException: Index 2 out of bounds for length 2
[info] at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
[info] at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
[info] at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
[info] at java.base/java.util.Objects.checkIndex(Objects.java:374)
[info] at java.base/java.util.ArrayList.get(ArrayList.java:459)
[info] at org.apache.comet.vector.CometStructVector.getChild(CometStructVector.java:55)
[info] at org.apache.spark.sql.vectorized.ColumnarRow.getInt(ColumnarRow.java:113)
[info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
[info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
[info] at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
[info] at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
after this:
[info] - Struct Star Expansion *** FAILED *** (2 seconds, 882 milliseconds)
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1325.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1325.0 (TID 1477) (6a0022271dbb executor driver): org.apache.spark.SparkException: Duplicated field names in Arrow Struct are not allowed, got [a, a, b, b].
[info] at org.apache.spark.sql.comet.util.Utils$.toArrowField(Utils.scala:161)
[info] at org.apache.spark.sql.comet.util.Utils$.$anonfun$toArrowSchema$1(Utils.scala:199)
[info] at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
[info] at scala.collection.Iterator.foreach(Iterator.scala:943)
[info] at scala.collection.Iterator.foreach$(Iterator.scala:943)
[info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
[info] at scala.collection.IterableLike.foreach(IterableLike.scala:74)
[info] at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
Codecov Report
:x: Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 58.47%. Comparing base (f09f8af) to head (b10d422).
:warning: Report is 544 commits behind head on main.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../scala/org/apache/spark/sql/comet/util/Utils.scala | 0.00% | 5 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #2459 +/- ##
============================================
+ Coverage 56.12% 58.47% +2.34%
- Complexity 976 1440 +464
============================================
Files 119 146 +27
Lines 11743 13511 +1768
Branches 2251 2350 +99
============================================
+ Hits 6591 7900 +1309
- Misses 4012 4379 +367
- Partials 1140 1232 +92
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
@wForget will this PR close also https://github.com/apache/datafusion-comet/issues/2256
@wForget will this PR close also #2256
No, this PR only fails prematurely when the schema has duplicate keys during r2c, and does not support struct with duplicate keys for comet.
This PR does not make comet plan fallback when there is struct type with duplicate keys, but instead throws an exception. As comment on https://github.com/apache/datafusion-comet/issues/2457#issuecomment-3336738317, we are capable of supporting struct with duplicate keys, so I will mark this PR as draft and look forward to more discussions.