datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

fix: Throws an exception when struct type has duplicate keys

Open wForget opened this issue 2 months ago • 4 comments

Which issue does this PR close?

Closes #2457.

Rationale for this change

In the failed test case of https://github.com/apache/datafusion-comet/pull/2444, I found that the struct type with duplicate keys will be deduplicated when converted to arrowType, which causes RowToColumanr to lose some columns.

What changes are included in this PR?

How are these changes tested?

This PR does not avoid the issue, but makes the error message more obvious. It was difficult to create a proper test case for this change until #2444 was merged.

Before this, the failing test case error in #2444 looked like:

[info] - Struct Star Expansion *** FAILED *** (2 seconds, 840 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1230.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1230.0 (TID 1549) (96a252fb2e73 executor driver): java.lang.IndexOutOfBoundsException: Index 2 out of bounds for length 2
[info] 	at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
[info] 	at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
[info] 	at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
[info] 	at java.base/java.util.Objects.checkIndex(Objects.java:374)
[info] 	at java.base/java.util.ArrayList.get(ArrayList.java:459)
[info] 	at org.apache.comet.vector.CometStructVector.getChild(CometStructVector.java:55)
[info] 	at org.apache.spark.sql.vectorized.ColumnarRow.getInt(ColumnarRow.java:113)
[info] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
[info] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
[info] 	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
[info] 	at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)

after this:

[info] - Struct Star Expansion *** FAILED *** (2 seconds, 882 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1325.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1325.0 (TID 1477) (6a0022271dbb executor driver): org.apache.spark.SparkException: Duplicated field names in Arrow Struct are not allowed, got [a, a, b, b].
[info] 	at org.apache.spark.sql.comet.util.Utils$.toArrowField(Utils.scala:161)
[info] 	at org.apache.spark.sql.comet.util.Utils$.$anonfun$toArrowSchema$1(Utils.scala:199)
[info] 	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
[info] 	at scala.collection.Iterator.foreach(Iterator.scala:943)
[info] 	at scala.collection.Iterator.foreach$(Iterator.scala:943)
[info] 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
[info] 	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
[info] 	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)

wForget avatar Sep 26 '25 08:09 wForget

Codecov Report

:x: Patch coverage is 0% with 5 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 58.47%. Comparing base (f09f8af) to head (b10d422). :warning: Report is 544 commits behind head on main.

Files with missing lines Patch % Lines
.../scala/org/apache/spark/sql/comet/util/Utils.scala 0.00% 5 Missing :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2459      +/-   ##
============================================
+ Coverage     56.12%   58.47%   +2.34%     
- Complexity      976     1440     +464     
============================================
  Files           119      146      +27     
  Lines         11743    13511    +1768     
  Branches       2251     2350      +99     
============================================
+ Hits           6591     7900    +1309     
- Misses         4012     4379     +367     
- Partials       1140     1232      +92     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov-commenter avatar Sep 26 '25 08:09 codecov-commenter

@wForget will this PR close also https://github.com/apache/datafusion-comet/issues/2256

comphead avatar Oct 01 '25 19:10 comphead

@wForget will this PR close also #2256

No, this PR only fails prematurely when the schema has duplicate keys during r2c, and does not support struct with duplicate keys for comet.

wForget avatar Oct 02 '25 21:10 wForget

This PR does not make comet plan fallback when there is struct type with duplicate keys, but instead throws an exception. As comment on https://github.com/apache/datafusion-comet/issues/2457#issuecomment-3336738317, we are capable of supporting struct with duplicate keys, so I will mark this PR as draft and look forward to more discussions.

wForget avatar Oct 15 '25 14:10 wForget