datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

fix: remove fallback for int8 and int16 types with native Parquet readers

Open mbutrovich opened this issue 3 months ago • 5 comments

Which issue does this PR close?

Closes #2195, #1348.

Rationale for this change

We previously fell back in a scenario where the native reader produced different results. We've now picked up appropriate DF and arrow-rs dependencies to have apache/arrow#7055.

What changes are included in this PR?

Remove the fallback.

How are these changes tested?

Existing tests.

mbutrovich avatar Aug 19 '25 19:08 mbutrovich

Running Spark SQL tests locally with iceberg_compat. If that passes, I'll likely want to undo #1376 and try again.

mbutrovich avatar Aug 19 '25 19:08 mbutrovich

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 45.92%. Comparing base (f09f8af) to head (d401b98). :warning: Report is 691 commits behind head on main.

Additional details and impacted files
@@              Coverage Diff              @@
##               main    #2197       +/-   ##
=============================================
- Coverage     56.12%   45.92%   -10.20%     
- Complexity      976     1219      +243     
=============================================
  Files           119      164       +45     
  Lines         11743    13989     +2246     
  Branches       2251     2369      +118     
=============================================
- Hits           6591     6425      -166     
- Misses         4012     6525     +2513     
+ Partials       1140     1039      -101     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov-commenter avatar Aug 19 '25 20:08 codecov-commenter

You may want to undo parts of https://github.com/apache/datafusion-comet/pull/1376 to really test the change. The changes to CometTestBase.getPrimitiveTypesParquetSchema were so that the schema would not contain unsigned types if the reader type was one of the datafusion based readers. Undoing that change will be a primary test.

parthchandra avatar Aug 19 '25 20:08 parthchandra

So instead of getting null on the incorrectly written values we now get different values. So garbage in is still garbage out, but it's new garbage!

mbutrovich avatar Aug 19 '25 21:08 mbutrovich

I seem to recall that I tested a badly written file with https://github.com/apache/arrow/issues/7055 and did not get garbage. I'll take a look at it again. Thanks for kicking off this effort @mbutrovich!

parthchandra avatar Aug 19 '25 21:08 parthchandra