[Bug]: Amoro optimization can result in the input files and the merge…
Why are the changes needed?
Amoro optimization can produce the same number of merged output files as input files; this causes the merge to fail to make progress and keeps re-triggering the merge task.
Close #3855
Brief change log
- For undersizedSegmentFiles that result in only one file after bin-packing:
  - If rewriting pos-delete files is not required, simply add them to rewriteDataFiles.
  - During the second bin-packing, they can then be merged with fragment files or other files.
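The reclassification described above can be sketched roughly as follows. This is a hedged illustration, not the actual Amoro code: the method name `rerouteSingleFileGroups`, the `needsPosDeleteRewrite` predicate, and the use of plain file-name strings are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public class UndersizedSegmentSketch {

    // Hypothetical sketch: a bin-pack group holding a single undersized segment
    // file cannot reduce the file count on its own (1 input -> 1 output), so it
    // is moved into rewriteDataFiles for the second bin-packing pass, where it
    // can be merged with fragment files or other candidates.
    static Set<String> rerouteSingleFileGroups(
            List<List<String>> binPackedGroups, Predicate<String> needsPosDeleteRewrite) {
        Set<String> rewriteDataFiles = new HashSet<>();
        Iterator<List<String>> it = binPackedGroups.iterator();
        while (it.hasNext()) {
            List<String> group = it.next();
            // Only reroute when rewriting pos-delete files is not required.
            if (group.size() == 1 && !needsPosDeleteRewrite.test(group.get(0))) {
                rewriteDataFiles.add(group.get(0));
                it.remove();
            }
        }
        return rewriteDataFiles;
    }

    public static void main(String[] args) {
        List<List<String>> groups = new ArrayList<>();
        groups.add(new ArrayList<>(List.of("a.parquet")));              // would rewrite 1 -> 1
        groups.add(new ArrayList<>(List.of("b.parquet", "c.parquet"))); // real merge, 2 -> 1
        Set<String> rewrite = rerouteSingleFileGroups(groups, f -> false);
        System.out.println(rewrite); // only the single-file group is rerouted
        System.out.println(groups);  // the genuine merge group is kept
    }
}
```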
How was this patch tested?
- [ ] Add some test cases that check the changes thoroughly, including negative and positive cases if possible
- [ ] Add screenshots for manual tests if appropriate
- [ ] Run tests locally before making a pull request
Documentation
- Does this pull request introduce a new feature? (yes / no)
- If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
Please review whether this repair plan is feasible.
Fix Results:
- When the total size of the input files is less than `targetSize`, return `Long.MAX_VALUE` (existing logic).
- When the total size of the input files is greater than or equal to `targetSize` but the average file size is less than `targetSize`, also return `Long.MAX_VALUE` (new logic).
- Only return `targetSize` when the average file size is greater than or equal to `targetSize`.
This ensures that:
- Multiple small files (even when their total size is greater than or equal to 128 MB, as long as the average file size is below 128 MB) can be merged.
- Output with the same number of files as the input is no longer generated.
- The file-merging problem is solved at the task-execution level.
- The fix is complete; small files should now be merged correctly.
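As a minimal sketch of the rule above (the method name `outputTargetSize` is illustrative, not the actual Amoro API): the planner returns `Long.MAX_VALUE` as the output target size whenever the average input file size has not yet reached `targetSize`, which forces the small inputs into a single merged output.

```java
public class TargetSizeHintSketch {

    // Hedged sketch of the target-size rule described above.
    static long outputTargetSize(long totalInputSize, int inputFileCount, long targetSize) {
        if (totalInputSize < targetSize) {
            return Long.MAX_VALUE; // existing logic: total input is already below target
        }
        long avgFileSize = totalInputSize / inputFileCount;
        if (avgFileSize < targetSize) {
            return Long.MAX_VALUE; // new logic: many small files, force them into one output
        }
        return targetSize; // inputs already average at least targetSize, split normally
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long target = 128 * mb;
        System.out.println(outputTargetSize(50 * mb, 10, target));  // small total -> no split
        System.out.println(outputTargetSize(200 * mb, 10, target)); // avg 20 MB -> no split
        System.out.println(outputTargetSize(400 * mb, 2, target));  // avg 200 MB -> split at target
    }
}
```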
@wardlican I have the same concern: could simply setting it to the maximum value result in a single file being excessively large? Could we perhaps refer to Spark's estimation logic? Refer to https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/actions/SizeBasedFileRewritePlanner.java#L199-L206
Okay, I'll look into whether we can use the design from Iceberg for this.
Referencing Spark/Iceberg's SizeBasedFileRewritePlanner, the number of output files is determined adaptively:
| Condition | Decision |
|---|---|
| Remainder > minFileSize (default 0.75 * targetSize) | Keep the remainder as its own file (round up) |
| avgFileSizeWithoutRemainder < 1.1 * targetSize | Distribute the remainder across the other files (round down) |
| Otherwise | Keep the remainder as its own file (round up) |
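The table above can be turned into a small sketch of the `numOutputFiles` decision in Iceberg's SizeBasedFileRewritePlanner. This is a simplified illustration: the `writeMaxFileSize()` cap applied by the real planner is omitted, and the constants assume the 128 MB default target.

```java
public class NumOutputFilesSketch {

    static final long TARGET_SIZE = 128L * 1024 * 1024;            // assumed default target (128 MB)
    static final long MIN_FILE_SIZE = (long) (0.75 * TARGET_SIZE); // assumed default min-file-size ratio

    // Decide how many output files to write for a rewrite task of inputSize bytes.
    static long numOutputFiles(long inputSize) {
        if (inputSize < TARGET_SIZE) {
            return 1; // everything fits into a single file
        }
        long filesRoundedDown = inputSize / TARGET_SIZE;
        long filesRoundedUp = (inputSize + TARGET_SIZE - 1) / TARGET_SIZE;
        long remainder = inputSize % TARGET_SIZE;
        long avgWithoutRemainder = inputSize / filesRoundedDown;
        if (remainder > MIN_FILE_SIZE) {
            // The remainder is big enough to stand as its own file.
            return filesRoundedUp;
        } else if (avgWithoutRemainder < 1.1 * TARGET_SIZE) {
            // Spread the small remainder across the other files without
            // pushing their average size too far past the target.
            return filesRoundedDown;
        } else {
            return filesRoundedUp;
        }
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        System.out.println(numOutputFiles(100 * mb)); // 1: below target
        System.out.println(numOutputFiles(230 * mb)); // 2: 102 MB remainder kept (round up)
        System.out.println(numOutputFiles(260 * mb)); // 2: 4 MB remainder distributed (round down)
        System.out.println(numOutputFiles(300 * mb)); // 3: distributing would exceed 1.1x target
    }
}
```

This avoids the failure mode discussed in the thread: a small remainder never becomes its own tiny output file, yet no single output grows unboundedly large.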
Codecov Report
:x: Patch coverage is 10.71429% with 25 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 22.84%. Comparing base (12e7946) to head (c8a56c5).
:warning: Report is 34 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...pache/amoro/optimizing/IcebergRewriteExecutor.java | 10.71% | 23 Missing and 2 partials :warning: |
:exclamation: There is a different number of reports uploaded between BASE (12e7946) and HEAD (c8a56c5). Click for more details.
HEAD has 4 fewer uploads than BASE.
| Flag | BASE (12e7946) | HEAD (c8a56c5) |
|---|---|---|
| core | 4 | 0 |
Additional details and impacted files
@@ Coverage Diff @@
## master #3856 +/- ##
============================================
- Coverage 29.03% 22.84% -6.20%
+ Complexity 3862 2531 -1331
============================================
Files 634 449 -185
Lines 50808 41074 -9734
Branches 6545 5788 -757
============================================
- Hits 14753 9382 -5371
+ Misses 34997 30882 -4115
+ Partials 1058 810 -248
| Flag | Coverage Δ | |
|---|---|---|
| core | `?` | |
| trino | `22.84% <10.71%> (?)` | |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.