amoro icon indicating copy to clipboard operation
amoro copied to clipboard

[Bug]: Amoro optimization can result in the input files and the merge…

Open wardlican opened this issue 1 month ago • 1 comments

Why are the changes needed?

Amoro optimization can result in the input files and the merged output files having the same number of files, and this can cause the merge to fail and keep triggering the merge task.

Close #3855

Brief change log

  1. For undersizedSegmentFiles that result in only one file after bin-packing:
  2. If rewriting pos delete is not required, simply add them to rewriteDataFiles.
  3. During the second bin-packing, they can be merged with fragment files or other files.

How was this patch tested?

  • [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

  • [ ] Add screenshots for manual tests if appropriate

  • [ ] Run test locally before making a pull request

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

wardlican avatar Oct 31 '25 12:10 wardlican

Please help review whether this repair plan is feasible.

Fix Results:

  • When the total size of the input files is less than targetSize, return Long.MAX_VALUE (existing logic).
  • When the total size of the input files is greater than or equal to targetSize but the average file size is less than targetSize, also return Long.MAX_VALUE (new logic).
  • Only return targetSize when the average file size is greater than or equal to targetSize.

This ensures that:

multi small files (even if the total size is greater than or equal to 128MB, but the average file size is less than 128MB) can be merged.

  • Avoiding the generation of output files with the same number of input files.
  • Solving the file merging problem at the task execution level.
  • The fix is ​​complete; small files should now be merged correctly.

wardlican avatar Oct 31 '25 12:10 wardlican

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Dec 22 '25 00:12 github-actions[bot]

@wardlican I have the same concern: could simply setting it to the maximum value result in a single file being excessively large? Could we perhaps refer to Spark's estimation logic? Refer to https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/actions/SizeBasedFileRewritePlanner.java#L199-L206

Okay, I'll look into whether we can use the design from Iceberg for this.

wardlican avatar Dec 22 '25 11:12 wardlican

Referencing Spark/Iceberg's SizeBasedFileRewritePlanner, the number of output files is determined intelligently:

Condition Decision
Remainder > minFileSize (default 0.75 * targetSize) Retain the remainder file (round up)
avgFileSizeWithoutRemainder < 1.1 * targetSize Distribute the remainder to other files (rounding down)
other Retain the remainder file (round up)

wardlican avatar Dec 23 '25 01:12 wardlican

Codecov Report

:x: Patch coverage is 10.71429% with 25 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 22.84%. Comparing base (12e7946) to head (c8a56c5). :warning: Report is 34 commits behind head on master.

Files with missing lines Patch % Lines
...pache/amoro/optimizing/IcebergRewriteExecutor.java 10.71% 23 Missing and 2 partials :warning:

:exclamation: There is a different number of reports uploaded between BASE (12e7946) and HEAD (c8a56c5). Click for more details.

HEAD has 4 uploads less than BASE
Flag BASE (12e7946) HEAD (c8a56c5)
core 4 0
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #3856      +/-   ##
============================================
- Coverage     29.03%   22.84%   -6.20%     
+ Complexity     3862     2531    -1331     
============================================
  Files           634      449     -185     
  Lines         50808    41074    -9734     
  Branches       6545     5788     -757     
============================================
- Hits          14753     9382    -5371     
+ Misses        34997    30882    -4115     
+ Partials       1058      810     -248     
Flag Coverage Δ
core ?
trino 22.84% <10.71%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov-commenter avatar Dec 23 '25 02:12 codecov-commenter