
[Bug]: Amoro optimization can produce as many output files as input files, causing the merge to fail and the merge task to be triggered repeatedly

Open · wardlican opened this issue 1 month ago · 7 comments

What happened?

Amoro optimization can produce as many output files as input files, which causes the merge to fail and the merge task to be triggered repeatedly. (See the attached screenshot.)

Affects Versions

0.7.1

What table formats are you seeing the problem on?

Iceberg

What engines are you seeing the problem on?

Spark

How to reproduce

No response

Relevant log output


Anything else

No response

Are you willing to submit a PR?

  • [x] Yes I am willing to submit a PR!

Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

— wardlican, Oct 31 '25 11:10

Root Cause of the Problem

In IcebergRewriteExecutor.targetSize(), when the total size of the input files is greater than or equal to targetSize, the method returns targetSize (instead of Long.MAX_VALUE). This causes:

  • Even if each input file is small (e.g., an average of 3 MB), targetSize() still returns 128 MB whenever the total input size reaches 128 MB.
  • UnpartitionedWriter may then roll over to a new output file each time it reaches 128 MB.
  • When every input file is small, each input file can end up producing a separate output file, so no merging actually takes place.
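The return-value logic described above can be sketched as a small standalone example. This is an illustration of the reported behavior, not the actual Amoro source; the class name, method names, and the second ("expected") variant are assumptions for demonstration only:

```java
// Minimal sketch of the roll-over threshold logic described in the bullets
// above. NOTE: illustrative only — not the real IcebergRewriteExecutor code.
public class TargetSizeSketch {

    // Reported current behavior: once the combined input size reaches the
    // configured target, the roll-over threshold is capped at targetSize,
    // so the writer keeps starting new output files.
    static long currentTargetSize(long totalInputSize, long targetSize) {
        return totalInputSize >= targetSize ? targetSize : Long.MAX_VALUE;
    }

    // Behavior the reporter expects instead: return Long.MAX_VALUE so a
    // single rewrite task never rolls over and merges all inputs into one
    // output file.
    static long expectedTargetSize(long totalInputSize, long targetSize) {
        return Long.MAX_VALUE;
    }

    public static void main(String[] args) {
        long target = 128L * 1024 * 1024;   // 128 MB target size
        long total = 50L * 3 * 1024 * 1024; // 50 input files of ~3 MB each

        // Total input (150 MB) exceeds the target, so the current logic caps
        // the roll-over threshold at 128 MB instead of disabling roll-over.
        System.out.println(currentTargetSize(total, target) == target);           // true
        System.out.println(expectedTargetSize(total, target) == Long.MAX_VALUE);  // true
    }
}
```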

— wardlican, Oct 31 '25 11:10

I have some questions — if the issue occurred with the segment files, why is the input file size less than 1 MB? Also, if the segment doesn't have any delete files that need to be merged, why are those files still included in the rewrite set? Or perhaps the problem lies in the bin-packing logic for undersized segment files?

— xxubai, Nov 06 '25 02:11

> I have some questions — if the issue occurred with the segment files, why is the input file size less than 1 MB? Also, if the segment doesn't have any delete files that need to be merged, why are those files still included in the rewrite set? Or perhaps the problem lies in the bin-packing logic for undersized segment files?

This is a test scenario I created, with the target file size set to around 50 KB instead of the default 128 MB. In this example, each data file is approximately 12 KB.

— wardlican, Nov 07 '25 09:11

@wardlican I couldn't reproduce this scenario. I think the main issue is related to the data; an inaccurate calculation of the writer's roll-over size is causing this behavior.

In your scenario, Binpack uses the value of self-optimizing.max-task-size-bytes instead of self-optimizing.target-size.

If possible, try setting self-optimizing.max-task-size-bytes to match self-optimizing.target-size (50KB).
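The suggested alignment of the two properties might look like the following table-property fragment. The property keys come from the comment above; the byte values are illustrative (51200 bytes = 50 KB):

```properties
# Align the per-task size cap with the target file size (both 50 KB here)
self-optimizing.target-size=51200
self-optimizing.max-task-size-bytes=51200
```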

— zhongqishang, Nov 19 '25 03:11

> @wardlican I couldn't reproduce this scenario. I think the main issue is related to the data; an inaccurate calculation of the writer's roll-over size is causing this behavior.
>
> In your scenario, Binpack uses the value of self-optimizing.max-task-size-bytes instead of self-optimizing.target-size.
>
> If possible, try setting self-optimizing.max-task-size-bytes to match self-optimizing.target-size (50 KB).

Using self-optimizing.max-task-size-bytes does not solve this problem. We also encountered this issue in our production environment, where both self-optimizing.target-size and self-optimizing.max-task-size-bytes are at their defaults of 128 MB.

— wardlican, Nov 19 '25 11:11

> @wardlican I couldn't reproduce this scenario. I think the main issue is related to the data; an inaccurate calculation of the writer's roll-over size is causing this behavior. In your scenario, Binpack uses the value of self-optimizing.max-task-size-bytes instead of self-optimizing.target-size. If possible, try setting self-optimizing.max-task-size-bytes to match self-optimizing.target-size (50 KB).

> Using self-optimizing.max-task-size-bytes does not solve this problem. We also encountered this issue in our production environment, where both self-optimizing.target-size and self-optimizing.max-task-size-bytes are at their defaults of 128 MB.

I tried using #3856 to avoid this problem.

— wardlican, Nov 19 '25 11:11

This is a rather serious problem, as it can lead to endless re-merging of the table and continuous growth of its metadata.

— wardlican, Nov 19 '25 11:11