[Bug]: Amoro optimization can produce the same number of merged output files as input files, which causes the merge to fail and keeps triggering the merge task.
What happened?
Amoro optimization can produce the same number of merged output files as input files, which causes the merge to fail and keeps triggering the merge task.
As shown below
Affects Versions
0.7.1
What table formats are you seeing the problem on?
Iceberg
What engines are you seeing the problem on?
Spark
How to reproduce
No response
Relevant log output
Anything else
No response
Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Root Cause of the Problem
In IcebergRewriteExecutor.targetSize(), when the total size of the input files is greater than or equal to targetSize, it returns targetSize (instead of Long.MAX_VALUE). This causes:
- Even if each file is small (e.g., an average of 3MB), targetSize() will still return 128MB as long as the total input size is greater than or equal to 128MB, and UnpartitionedWriter may roll over to a new file when it reaches 128MB.
- If each input file is small, each one may produce a separate output file, making merging impossible (illustrated in the sketch below).
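A minimal sketch of the behavior described above (simplified and hedged; this is not the actual Amoro source, and the input sizes are made up for illustration):

```java
import java.util.Collections;
import java.util.List;

// Simplified illustration of the reported behavior; not the actual Amoro code.
public class TargetSizeSketch {

    static final long TARGET_SIZE = 128L * 1024 * 1024; // self-optimizing.target-size default (128MB)

    // As described above: once the total input size reaches targetSize, the executor
    // returns targetSize as the writer's roll-over threshold instead of Long.MAX_VALUE.
    static long writerTargetSize(List<Long> inputFileSizes) {
        long totalInputSize = inputFileSizes.stream().mapToLong(Long::longValue).sum();
        return totalInputSize >= TARGET_SIZE ? TARGET_SIZE : Long.MAX_VALUE;
    }

    public static void main(String[] args) {
        // 50 input files of ~3MB each: the total (~150MB) exceeds 128MB, so the writer
        // is told to roll over at 128MB; combined with an inaccurate size estimate it
        // may open a new output file for almost every input, leaving the file count unchanged.
        List<Long> smallFiles = Collections.nCopies(50, 3L * 1024 * 1024);
        System.out.println("writer roll-over threshold = " + writerTargetSize(smallFiles));
    }
}
```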
I have some questions — if the issue occurred with the segment files, why is the input file size less than 1 MB? Also, if the segment doesn’t have any delete files that need to be merged, why are those files still included in the rewrite set? Or perhaps the problem lies in the bin-packing for undersized segment files logic?
This is the test scenario I created, with the target file size set to around 50KB instead of the default 128MB. In this example, each data file is approximately 12KB.
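For context, a rough sketch of how a scenario with these sizes might be set up from Spark; the table name (db.small_files) and the row batches are assumptions for illustration, not taken from the actual test:

```java
import org.apache.spark.sql.SparkSession;

// Illustrative only: mirrors the sizes described above (~50KB target, many small data files).
public class SmallFileScenario {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("amoro-small-file-scenario")
                .getOrCreate();

        // Placeholder table; sets the self-optimizing target file size to roughly 50KB.
        spark.sql("CREATE TABLE IF NOT EXISTS db.small_files (id BIGINT, payload STRING) USING iceberg "
                + "TBLPROPERTIES ('self-optimizing.target-size' = '51200')");

        // Each small commit writes a tiny data file, well below the target size.
        for (int i = 0; i < 20; i++) {
            spark.sql("INSERT INTO db.small_files "
                    + "SELECT id, repeat('x', 512) FROM range(" + (i * 20) + ", " + ((i + 1) * 20) + ")");
        }
    }
}
```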
@wardlican I couldn't reproduce this scenario. I think the main issue is related to the data; the inaccurate calculation of the writer's roll-over size is causing this behavior.
In your scenario, Binpack is using the size of self-optimizing.max-task-size-bytes instead of self-optimizing.target-size.
If possible, try setting self-optimizing.max-task-size-bytes to match self-optimizing.target-size (50KB).
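For reference, aligning the two properties as suggested could look roughly like this via Spark SQL; the table name is a placeholder, and the 50KB value mirrors the test scenario above:

```java
import org.apache.spark.sql.SparkSession;

// Illustrative only: "db.small_files" is a placeholder table name.
public class AlignOptimizingSizes {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("align-amoro-optimizing-sizes")
                .getOrCreate();

        spark.sql("ALTER TABLE db.small_files SET TBLPROPERTIES ("
                + "'self-optimizing.target-size' = '51200', "          // ~50KB target file size
                + "'self-optimizing.max-task-size-bytes' = '51200')"); // align task size with the target size
    }
}
```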
Using self-optimizing.max-task-size-bytes does not solve this problem. We also encountered this issue in our production environment; the default values of both self-optimizing.target-size and self-optimizing.max-task-size-bytes are 128MB.
I tried using #3856 to avoid this problem.
This is a rather serious problem, as it can lead to endless table merging and a continuous expansion of metadata.