[Feature]: Support rewriting manifest files automatically for tables
Description
The state of the manifest files can affect the query performance of data lake tables. Amoro should support rewriting manifest files automatically to keep query performance high and storage overhead low.
Use case/motivation
Just like self-optimizing on data files, self-optimizing on manifest files should be enabled through simple configuration. After that, Amoro should automatically determine when to perform the optimization and submit it.
Describe the solution
No response
Subtasks
No response
Related issues
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
I think the benefit of automatically rewriting manifests is small:
From the perspective of managing fragmented files: according to org.apache.iceberg.ManifestMergeManager#mergeGroup, Iceberg already merges small manifests automatically at commit time. If the total count of accumulated small manifest files exceeds 'commit.manifest.min-count-to-merge', or their total size (excluding those from the current commit) exceeds 'commit.manifest.target-size-bytes', an automatic manifest merge (a bin-packing operation) takes place as part of the commit.
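To make the mechanism concrete, here is a simplified model of that commit-time merge decision. This is an illustrative sketch, not the actual `ManifestMergeManager` code: the bin-packing and threshold check are reduced to their essentials, and the property names are only echoed in comments.

```python
TARGET_SIZE_BYTES = 8 * 1024 * 1024   # commit.manifest.target-size-bytes (8 MB default)
MIN_COUNT_TO_MERGE = 100              # commit.manifest.min-count-to-merge (default 100)

def merge_small_manifests(manifest_sizes):
    """Return the resulting manifest sizes after a simulated commit-time merge."""
    # Bin-pack manifests up to the target size.
    bins, current = [], []
    for size in sorted(manifest_sizes):
        if current and sum(current) + size > TARGET_SIZE_BYTES:
            bins.append(current)
            current = []
        current.append(size)
    if current:
        bins.append(current)
    # Rewrite a bin only once it has accumulated enough manifests or bytes;
    # otherwise its small manifests are carried over untouched.
    result = []
    for b in bins:
        if len(b) >= MIN_COUNT_TO_MERGE or sum(b) >= TARGET_SIZE_BYTES:
            result.append(sum(b))   # merged into one manifest
        else:
            result.extend(b)        # below the threshold: left as-is
    return result
```

Under this model, a proactive rewrite can only pre-merge manifests that have not yet crossed either threshold, which is exactly the case argued below to have minimal impact.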
Given this mechanism in Iceberg, automatic manifest rewriting essentially means proactively merging manifest files that have not yet reached the threshold that triggers a merge, and the impact of doing so is minimal. For example, with an 8 MB target size, I collected the following test data:
The tests above were run with single-threaded planning, to exclude the effect of multithreading. In reality, with the default multithreaded planning, having a single rewritten manifest actually slows the plan down because it reduces concurrency. Returning to the test results, we can see that fewer than 100 fragmented manifest files have minimal impact on planning.
From a functional perspective, the common use case for the "rewrite manifests" feature is clustering by partition. Rewriting manifests clustered by partition enables manifest-level file skipping when querying or planning with a partition filter, but the effect is limited. For tables without delete files, planning has no significant performance bottleneck (in my local tests, planning a table with 10 million data files and ten 8 MB manifest files took at most a dozen seconds). The majority of planning time is spent comparing against delete file metrics (determining whether delete files are associated with data files). The flame graph below, for a table with 1 million data files and 50,000 delete files, confirms this:
Therefore, file skipping matters less because the bottleneck is not there at all. Of course, I also tested the effect of file skipping. The conclusion: in the absence of delete files, the more partitions the data filter skips, the faster planning becomes. The following is the test data:
It seems that skipping half of the partitions roughly doubles performance. However, that test contained no delete files. When delete files are present, planning can be divided into two phases. The first phase scans all manifest files, skips those whose partition lower and upper bounds fall outside the filter range (this is where "rewrite manifests clustered by partition" comes into play), and reads the data-file manifest entries of the rest. The second phase compares all data-file manifest entries against all delete-file entries' metrics to find the associated ones. As mentioned earlier, this second phase is actually the most time-consuming.
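The two phases can be sketched as follows. This is a hypothetical model of the planning flow described above; the data structures and field names (`partition_bounds`, `entries`, `sequence`) are illustrative, not Amoro or Iceberg APIs.

```python
def plan(manifests, delete_entries, part_lo, part_hi):
    # Phase 1: skip whole manifests whose partition bounds fall outside the
    # filter (the benefit of clustering manifests by partition), then read
    # the data-file entries of the surviving manifests.
    data_entries = []
    for m in manifests:
        lo, hi = m["partition_bounds"]
        if hi < part_lo or lo > part_hi:
            continue  # manifest skipped: its entries are never read
        data_entries.extend(
            e for e in m["entries"] if part_lo <= e["partition"] <= part_hi
        )
    # Phase 2: compare every surviving data entry against every delete
    # entry's metrics to find associated delete files. This is the dominant
    # cost, and manifest-level skipping does nothing to reduce it per entry.
    associated = 0
    for e in data_entries:
        for d in delete_entries:
            if d["partition"] == e["partition"] and d["sequence"] >= e["sequence"]:
                associated += 1
    return data_entries, associated
```

Note that tighter partition bounds only shrink the work in phase 1; every data entry that survives still has to be checked against every delete entry in phase 2.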
Using the test data above, filtering half of the 10 million data files takes 19.36 seconds of planning. Adding just 10,000 delete files pushes the planning time into minutes. Taking 1 minute as an example: file skipping only affects the first phase, so the final optimization would reduce the total from 1 minute to roughly 50 seconds (the first phase dropping from 19.36 seconds to 9.32 seconds), rather than from 1 minute to 30 seconds. This is because the number of data-file manifest entries read remains the same, and just as many entries still have to be compared against delete entries; only the reading of some manifest files is skipped.
In summary, the importance and priority of this issue are relatively low.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.