[Improvement]: Table partition files list performance issue

Open link3280 opened this issue 1 year ago • 9 comments

Search before asking

  • [X] I have searched in the issues and found no similar issues.

What would you like to be improved?

Currently, the table partition files API can hang for a very long time if the table has a lot of files (e.g. over 100K). The root cause is that AMS reads all file entries to compute the partitions, instead of filtering the entries by partition.

This may be due to a limitation: the Iceberg Java API is not able to read the partition metadata table directly. Hopefully we can find a workaround or push the Iceberg community to solve this problem.
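
To illustrate the cost pattern described above, here is a minimal sketch in plain Python. It does not use the Iceberg or AMS APIs; `FileEntry` and `partitions_from_entries` are hypothetical names made up for this example. It only shows that deriving partitions from file entries visits every entry, so the work grows with the number of files rather than the number of partitions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for one file entry in the table metadata."""
    partition: str
    path: str


def partitions_from_entries(entries):
    """Derive per-partition file counts by visiting every file entry.

    Returns the counts and the number of entries visited, to make the
    cost explicit: it scales with files, not with partitions.
    """
    counts = Counter()
    visited = 0
    for entry in entries:
        counts[entry.partition] += 1
        visited += 1
    return dict(counts), visited


# Assumed workload: 100 partitions x 1,000 files each.
entries = [
    FileEntry(partition=f"dt={p:03d}", path=f"/data/dt={p:03d}/{i}.parquet")
    for p in range(100)
    for i in range(1000)
]
counts, visited = partitions_from_entries(entries)
```

With this assumed workload, all 100,000 entries are visited just to list 100 partitions.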

How should we improve?

No response

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

Subtasks

No response

Code of Conduct

link3280 avatar Mar 13 '24 11:03 link3280

I propose aligning PartitionBaseInfo with the Iceberg partition metadata table, which contains the following columns:

+-------------------------------+--------------------+--------------------------------------------+
|           col_name            |     data_type      |                  comment                   |
+-------------------------------+--------------------+--------------------------------------------+
| partition                     | struct<dt:string>  |                                            |
| spec_id                       | int                |                                            |
| record_count                  | bigint             | Count of records in data files             |
| file_count                    | int                | Count of data files                        |
| position_delete_record_count  | bigint             | Count of records in position delete files  |
| position_delete_file_count    | int                | Count of position delete files             |
| equality_delete_record_count  | bigint             | Count of records in equality delete files  |
| equality_delete_file_count    | int                | Count of equality delete files             |
+-------------------------------+--------------------+--------------------------------------------+

That would fix the performance issue because we would no longer have to iterate over all the entries to count files. For large tables whose partitions each contain about 1k files, the number of items processed would drop from millions (one per file entry) to thousands (one per partition).

However, the downside is that we would have to drop the commit time and the storage size at the partition level, which are calculated from the entries.
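
As a rough sketch of the proposal, the Python below models one row of the partition metadata table using the column names from the table above, and builds a partition listing from those pre-aggregated rows. The thread does not show the actual fields of PartitionBaseInfo, so `PartitionRow`, `total_file_count`, `list_partitions`, and the output keys are hypothetical. Commit time and storage size are deliberately absent, matching the downside noted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionRow:
    """One pre-aggregated row, mirroring the columns listed above."""
    partition: str  # struct<dt:string> in the table, flattened to a string here
    spec_id: int
    record_count: int
    file_count: int
    position_delete_record_count: int
    position_delete_file_count: int
    equality_delete_record_count: int
    equality_delete_file_count: int


def total_file_count(row):
    """Data files plus both kinds of delete files for one partition."""
    return (
        row.file_count
        + row.position_delete_file_count
        + row.equality_delete_file_count
    )


def list_partitions(rows):
    """Build the listing from one row per partition (no per-file scan)."""
    return [
        {
            "partition": r.partition,
            "spec_id": r.spec_id,
            "file_count": total_file_count(r),
        }
        for r in rows
    ]


# Assumed workload: 1,000 partitions with 1,000 data files each.
rows = [
    PartitionRow(f"dt={p:04d}", 0, 50_000, 1000, 0, 0, 0, 0)
    for p in range(1000)
]
listing = list_partitions(rows)
rows_read = len(rows)
entries_scanned_before = sum(total_file_count(r) for r in rows)
```

Under these assumed numbers, the listing reads 1,000 rows where an entry-based scan would have visited 1,000,000 file entries.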

@majin1102 @zhoujinsong @baiyangtx WDYT?

link3280 avatar Apr 11 '24 07:04 link3280

@link3280 Perhaps we can get it from the Iceberg metadata. This information is stored there as of the latest Iceberg release. https://github.com/apache/iceberg/pull/8502

huyuanfeng2018 avatar Apr 12 '24 06:04 huyuanfeng2018

> @link3280 Perhaps we can get it from the Iceberg metadata. This information is stored there as of the latest Iceberg release. apache/iceberg#8502

Cool! Then we could still keep the partition storage size.

link3280 avatar Apr 12 '24 06:04 link3280

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Oct 10 '24 00:10 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Oct 25 '24 00:10 github-actions[bot]

As apache/iceberg#8502 was released in Iceberg 1.5 and we currently use Iceberg 1.4.3, this may depend on #3084.

klion26 avatar Oct 31 '24 11:10 klion26

This issue has been unblocked as #3084 has been merged.

klion26 avatar Jan 22 '25 01:01 klion26

Hi @link3280, are you still working on this issue?

klion26 avatar Mar 20 '25 07:03 klion26

> Hi @link3280, are you still working on this issue?

I'm afraid not.

link3280 avatar Mar 23 '25 04:03 link3280

take

turboFei avatar Dec 09 '25 17:12 turboFei