Iceberg performance is slow when querying tables with a large number of partitions.
Query engine
Spark 3.3.2, Iceberg 1.3.1
Question
I've got a table of about 11 billion rows, roughly 3 terabytes. The table currently has about 400,000 partitions, and the metadata file is 300 MB. I'm currently experiencing the following problems:
- When I query the table, no matter what type of query I submit, the SQL takes a long time to commit.
- When I need to retrieve a large range of partitions, the query performance of this table is very poor.
Any suggestions for my situation?
Have fewer partitions. Each partition means more file/S3 I/O.
You have 27,500 rows per partition, which is really small. Try to target at least a few million rows per partition, depending on row size.
Rusty
Rusty is right here: that's only 7.5 MB per partition. I would aim for at least 512 MB, maybe more, for such a large table.
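For anyone who wants to check the numbers quoted above, here is a rough back-of-the-envelope sketch in Python. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and data spread evenly across partitions, which a real table may not have. The function names are made up for illustration and are not part of Iceberg or Spark.

```python
import math


def partition_stats(total_rows, total_bytes, num_partitions):
    """Return (rows per partition, MB per partition), assuming an even spread."""
    rows_per_partition = total_rows / num_partitions
    mb_per_partition = total_bytes / num_partitions / 1e6  # decimal megabytes
    return rows_per_partition, mb_per_partition


def partitions_for_target(total_bytes, target_mb):
    """Return the partition count that gives roughly target_mb per partition."""
    return math.ceil(total_bytes / (target_mb * 1e6))


# Figures from the question: ~11 billion rows, ~3 TB, ~400,000 partitions.
rows, mb = partition_stats(11_000_000_000, 3 * 10**12, 400_000)
print(rows, mb)  # 27500.0 rows and 7.5 MB per partition

# Partition count that would hit the suggested 512 MB per partition.
print(partitions_for_target(3 * 10**12, 512))  # 5860
```

So under these assumptions, hitting 512 MB per partition means going from about 400,000 partitions down to somewhere around 6,000.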
Is this big-data or small-data technology? Let us say the table size is 300 TB. With 400K partitions, the average partition size is 750 MB, which looks normal to me. What about 3 PB tables? I know they exist. I think the problem of a large number of partitions and a large metadata size must eventually be addressed by the Iceberg community.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'