
[Improvement]: reduce the impact of the listTables method in a unified catalog on the Hive Metastore (HMS)

Open Aireed opened this issue 1 year ago • 2 comments

Search before asking

  • [X] I have searched in the issues and found no similar issues.

What would you like to be improved?

  • metastore: hive
  • catalog: UnifiedCatalog
  • method: listTables

problem:

  • When listTables is called on a mixed-hive or iceberg catalog, it first calls the HMS getAllTables method to retrieve all table names, then calls getTableObjectsByName to fetch the Table objects for those names, and finally decides whether each table is a mixed-hive or iceberg table by checking its properties or getSd().
  • Paimon first calls getAllTables to retrieve all table names, then calls getTable for each name to fetch its Table object and decide whether it is a Paimon table.

As a result, if the Unified Catalog supports mixed-hive, iceberg, and paimon at the same time, a single listTables call invokes getAllTables three times, getTableObjectsByName twice (a relatively heavy operation), and getTable once per table.
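To make the call counts above concrete, here is a minimal sketch that replays the described flow against an in-memory stand-in. `CountingMetastore`, `CurrentListTablesFlow`, and the single "format" string per table are hypothetical simplifications for illustration; the real HMS client returns Table objects and the real catalogs inspect properties or getSd().

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for HMS: it only stores a format marker per table
// and counts how often each of the three calls named in the issue is made.
class CountingMetastore {
    private final Map<String, String> tables = new LinkedHashMap<>();
    int getAllTablesCalls = 0;
    int getTableObjectsByNameCalls = 0;
    int getTableCalls = 0;

    void addTable(String name, String format) {
        tables.put(name, format);
    }

    List<String> getAllTables() {
        getAllTablesCalls++;
        return new ArrayList<>(tables.keySet());
    }

    // One batched lookup: returns name -> format for the requested names.
    Map<String, String> getTableObjectsByName(List<String> names) {
        getTableObjectsByNameCalls++;
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : names) {
            result.put(name, tables.get(name));
        }
        return result;
    }

    String getTable(String name) {
        getTableCalls++;
        return tables.get(name);
    }
}

class CurrentListTablesFlow {
    // mixed-hive / iceberg style: getAllTables, then one batched lookup.
    static List<String> listBatched(CountingMetastore hms, String format) {
        List<String> matched = new ArrayList<>();
        Map<String, String> objects = hms.getTableObjectsByName(hms.getAllTables());
        for (Map.Entry<String, String> entry : objects.entrySet()) {
            if (format.equals(entry.getValue())) {
                matched.add(entry.getKey());
            }
        }
        return matched;
    }

    // paimon style: getAllTables, then one getTable per table name.
    static List<String> listOneByOne(CountingMetastore hms, String format) {
        List<String> matched = new ArrayList<>();
        for (String name : hms.getAllTables()) {
            if (format.equals(hms.getTable(name))) {
                matched.add(name);
            }
        }
        return matched;
    }

    // Runs the three per-format listings against four sample tables.
    static CountingMetastore runAllFormats() {
        CountingMetastore hms = new CountingMetastore();
        hms.addTable("orders", "MIXED_HIVE");
        hms.addTable("events", "ICEBERG");
        hms.addTable("logs", "PAIMON");
        hms.addTable("legacy", "HIVE");
        listBatched(hms, "MIXED_HIVE");
        listBatched(hms, "ICEBERG");
        listOneByOne(hms, "PAIMON");
        return hms;
    }

    public static void main(String[] args) {
        CountingMetastore hms = runAllFormats();
        System.out.println("getAllTables: " + hms.getAllTablesCalls);
        System.out.println("getTableObjectsByName: " + hms.getTableObjectsByNameCalls);
        System.out.println("getTable: " + hms.getTableCalls);
    }
}
```

With the four sample tables, one unified listing costs three getAllTables calls, two getTableObjectsByName calls, and four getTable calls; the getTable count grows with the number of tables in the database.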

Besides being called by the frontend to display the table list, listTables is also called by the logic that synchronizes with the external catalog (every 3 minutes by default).

How should we improve?

For the case where the metastore is Hive, we can optimize by calling getAllTables and getTableObjectsByName only once to retrieve all tables and their types.

  1. Define an interface that supports listing all tables and their formats.
  2. MixedCatalog implements this interface.
  3. MixedHiveCatalog implements this interface.
  4. When UnifiedCatalog::listTables is called, we first check whether any of the supported FormatCatalogs implements this interface. If so, we use the table list it returns instead of calling listTables on each FormatCatalog.
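The four steps above could be sketched roughly as follows. All names here (`SimpleFormatCatalog`, `TableFormatLister`, `listTablesWithFormat`, `UnifiedListing`, and the two sample classes) are hypothetical and are not Amoro's actual interfaces or signatures; the sketch only shows the dispatch idea of preferring a catalog that can report every table and its format in one pass.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a per-format catalog (assumption, not the real API).
interface SimpleFormatCatalog {
    String format();
    List<String> listTables(String database);
}

// Step 1: hypothetical interface that lists all tables together with their formats.
interface TableFormatLister {
    Map<String, String> listTablesWithFormat(String database);
}

// Sample catalog that only knows the tables of its own format.
class FixedFormatCatalog implements SimpleFormatCatalog {
    private final String format;
    private final List<String> tables;
    int listTablesCalls = 0;

    FixedFormatCatalog(String format, List<String> tables) {
        this.format = format;
        this.tables = tables;
    }

    public String format() {
        return format;
    }

    public List<String> listTables(String database) {
        listTablesCalls++;
        return tables;
    }
}

// Steps 2 and 3: a catalog that can classify every table in a single pass.
class SinglePassCatalog extends FixedFormatCatalog implements TableFormatLister {
    private final Map<String, String> tableFormats;

    SinglePassCatalog(String format, Map<String, String> tableFormats) {
        super(format, new ArrayList<>(tableFormats.keySet()));
        this.tableFormats = tableFormats;
    }

    public Map<String, String> listTablesWithFormat(String database) {
        return tableFormats;
    }
}

class UnifiedListing {
    // Step 4: prefer a catalog implementing TableFormatLister; otherwise fall
    // back to asking every format catalog separately and merging the results.
    static Map<String, String> listTables(List<SimpleFormatCatalog> catalogs, String database) {
        for (SimpleFormatCatalog catalog : catalogs) {
            if (catalog instanceof TableFormatLister) {
                return ((TableFormatLister) catalog).listTablesWithFormat(database);
            }
        }
        Map<String, String> merged = new LinkedHashMap<>();
        for (SimpleFormatCatalog catalog : catalogs) {
            for (String table : catalog.listTables(database)) {
                merged.putIfAbsent(table, catalog.format());
            }
        }
        return merged;
    }
}
```

In this sketch the fallback keeps the first format that claims a table; how conflicts between formats should really be resolved is left open here.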

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

Subtasks

No response

Code of Conduct

Aireed avatar Jul 01 '24 09:07 Aireed

@baiyangtx @zhoujinsong WDYT?

Aireed avatar Jul 01 '24 09:07 Aireed

The implementation is roughly like this. (The code is quite old; ArcticCatalog has since been replaced with MixedHiveCatalog.) [image]

Aireed avatar Jul 01 '24 09:07 Aireed

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Dec 29 '24 00:12 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Jan 13 '25 00:01 github-actions[bot]

Do we still need to improve this? It seems that CommonUnifiedCatalog#listTables calls each format's listTables, and only queries the external catalog once for all the tables (listTables).

klion26 avatar Jan 13 '25 01:01 klion26