
[Improvement] Add Support for Inspecting Tables in Datalake Formats like Iceberg

Open theoryxu opened this issue 1 year ago • 6 comments

What would you like to be improved?

In addition to regular table operations, datalake table formats offer various capabilities for inspecting tables.

For instance, Iceberg can display valid snapshots for a table or show a table's current file manifests.
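
For example, Iceberg exposes this information as metadata tables that can be queried with plain SQL (catalog and table names below are hypothetical):

spark-sql> select snapshot_id, committed_at, operation from my_catalog.db.events.snapshots;
spark-sql> select path, length, partition_spec_id from my_catalog.db.events.manifests;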

However, Gravitino catalogs currently lack this support, and there is no designated place for it in the General API hierarchy.

Incorporating this support into Gravitino would help users better manage their datalakes.

How should we improve?

No response

theoryxu · Aug 30 '24 02:08

This is something like querying metadata tables. It seems reasonable to support it in Gravitino. My concern is that producing the metadata may require too many resources. We could leverage K8s, but that would introduce extra complexity for Gravitino. @jerryshao @caican00 @shaofengshi @xunliu WDYT?

FANNG1 · Aug 30 '24 03:08

I think we can design the API to support querying metadata tables first; whether it is too costly depends on the underlying sources.

jerryshao · Aug 30 '24 05:08

I think we had best discuss the scope of this capability and the related scenarios first.

This issue is also related to this discussion, and IMO it seems reasonable to simply get the metadata of the metadata tables from Gravitino.

However, it seems unreasonable to read data from the metadata tables through Gravitino; that data should be read through the connector.

In addition, metadata tables should not support operations such as create, alter, and drop in Gravitino.

caican00 · Aug 30 '24 07:08

Snapshots are similar to columns in that they are part of the table metadata. Whether we support modification operations, such as changing snapshots, can be determined based on user requirements.

FANNG1 · Aug 31 '24 07:08

> Snapshots are similar to columns in that they are part of the table metadata. Whether we support modification operations, such as changing snapshots, can be determined based on user requirements.

Why not use Spark procedures directly?

If Gravitino supports modification operations on the metadata tables, for example deleting a snapshot, the corresponding data files also need to be deleted. If Gravitino is used to perform this operation, the REST API is likely to time out.
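
For reference, snapshot maintenance through Iceberg's built-in Spark procedures looks roughly like this (catalog and table names are hypothetical):

CALL my_catalog.system.expire_snapshots(
  table => 'db.events',
  older_than => TIMESTAMP '2024-08-01 00:00:00',
  retain_last => 10
);

Spark runs the resulting file deletions as a distributed job, which is exactly the kind of long-running work a synchronous REST call would struggle with.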

caican00 · Sep 01 '24 08:09

> Snapshots are similar to columns in that they are part of the table metadata. Whether we support modification operations, such as changing snapshots, can be determined based on user requirements.

> Why not use Spark procedures directly?

> If Gravitino supports modification operations on the metadata tables, for example deleting a snapshot, the corresponding data files also need to be deleted. If Gravitino is used to perform this operation, the REST API is likely to time out.

We could query table metadata with Spark or Flink, but that is heavyweight and hard for ordinary users. With a REST interface, it is simple for end users or other internal systems to query metadata such as snapshots.
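
As a rough sketch of the idea, an end user could then fetch snapshot metadata with a single HTTP call. The path segments up to the table follow Gravitino's existing REST layout; the trailing /snapshots segment is hypothetical and does not exist in Gravitino today:

curl http://127.0.0.1:8090/api/metalakes/my_metalake/catalogs/my_catalog/schemas/my_db/tables/my_table/snapshots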

FANNG1 · Sep 06 '24 09:09

Thank you for your efforts on the project! I'm following up on this issue to check whether there have been any recent updates or progress. Is there an estimated timeline or plan for implementing this feature?

Looking forward to your insights. cc: @FANNG1

dataageek · Dec 28 '24 16:12

I'm afraid I can't provide a timeline currently; it depends on the requirements of the community. Could you share your scenarios?

FANNG1 · Dec 29 '24 03:12

hi @FANNG1 This is my use case: after creating lakehouse-iceberg catalogs (JDBC) in a Gravitino metalake, I was able to pull the catalogs into Spark (v3.5.4) and query the Iceberg tables. However, when I tried to query the Iceberg metadata tables, I encountered the following error:

./spark-sql -v \
  --conf spark.plugins="org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin" \
  --conf spark.sql.gravitino.uri=http://127.0.0.1:8090 \
  --conf spark.sql.gravitino.metalake=datalake_dev \
  --conf spark.sql.gravitino.enableIcebergSupport=true

spark-sql (default)> set;
spark.master    local[*]
spark.plugins   org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin
spark.sql.catalog.iceberg_jdbc_catalog  org.apache.gravitino.spark.connector.iceberg.GravitinoIcebergCatalogSpark35
spark.sql.catalog.managed_iceberg_jdbc_catalog  org.apache.gravitino.spark.connector.iceberg.GravitinoIcebergCatalogSpark35
spark.sql.catalog.managed_iceberg_jdbc_catalog_2        org.apache.gravitino.spark.connector.iceberg.GravitinoIcebergCatalogSpark35
spark.sql.catalog.managed_iceberg_jdbc_catalog_3        org.apache.gravitino.spark.connector.iceberg.GravitinoIcebergCatalogSpark35
spark.sql.catalogImplementation hive
spark.sql.datetime.java8API.enabled     true
spark.sql.extensions    org.apache.gravitino.spark.connector.iceberg.extensions.GravitinoIcebergSparkSessionExtensions,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.gravitino.enableIcebergSupport        true
spark.sql.gravitino.metalake    datalake_dev
spark.sql.gravitino.uri http://127.0.0.1:8090
spark.sql.hive.version  2.3.9

spark-sql (default)> select * from managed_iceberg_jdbc_catalog_3.managed_db1.managed_table1;
a
b
c
Time taken: 13.246 seconds, Fetched 3 row(s)
spark-sql (default)> select * from managed_iceberg_jdbc_catalog_3.managed_db1.managed_table1.snapshots;
select * from managed_iceberg_jdbc_catalog_3.managed_db1.managed_table1.snapshots
[TABLE_OR_VIEW_NOT_FOUND] The table or view `managed_iceberg_jdbc_catalog_3`.`managed_db1`.`managed_table1`.`snapshots` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [managed_iceberg_jdbc_catalog_3, managed_db1, managed_table1, snapshots], [], false

Viewing the details of Iceberg metadata tables is crucial. It would be great if this could be prioritized.

dataageek · Dec 29 '24 12:12

Oh, you want to query the Iceberg metadata tables through the Spark connector. As a short-term solution, we could query Iceberg metadata tables through the underlying Iceberg catalog backend; when Gravitino supports querying metadata tables, we could switch the implementation to Gravitino. WDYT? @jerryshao @caican00

FANNG1 · Dec 30 '24 00:12

hi @FANNG1, thanks, I can use the underlying Iceberg JDBC catalog. It would be beneficial if the Gravitino Spark connector supported adding all registered JDBC catalogs (Iceberg jdbcCatalog) to the Spark session and making them available instead of the GravitinoIcebergCatalogSpark35 catalog. This way, I could avoid setting up the catalogs again or writing a new Spark plugin.
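
For context, setting up the underlying JDBC catalog again by hand looks roughly like this (catalog name and connection details are placeholders):

./spark-sql \
  --conf spark.sql.catalog.raw_iceberg=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.raw_iceberg.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog \
  --conf spark.sql.catalog.raw_iceberg.uri=jdbc:postgresql://... \
  --conf spark.sql.catalog.raw_iceberg.warehouse=... \
  --conf spark.sql.catalog.raw_iceberg.jdbc.user=... \
  --conf spark.sql.catalog.raw_iceberg.jdbc.password=...

spark-sql> select * from raw_iceberg.managed_db1.managed_table1.snapshots;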

dataageek · Dec 31 '24 01:12

> hi @FANNG1, thanks, I can use the underlying Iceberg JDBC catalog. It would be beneficial if the Gravitino Spark connector supported adding all registered JDBC catalogs (Iceberg jdbcCatalog) to the Spark session and making them available instead of the GravitinoIcebergCatalogSpark35 catalog. This way, I could avoid setting up the catalogs again or writing a new Spark plugin.

The Spark connector currently supports the JDBC catalog backend, but it doesn't support querying metadata tables. Is this blocking you?

FANNG1 · Jan 06 '25 00:01

hi @FANNG1. It's not a blocker for me. For the time being, I have created a new Spark plugin by extending GravitinoSparkPlugin and using Iceberg's SessionCatalog instead of GravitinoIcebergCatalogSpark35. Thanks

dataageek · Jan 07 '25 11:01