amoro [Feature]: Metadata is built entirely around Iceberg, which is too restrictive; suggest using Databricks Unity Catalog as a reference

Open melin opened this issue 1 year ago • 3 comments

Description

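Tables created on the Unity Catalog platform support only Delta. It connects to HMS through a fixed hive_catalog, and by extending the catalog it can bring in other data sources such as JDBC, enabling cross-source SQL computation. https://docs.databricks.com/release-notes/unity-catalog/20220825.html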

Use case/motivation

No response

Describe the solution

No response

Related issues

No response

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Sep 09 '22 09:09 melin

@melin Thanks a lot for proposing this feature. Could you please add more information to describe what you want to propose, like:

Which processing engine (Flink/Spark/Trino) do you want to support?
Which table format (Iceberg/Delta/MySQL) do you want to support?
What is your application scenario?

You can also join our WeChat user group and discuss this feature with us.

Sep 13 '22 05:09 zhoujinsong

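Arctic builds its data lake platform on Iceberg and does not use HMS. If legacy workloads already run on Hive, how is the existing Hive data kept compatible? Migrating the Hive data into Arctic and Iceberg storage is not realistic. Unity Catalog offers one solution: by introducing hive_catalog, the legacy Hive data stays accessible. External relational databases and traditional data warehouses (MySQL, Oracle, Gauss, Greenplum) are usually accessed through APIs, which is costly for data analysts, so many teams sync the data into Hive instead, which adds further cost. Consider implementing cross-source data access in Spark SQL: first configure the data source, then choose which table metadata to import into Arctic. The Spark driver dynamically registers the Spark catalog:

// Access tables through a three-level namespace. select * from superior.superior_test.meta_job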
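
The registration code itself did not survive in the comment above, so the following is only a minimal sketch of what "dynamically registering a Spark catalog" could look like. It assumes Spark 3.x's built-in `JDBCTableCatalog` and the standard `spark.sql.catalog.<name>` settings. The catalog name `superior` comes from the query above; the JDBC URL, driver, and user are hypothetical placeholders, and the helper names `jdbc_catalog_conf` and `register_catalog` are made up for illustration. Nothing here is Arctic's actual API.

```python
# Sketch only: builds the Spark SQL settings that register a JDBC-backed catalog.
# The URL, driver, and user values below are hypothetical placeholders.

# Built-in Spark 3.x catalog implementation backed by a JDBC connection.
JDBC_CATALOG_CLASS = (
    "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog"
)


def jdbc_catalog_conf(name, url, driver, user=None, password=None):
    """Return the spark.sql.catalog.* settings for one JDBC catalog."""
    prefix = f"spark.sql.catalog.{name}"
    conf = {
        prefix: JDBC_CATALOG_CLASS,
        f"{prefix}.url": url,
        f"{prefix}.driver": driver,
    }
    # Credentials are optional; only emit the keys that were supplied.
    if user is not None:
        conf[f"{prefix}.user"] = user
    if password is not None:
        conf[f"{prefix}.password"] = password
    return conf


def register_catalog(spark, conf):
    """Apply the settings to a running session (any object with conf.set)."""
    for key, value in conf.items():
        spark.conf.set(key, value)


conf = jdbc_catalog_conf(
    name="superior",
    url="jdbc:mysql://db.example.com:3306",  # hypothetical host
    driver="com.mysql.cj.jdbc.Driver",
    user="analyst",  # hypothetical user
)
```

With a real `SparkSession` named `spark`, calling `register_catalog(spark, conf)` before the catalog is first referenced should make `spark.sql("select * from superior.superior_test.meta_job")` resolve `superior` as the catalog, `superior_test` as the schema, and `meta_job` as the table. In the proposal, the connection details would come from the data source configured in Arctic rather than being hard-coded.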

Sep 13 '22 07:09 melin

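@melin Hello, and thank you very much for the reply! The scenarios and problems you describe are indeed very representative, and the Arctic development team has run into similar issues. Our current thinking is:

Starting with V0.3.1, Arctic supports Hive tables: you can import an existing Hive cluster and upgrade its existing Hive tables so that Arctic manages them. These tables also gain the new features Arctic provides, namely unified stream and batch processing and real-time updates. For details, see https://arctic.netease.com/ch/table-format/hive-format/.
Federated queries across multiple catalogs are indeed a very useful feature. Today, however, they are generally implemented on the compute-engine side (for example Spark/Flink). In that setup Arctic can act as the unified metadata center: information about these data sources is registered in AMS, and Arctic then provides a unified connector for accessing them. This feature is still under discussion for now.

We would also like to invite you again to join Arctic's WeChat user group to discuss LakeHouse topics together.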

Sep 13 '22 07:09 zhoujinsong

Closed as not updated for a long time.

Nov 30 '23 07:11 zhoujinsong
