[HUDI-9528] Support database and table name for Glue/DataHub catalog
Change Logs
Added separate configs for glue and datahub to set database/table name in sync client.
Impact
Hudi database/table name can be configured for glue/datahub catalog separately.
Risk level (write none, low medium or high below)
If medium or high, explain what verification was done to mitigate the risks.
Documentation Update
hoodie.datasource.meta.sync.glue.database_name: "database"
hoodie.datasource.meta.sync.glue.table_name: "table"
hoodie.meta.sync.datahub.database.name: "database"
hoodie.meta.sync.datahub.table.name: "table"
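A minimal sketch of how these four keys might be collected into writer properties. The keys are copied from the list above as proposed in this PR; the class name, helper method, and example values are hypothetical, and the keys in the final merged change may differ.

```java
import java.util.Properties;

public class CatalogSyncNames {
    // Builds a Properties object carrying the catalog-specific names.
    // Keys are the ones listed in this PR description; the helper itself is illustrative only.
    static Properties glueAndDatahubNames(String database, String table) {
        Properties props = new Properties();
        props.setProperty("hoodie.datasource.meta.sync.glue.database_name", database);
        props.setProperty("hoodie.datasource.meta.sync.glue.table_name", table);
        props.setProperty("hoodie.meta.sync.datahub.database.name", database);
        props.setProperty("hoodie.meta.sync.datahub.table.name", table);
        return props;
    }

    public static void main(String[] args) {
        // Example values are placeholders, not defaults.
        Properties props = glueAndDatahubNames("analytics", "trips");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Glue and DataHub could equally be given different names; the same values are used here only to keep the sketch short.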
Contributor's checklist
- [ ] Read through contributor's guide
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
https://issues.apache.org/jira/browse/HUDI-9528 @vineethNaroju Use this JIRA for your PR.
@hudi-bot run azure
Added separate configs for glue and datahub to set database/table name in sync client.
@vineethNaroju can you explain why we need a new options key for the db/table name even though the existing options already work?
@danny0405 We already support database/table names that differ from the Hudi table name for other catalogs/metastores, BigQuery for example. The restriction for users right now is that for Glue/DataHub the table always gets created with hoodie.table.name.
https://github.com/apache/hudi/blob/master/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L83
it always gets created with hoodie.table.name
@vinishjail97 I have no access to the link. I see there are already some options like hoodie.gcp.bigquery.sync.table_name in the BigQuerySyncConfig on master: https://github.com/apache/hudi/blob/f1faabe2f577d7f33fdb0194a490e7c18b22546c/hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySyncConfig.java#L83
Yes, we want to have similar config for glue and datahub catalog.
That's okay. Can we add similar inference logic directly in the config option, so that we only need to change the specific sync tool:
public static final ConfigProperty<String> BIGQUERY_SYNC_TABLE_NAME = ConfigProperty
    .key("hoodie.gcp.bigquery.sync.table_name")
    .noDefaultValue()
    .withInferFunction(cfg -> Option.ofNullable(cfg.getString(HOODIE_TABLE_NAME_KEY))
        .or(() -> Option.ofNullable(cfg.getString(HOODIE_WRITE_TABLE_NAME_KEY))))
    .markAdvanced()
    .withDocumentation("Name of the target table in BigQuery");
Yes, I agree. We should have inference logic. If the catalog-specific db and table names are set, we take them from there. If not, we should fall back to the generic db and table name. I will work on addressing the feedback.
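The fallback described here can be sketched in plain Java, without Hudi's ConfigProperty or Option classes. This is only an illustration under stated assumptions: the class and method names are hypothetical, java.util.Optional stands in for Hudi's Option, and treating hoodie.table.name as the single generic fallback key is an assumption rather than the exact chain used in the patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class CatalogNameFallback {
    // Catalog-specific key as proposed in this PR's description.
    static final String GLUE_TABLE_NAME = "hoodie.datasource.meta.sync.glue.table_name";
    // Generic key mentioned in the discussion; using it as the only fallback is an assumption.
    static final String GENERIC_TABLE_NAME = "hoodie.table.name";

    // Returns the catalog-specific value when it is set, otherwise the generic value.
    static Optional<String> resolve(Map<String, String> props, String catalogKey, String genericKey) {
        return Optional.ofNullable(props.get(catalogKey))
                .or(() -> Optional.ofNullable(props.get(genericKey)));
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put(GENERIC_TABLE_NAME, "trips");
        // No Glue override yet, so the generic name is used: prints "trips".
        System.out.println(resolve(props, GLUE_TABLE_NAME, GENERIC_TABLE_NAME).orElse("<unset>"));

        props.put(GLUE_TABLE_NAME, "trips_glue");
        // The Glue-specific name now takes precedence: prints "trips_glue".
        System.out.println(resolve(props, GLUE_TABLE_NAME, GENERIC_TABLE_NAME).orElse("<unset>"));
    }
}
```

Keeping the fallback inside the config lookup means existing users who set only the generic name see no behavior change.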
hey @danny0405 : patch is ready for review.
@danny0405 : can you review this patch?
CI report:
- 680de0da1a0fff4b1313e71dc3a0462402837765 Azure: SUCCESS
Bot commands
@hudi-bot supports the following commands:
- @hudi-bot run azure: re-run the last Azure build
Landed as part of https://github.com/apache/hudi/pull/13785