[SUPPORT] BQ synch tool not working with HUDI bundle jar

Open masthanmca opened this issue 1 year ago • 5 comments

Tips before filing an issue

  • Have you gone through our FAQs? yes

  • Join the mailing list to engage in conversations and get faster support at [email protected].

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

BQ sync is not working with the Hudi bundle jar. I wanted to enable BQ sync, using the manifest file, while writing data into a Hudi table.

To Reproduce

Steps to reproduce the behavior:

  1. Create a data frame with any schema.
  2. Use the options below for BQ sync, along with the other default Hudi configurations:

     hiveConfigs.put("org.apache.hudi.gcp.bigquery.BigQuerySyncTool", "true")
     hiveConfigs.put("hoodie.gcp.bigquery.sync.project_id", bqSyncProjectId)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.dataset_name", bqSyncDatasetName)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.table_name", hoodieHiveSyncTable)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.dataset_location", "us")
     hiveConfigs.put("hoodie.gcp.bigquery.sync.source_uri", bqSyncSourceUri)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.source_uri_prefix", bqSyncSourceUriPrefix)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.base_path", bqSyncBasePath)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.partition_fields", hoodieHiveSyncPartitionFields)
     hiveConfigs.put("hoodie.gcp.bigquery.sync.use_bq_manifest_file", "true")

  3. Write the data frame to the Hudi table:

     ds.write.format(HudiFormat).options(hoodieConfigs).options(hiveConfigs).mode(writeMode).save(location)

Expected behavior

A clear and concise description of what you expected to happen.

Environment Description

  • Hudi version : 0.14.0

  • Spark version : 3.3.2

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : GCS

  • Running on Docker? (yes/no) : no

Additional context

Add any other context about the problem here.

Stacktrace

Add the stacktrace of the error.

No error, but the external table is not created in BigQuery.

masthanmca avatar Feb 06 '24 12:02 masthanmca

@masthanmca Is this the first time you are facing this issue, or did it start after an upgrade?

Your configurations also look wrong. Where did you get these from, or which doc did you refer to? Can you refer to https://hudi.apache.org/docs/gcp_bigquery/

ad1happy2go avatar Feb 06 '24 14:02 ad1happy2go

Facing the same issue; it does not work with org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.1.

The Hudi write to the path works and Hive sync works, but BQ sync does not.

For now I have taken this route, controlled by a flag, of manually performing the BQ sync with BigQuerySyncTool after the dataframe.write:

https://github.com/apache/hudi/issues/9355#issuecomment-1696764242

abhishekshenoy avatar Feb 19 '24 04:02 abhishekshenoy

@abhishekshenoy @masthanmca That approach (https://github.com/apache/hudi/issues/9355#issuecomment-1696764242), i.e. calling BigQuerySyncTool, is the correct way of doing BQ sync with batch jobs.

The other way is to do this with HudiStreamer.

ad1happy2go avatar Feb 19 '24 12:02 ad1happy2go

@ad1happy2go @the-other-tim-brown

But shouldn't that be called internally when we provide the Hudi BQ configs and enable META_SYNC_ENABLED?

In my case we use df.write.options(hudiAndHiveAndBQConfigs).save(), and hudiAndHiveAndBQConfigs has both Hive and BQ related configs.

*But still only hive sync happens implicitly*.

Is it by design that, as part of our write function, we need to perform both of the following?

df.write.options(hudiAndHiveAndBQConfigs).save()
new BigQuerySyncTool(getBigQueryProps).syncHoodieTable()

abhishekshenoy avatar Feb 20 '24 04:02 abhishekshenoy
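The getBigQueryProps helper in the comment above is not shown in the thread. A minimal sketch of what such a helper might look like follows, assuming it returns a plain java.util.Properties populated with the hoodie.gcp.bigquery.sync.* keys from the original report. The object name, parameter names, and the choice of Properties as the return type are all hypothetical.

```scala
import java.util.Properties

object BigQueryProps {
  // Hypothetical stand-in for the getBigQueryProps helper referenced in the thread.
  // Every argument is a caller-supplied placeholder; only the property keys come
  // from the original report.
  def getBigQueryProps(
      projectId: String,
      datasetName: String,
      tableName: String,
      sourceUri: String,
      sourceUriPrefix: String,
      basePath: String,
      partitionFields: String): Properties = {
    val props = new Properties()
    props.setProperty("hoodie.gcp.bigquery.sync.project_id", projectId)
    props.setProperty("hoodie.gcp.bigquery.sync.dataset_name", datasetName)
    props.setProperty("hoodie.gcp.bigquery.sync.table_name", tableName)
    props.setProperty("hoodie.gcp.bigquery.sync.dataset_location", "us")
    props.setProperty("hoodie.gcp.bigquery.sync.source_uri", sourceUri)
    props.setProperty("hoodie.gcp.bigquery.sync.source_uri_prefix", sourceUriPrefix)
    props.setProperty("hoodie.gcp.bigquery.sync.base_path", basePath)
    props.setProperty("hoodie.gcp.bigquery.sync.partition_fields", partitionFields)
    props.setProperty("hoodie.gcp.bigquery.sync.use_bq_manifest_file", "true")
    props
  }
}
```

The result would then be handed to the sync tool after the write, as in the new BigQuerySyncTool(...).syncHoodieTable() call quoted above. That call needs the Hudi GCP classes on the classpath, and whether the constructor accepts a plain Properties object in a given Hudi version should be checked.
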

@masthanmca @abhishekshenoy I went through the code and found that we need to set both class names to run both meta syncs together. The default value of the property below is just Hive sync. I tried with Hudi version 0.14.1, and after the write and Hive sync completed, it also attempted the BigQuery sync.

"hoodie.meta.sync.client.tool.class" : "org.apache.hudi.hive.HiveSyncTool,org.apache.hudi.gcp.bigquery.BigQuerySyncTool"

ad1happy2go avatar Feb 22 '24 08:02 ad1happy2go
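As a rough illustration of the setting in the comment above, the sketch below places it next to the BigQuery options from the original report in a single options map. The object name and all placeholder values are hypothetical, and hoodie.datasource.meta.sync.enable is assumed to be the key behind META_SYNC_ENABLED; both should be checked against the docs for the Hudi version in use.

```scala
object MetaSyncOptions {
  // Hypothetical placeholder values; replace with real project, dataset and GCS paths.
  val basePath = "gs://my-bucket/my_table"

  val options: Map[String, String] = Map(
    // Assumed key for META_SYNC_ENABLED.
    "hoodie.datasource.meta.sync.enable" -> "true",
    // From the thread: list both tools so Hive sync and BigQuery sync both run.
    "hoodie.meta.sync.client.tool.class" ->
      "org.apache.hudi.hive.HiveSyncTool,org.apache.hudi.gcp.bigquery.BigQuerySyncTool",
    // BigQuery sync keys taken from the original report.
    "hoodie.gcp.bigquery.sync.project_id" -> "my-gcp-project",
    "hoodie.gcp.bigquery.sync.dataset_name" -> "my_dataset",
    "hoodie.gcp.bigquery.sync.table_name" -> "my_table",
    "hoodie.gcp.bigquery.sync.dataset_location" -> "us",
    "hoodie.gcp.bigquery.sync.source_uri" -> s"$basePath/dt=*",
    "hoodie.gcp.bigquery.sync.source_uri_prefix" -> s"$basePath/",
    "hoodie.gcp.bigquery.sync.base_path" -> basePath,
    "hoodie.gcp.bigquery.sync.partition_fields" -> "dt",
    "hoodie.gcp.bigquery.sync.use_bq_manifest_file" -> "true"
  )
}
```

A map like this would be merged with the usual Hudi write and Hive sync options and passed through .options(...) on the DataFrame writer, as in step 3 of the original report.
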

@masthanmca Closing out this issue as I confirmed it works. Please reopen in case you still see this issue.

ad1happy2go avatar Feb 27 '24 15:02 ad1happy2go