dataproc-templates

[Spike] [BQ Partitioning] Explore how xyz to BQ template(s) can allow Partitioning and Clustering

Open • shashank-google opened this issue 2 years ago • 3 comments

Option 1 - Test and verify: The user manually creates an empty table in BQ with partitioning and clustering. The xyz to BQ template then moves data into it. What happens if the sequence of columns in the source (Avro / JDBC / Hive, etc.) does not match the existing table in BigQuery?
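For illustration, the manual step in Option 1 could be a BigQuery `CREATE TABLE` statement with `PARTITION BY` and `CLUSTER BY` clauses. The sketch below only builds the DDL string; the helper name `build_create_table_ddl`, the table name, and the columns are hypothetical, and it assumes the partition column is a `DATE` column (a `TIMESTAMP` column would need an expression such as `DATE(col)` instead).

```python
MAX_CLUSTERING_COLUMNS = 4  # BigQuery allows at most four clustering columns


def build_create_table_ddl(table, columns, partition_column, cluster_columns):
    """Build BigQuery DDL for an empty partitioned and clustered table.

    columns is a list of (name, bigquery_type) tuples. The partition column
    is assumed to be of type DATE so it can be used directly in PARTITION BY.
    """
    names = [name for name, _ in columns]
    if partition_column not in names:
        raise ValueError(f"unknown partition column: {partition_column}")
    unknown = [c for c in cluster_columns if c not in names]
    if unknown:
        raise ValueError(f"unknown clustering columns: {unknown}")
    if len(cluster_columns) > MAX_CLUSTERING_COLUMNS:
        raise ValueError("too many clustering columns")

    column_lines = ",\n".join(f"  {name} {bq_type}" for name, bq_type in columns)
    return (
        f"CREATE TABLE `{table}` (\n{column_lines}\n)\n"
        f"PARTITION BY {partition_column}\n"
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


# Hypothetical table and columns, for illustration only.
print(
    build_create_table_ddl(
        "my_dataset.events",
        [("id", "INT64"), ("event_date", "DATE"), ("country", "STRING")],
        partition_column="event_date",
        cluster_columns=["country", "id"],
    )
)
```

The resulting statement can be run in the BigQuery console or with the `bq` CLI before the template is started.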

Option 2 - Explore: If the BQ table does not exist (or an overwrite flag is supplied), how can the template automatically determine clustering and partitioning? Look at the corresponding Dataflow templates for ideas.

shashank-google, Sep 13 '22 17:09

Tested the following templates for Option 1. The results are as follows:

  1. GCStoBQ template: When this template is run with the destination table and the source data (Parquet files) having a different sequence of columns, the data insertion faces no issues. Other data formats are being tested.
  2. JDBCtoBQ template: When this template is run with the destination table and the source table having a different sequence of columns, the data insertion faces no issues.
  3. HIVEtoBQ template: Testing in progress.
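The results above suggest that a column-order mismatch is not a problem for the tested cases. If a template ever needed to make the ordering explicit, one option would be to reorder the source columns to match the destination table by name before writing. The following is a minimal sketch under that assumption; the helper name `align_to_target_order` is hypothetical, and the match is case-insensitive because BigQuery column names are case-insensitive.

```python
def align_to_target_order(source_columns, target_columns):
    """Return the source column names reordered to follow the target table.

    Matching is by name, ignoring case. Raises ValueError if the source
    lacks a column that the target table expects.
    """
    by_lower = {name.lower(): name for name in source_columns}
    missing = [t for t in target_columns if t.lower() not in by_lower]
    if missing:
        raise ValueError(f"source is missing columns: {missing}")
    return [by_lower[t.lower()] for t in target_columns]


# With a PySpark DataFrame `df`, the result could be applied as:
#   ordered = align_to_target_order(df.columns, ["id", "event_date", "country"])
#   df = df.select(*ordered)
```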

PoulamiR1994, Dec 19 '22 06:12

Testing is in progress for the HIVEtoBQ template. For part 2 of the description (Option 2), we can use the partitioning and clustering options of the spark-bigquery connector. An additional check can be made on the partitioning field: given BigQuery's constraints, partitioning should be applied only when the field is a date, datetime, or similar type.
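A rough sketch of the check described above, in Python: it builds the write options `partitionField`, `partitionType`, and `clusteredFields` of the spark-bigquery connector, and skips partitioning when the chosen field is not a date-like type. The helper name `build_bq_write_options` and the set of accepted Spark types are assumptions for illustration; the exact types the connector accepts should be confirmed against the connector version in use.

```python
# Assumption: only these Spark SQL types are treated as partitionable here.
PARTITIONABLE_SPARK_TYPES = {"date", "timestamp"}
MAX_CLUSTERING_FIELDS = 4  # BigQuery allows at most four clustering columns


def build_bq_write_options(schema, partition_field=None, clustered_fields=(),
                           partition_type="DAY"):
    """Build spark-bigquery connector write options from a column->type map.

    schema can be built from a PySpark DataFrame with dict(df.dtypes).
    Partitioning options are added only if the partition field is date-like.
    """
    options = {}
    if partition_field is not None:
        spark_type = schema.get(partition_field, "").lower()
        if spark_type in PARTITIONABLE_SPARK_TYPES:
            options["partitionField"] = partition_field
            options["partitionType"] = partition_type
        # Otherwise partitioning is skipped and the table is written as-is.

    clustered_fields = list(clustered_fields)
    unknown = [f for f in clustered_fields if f not in schema]
    if unknown:
        raise ValueError(f"unknown clustering fields: {unknown}")
    if len(clustered_fields) > MAX_CLUSTERING_FIELDS:
        raise ValueError("too many clustering fields")
    if clustered_fields:
        options["clusteredFields"] = ",".join(clustered_fields)
    return options


# With a PySpark DataFrame `df`, the options could be applied as
# (bucket and table names are placeholders):
#   opts = build_bq_write_options(dict(df.dtypes), "event_date", ["country"])
#   (df.write.format("bigquery")
#      .options(**opts)
#      .option("temporaryGcsBucket", "my-temp-bucket")
#      .mode("overwrite")
#      .save("my_dataset.events"))
```

Silently skipping partitioning for a non-date field is one possible design; a template could instead fail fast so that the user notices the unsupported field.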

ritika-neema, Jan 31 '23 06:01

Dependent on child issues #631 and #632.

ritika-neema, Mar 28 '23 14:03