dataform icon indicating copy to clipboard operation
dataform copied to clipboard

BigQuery partitioning improvements, date by hour/month/year and integer range partitions

Open db-magnus opened this issue 4 years ago • 2 comments

Dataform currently supports the default daily partitioning in BigQuery and clustering. I'd like to have support for more granularity in date partitioning and also integer range partitions.

Suggestion for syntax:

Date partitioning

config {
  type: "table"
  bigquery: {
    partitionBy: "DATE(ts)",
    granularity: "month"
    clusterBy: ["name", "revenue"]
  }

Integer range partitioning

config {
  type: "table"
  bigquery: {
    partitionBy: "INT(category)",
    range: {
      end: "100",
      interval: "10",
      start: "0"
    }
    clusterBy: ["name", "revenue"]
  }

db-magnus avatar Dec 16 '20 10:12 db-magnus

You're in luck - it already works today!

Hourly

bigquery: {
    partitionBy: "DATETIME_TRUNC(<timestamp_column>, HOUR)"
}

Monthly

bigquery: {
    partitionBy: "DATE_TRUNC(<date_column>, MONTH)"
}

Integer

bigquery: {
    partitionBy: "RANGE_BUCKET(<integer_column>, GENERATE_ARRAY(0, 1000000, 1000))"
}

Have a look at how BigQuery determines the partitioning to use based on this: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#partition_expression

astorrs avatar Dec 17 '20 06:12 astorrs

Perfect! Maybe add these examples to the documentation so it is easy to figure out how to do it?

db-magnus avatar Dec 17 '20 08:12 db-magnus

This is now in the docs! https://cloud.google.com/dataform/docs/partitions-clusters

image

Ekrekr avatar Mar 26 '24 13:03 Ekrekr