dataform
dataform copied to clipboard
BigQuery partitioning improvements, date by hour/month/year and integer range partitions
Dataform currently supports the default daily partitioning in BigQuery and clustering. I'd like to have support for more granularity in date partitioning and also integer range partitions.
Suggestion for syntax:
Date partitioning
config {
type: "table"
bigquery: {
partitionBy: "DATE(ts)",
granularity: "month"
clusterBy: ["name", "revenue"]
}
Integer range partitioning
config {
type: "table"
bigquery: {
partitionBy: "INT(category)",
range: {
end: "100",
interval: "10",
start: "0"
}
clusterBy: ["name", "revenue"]
}
You're in luck - it already works today!
Hourly
bigquery: {
partitionBy: "DATETIME_TRUNC(<timestamp_column>, HOUR)"
}
Monthly
bigquery: {
partitionBy: "DATE_TRUNC(<date_column>, MONTH)"
}
Integer
bigquery: {
partitionBy: "RANGE_BUCKET(<integer_column>, GENERATE_ARRAY(0, 1000000, 1000))"
}
Have a look at how BigQuery determines the partitioning to use based on this: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#partition_expression
Perfect! Maybe add these examples to the documentation so it is easy to figure out how to do it?
This is now in the docs! https://cloud.google.com/dataform/docs/partitions-clusters