python-bigquery-pandas
Option to provide partition column and partition expiry time
While creating a new table using pandas, it would be nice if it could partition the table and set a partition expiry time. The Python BigQuery library already supports this:
from google.cloud import bigquery

client = bigquery.Client()
dataset_ref = client.dataset("my_dataset")
table_ref = dataset_ref.table("my_partitioned_table")

schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("post_abbr", "STRING"),
    bigquery.SchemaField("date", "DATE"),
]

table = bigquery.Table(table_ref, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="date",  # name of column to use for partitioning
    expiration_ms=7776000000,  # 90 days
)

table = client.create_table(table)
print(
    "Created table {}, partitioned on column {}".format(
        table.table_id, table.time_partitioning.field
    )
)
https://cloud.google.com/bigquery/docs/creating-column-partitions
I can create a pull request if people feel it's something they would find useful. At least in my work, we create a lot of monitoring tables on BigQuery using pandas and push data to them. These tables keep growing, and since we can't set partitioning once a table has already been created, they just become too big, and expensive.
I'm open to a pull request that adds this.
Ideally, I'd like to see to_gbq gain a configuration parameter that takes a load job JSON configuration (see the job configuration and load job configuration resources in the BigQuery REST reference).
Example:

pandas_gbq.to_gbq(
    df,
    "my_dataset.my_partitioned_table",  # destination table
    configuration={
        "load": {
            "timePartitioning": {
                "type": "DAY",
                "expirationMs": str(1000 * 60 * 60 * 24 * 30),  # 30 days
                "field": "my_timestamp_col",
            }
        }
    },
)
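For context, the configuration dict above mirrors the REST API's job configuration resource, so it maps directly onto a client-library job config. Here's a minimal sketch of that translation (an assumption about how to_gbq could handle it internally, not the actual implementation):

from google.cloud import bigquery

configuration = {
    "load": {
        "timePartitioning": {
            "type": "DAY",
            "expirationMs": str(1000 * 60 * 60 * 24 * 30),  # 30 days
            "field": "my_timestamp_col",
        }
    }
}

# from_api_repr parses the REST-style resource into a LoadJobConfig,
# which can then be passed to the client's load methods.
job_config = bigquery.LoadJobConfig.from_api_repr(configuration)
print(job_config.time_partitioning.field)  # my_timestamp_col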
One problem with the configuration proposal is that currently pandas-gbq creates tables with calls to create_table if the table doesn't exist, rather than letting the load job create it. I'd like to refactor to_gbq to avoid this (unnecessary, IMO) step. Open to PRs to do that refactoring, but if you'd like to extract options from configuration when creating the table, that might be simpler short-term.
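To make that concrete, here's a rough sketch (placeholder table name; not the library's actual code) of what letting the load job create the partitioned table looks like with google-cloud-bigquery, with no separate create_table call:

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
df = pd.DataFrame(
    {"my_timestamp_col": pd.to_datetime(["2021-01-01"]), "value": [1]}
)

job_config = bigquery.LoadJobConfig(
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="my_timestamp_col",
        expiration_ms=1000 * 60 * 60 * 24 * 30,  # 30 days
    ),
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# The load job creates the destination table (with partitioning applied)
# if it doesn't exist yet, so no explicit create_table call is needed.
load_job = client.load_table_from_dataframe(
    df, "my_dataset.my_partitioned_table", job_config=job_config
)
load_job.result()  # wait for the load to finish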
Yeah, I had something like the configuration thing in mind. Why does expirationMs need to be passed as a string, though?
Why does the expirationMs need to be passed as a string though?
It's a historical artifact of the BigQuery REST endpoint using an older JSON parsing implementation that only had JavaScript Number (floating point) available. Encoding it as a string allows the BigQuery REST endpoint to interpret the value as a 64-bit integer without loss of precision.
I believe an integer will be accepted, but you might lose precision for large values.
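A quick illustration of the precision concern (just standard floating-point behavior, nothing BigQuery-specific): IEEE-754 doubles, which is all those older JSON parsers' Number type offered, can represent integers exactly only up to 2**53.

# Above 2**53, distinct integers collapse to the same double value,
# so a large expirationMs sent as a bare number could silently change.
n = 2**53
print(float(n) == float(n + 1))  # True: the two values are indistinguishable
print(str(n))  # encoding as a string preserves the exact integer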
> I'd like to refactor to_gbq to avoid this (unnecessary, IMO) step. Open to PRs to do that refactoring, but if you'd like to extract options from configuration when creating the table, that might be simpler short-term.
Are you still open to a PR that does the refactoring? If so, I'd be interested to work on it.
I referenced this issue from https://github.com/googleapis/python-bigquery-pandas/issues/425, but will keep this open in case we don't decide to rely on the load job for table creation.
Any news about this feature? It could be very useful! Thanks!
Has this feature been implemented yet?