
add `configuration` argument to `to_gbq`

tswast opened this issue 4 years ago

The `to_gbq` function should take a `configuration` argument representing a BigQuery `JobConfiguration` REST API resource.

This would make it consistent with the `read_gbq` function.
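
For reference, a rough sketch of what that could look like from the caller's side. The `configuration` argument below is hypothetical (it does not exist on `to_gbq` yet); the nested keys follow the load section of the BigQuery `JobConfiguration` REST resource, mirroring how `read_gbq` already accepts a query job configuration:

import pandas as pd
import pandas_gbq

df = pd.DataFrame({"date": pd.to_datetime(["2021-11-17"]), "value": [1]})

# Hypothetical usage of the proposed argument; the nested keys are the
# REST field names from JobConfigurationLoad.
pandas_gbq.to_gbq(
    df,
    "my_dataset.my_table",
    project_id="my-project",
    configuration={
        "load": {
            "timePartitioning": {"type": "DAY", "field": "date"},
            "clustering": {"fields": ["value"]},
        }
    },
)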

Context

Options for table creation / schema updates

  • Partitioning and Clustering: https://github.com/googleapis/python-bigquery-pandas/issues/395
  • Schema update options: https://github.com/googleapis/python-bigquery-pandas/issues/107
  • Partition expiration time: https://github.com/googleapis/google-cloud-python/issues/14493

I believe these would require table creation to be done by the load job instead of a separate create-table step (partitioning in particular, since it must be configured at creation time). TBD what this would look like if we add support for the BigQuery Storage Write API or the (legacy) streaming API.
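
As a point of comparison, a minimal sketch of that load-job approach with the google-cloud-bigquery client (table and column names are placeholders): the load job itself creates the partitioned, clustered table, with no separate create_table call.

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
df = pd.DataFrame({"date": pd.to_datetime(["2021-11-17"]), "customer_id": ["a"]})

# The load job creates the table with partitioning and clustering applied,
# so no separate create_table step is needed.
job_config = bigquery.LoadJobConfig(
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="date",
    ),
    clustering_fields=["customer_id"],
)
client.load_table_from_dataframe(
    df, "my-project.my_dataset.my_table", job_config=job_config
).result()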

Options for file loading

  • Custom NULL marker: https://github.com/googleapis/python-bigquery-pandas/issues/366 -- I believe this would also require an update to the pandas CSV write configuration, though (see the sketch below).
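
A minimal sketch of that pairing with the google-cloud-bigquery client today, assuming a placeholder marker and table name; the CSV written by pandas and the load job have to agree on the marker, which is why the CSV write configuration also needs updating:

import io

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()
df = pd.DataFrame({"name": ["alice", None]})

# Write NULLs with an explicit marker and tell the load job about it.
csv_file = io.BytesIO(df.to_csv(index=False, na_rep="\\N").encode("utf-8"))

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    null_marker="\\N",
    autodetect=True,
)
client.load_table_from_file(
    csv_file, "my-project.my_dataset.my_table", job_config=job_config
).result()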

tswast commented on Nov 17, 2021

Hello @tswast @chalmerlowe,

would you accept external contributions for this feature? My use case mainly involves clustering / partitioning, but I'm happy to help with this PR.

Currently, I use this function to create a table:

from google.cloud import bigquery


def create_partitioned_table(client, project_id, dataset_name, table_name, schema):
    # Set table_id to the ID of the table to create.
    table_id = "{}.{}.{}".format(project_id, dataset_name, table_name)

    # Construct a full Table object to send to the API with partitioning.
    table = bigquery.Table(table_id, schema=schema)

    # Partition by day on the 'date' field.
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="date",
    )

    # Send the table to the API for creation.
    table = client.create_table(table)  # Make an API request.

    print(
        "Created partitioned table {}.{}.{} on column 'date'".format(
            table.project, table.dataset_id, table.table_id
        )
    )
    return table

And then I load data using the pandas_gbq library. It would be nice to have everything built in.
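
For completeness, the second step with pandas_gbq currently looks roughly like this (placeholder names; the table above must already exist so the append picks up its partitioning):

import pandas as pd
import pandas_gbq

df = pd.DataFrame({"date": pd.to_datetime(["2025-08-30"]), "value": [42]})

# Append into the table created above; to_gbq itself cannot set
# partitioning or clustering today, which is the gap this issue covers.
pandas_gbq.to_gbq(
    df,
    "my_dataset.my_table",
    project_id="my-project",
    if_exists="append",
)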

Let me know if you'd be open to it and I can work on the PR in October.

Thank you!

antoineeripret commented on Aug 30, 2025

Would love a contribution. For clustering columns in particular, let's mimic the interface that `bigframes.pandas.DataFrame.to_gbq` came up with:

clustering_columns: typing.Union[
        pandas.core.indexes.base.Index, typing.Iterable[typing.Hashable]
    ] = (),

clustering_columns Union[pd.Index, Iterable[Hashable]], default ()

Specifies the columns for clustering in the BigQuery table. The order of columns in this list is significant for clustering hierarchy. Index columns may be included in clustering if the index parameter is set to True, and their names are specified in this. These index columns, if included, precede DataFrame columns in the clustering order. The clustering order within the Index/DataFrame columns follows the order specified in clustering_columns.

https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.dataframe.DataFrame#bigframes_dataframe_DataFrame_to_gbq
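
So on the pandas-gbq side, the call could eventually look something like this (hypothetical: a clustering_columns parameter does not exist on pandas_gbq.to_gbq yet; it is what the proposed change would add):

import pandas as pd
import pandas_gbq

df = pd.DataFrame(
    {"customer_id": ["a", "b"], "region": ["eu", "us"], "value": [1, 2]}
)

# Hypothetical parameter mirroring bigframes.pandas.DataFrame.to_gbq;
# column order defines the clustering hierarchy.
pandas_gbq.to_gbq(
    df,
    "my_dataset.my_table",
    project_id="my-project",
    clustering_columns=["customer_id", "region"],
)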

tswast commented on Aug 31, 2025

@tswast, I'll have a look in a couple of weeks then. Thanks!

antoineeripret commented on Aug 31, 2025

@tswast: I've worked on a draft PR here.

This would be my first contribution to one of Google's repos, so I'm quite new to how you operate. Any input is more than welcome to ensure that it meets your criteria.

Thanks!

antoineeripret commented on Sep 22, 2025