python-bigquery-pandas
Add MERGE functionality
In BigQuery it is possible to merge two tables:
MERGE DATASET.table T
USING (SELECT * FROM DATASET.another_table) S
ON T.id = S.id
WHEN MATCHED THEN
UPDATE SET value = S.value
WHEN NOT MATCHED THEN
INSERT ROW
Is it possible to add such functionality to the DataFrame.to_gbq function? I.e. something like df.to_gbq(if_exists='merge', on_a='id', on_b='id'), or even provide a possibility to write a whole ON clause: df.to_gbq(if_exists='merge', on='A.city = B.city AND A.date = B.date')
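For illustration, a minimal sketch of how such an on string could expand into the generated MERGE statement (the build_merge_sql helper and the if_exists='merge' option are hypothetical, not existing pandas-gbq API):

def build_merge_sql(target: str, source: str, on: str) -> str:
    # Hypothetical helper: interpolates the user-supplied ON condition
    # into a MERGE statement like the one above.
    return (
        f"MERGE `{target}` T\n"
        f"USING `{source}` S\n"
        f"ON {on}\n"
        "WHEN MATCHED THEN\n"
        "  UPDATE SET value = S.value\n"
        "WHEN NOT MATCHED THEN\n"
        "  INSERT ROW"
    )

sql = build_merge_sql("DATASET.table", "DATASET.another_table", "T.id = S.id")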
To clarify, do you mean for this to be a sort of "upsert", which matches by primary key and replaces existing values?
Yes, or, more generally, by any condition, e.g. A.city = B.city AND A.date = B.date. And insert if there is no match by primary key / condition.
@tswast any news on this? Do you want me to have a look at this and open a PR?
I don't have time to work on it. I think it'd be best to nail down the design before opening a PR.
Besides if_exists='merge', we need to account for large dataframes. Anything larger than the BigQuery query length limit (currently 12 MB) will need to be either staged in a temporary table (or even in GCS, using the external table feature) or split into multiple queries.
As the row size limit is currently 100 MB, which is larger than the query length limit, even a single row may not fit in a query, so staging the data is required for the most general solution.
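As a rough illustration of that constraint, a sketch only (the 12 MB figure is the limit quoted above, and CSV size is just a crude proxy for the length of a query with inlined values):

QUERY_LENGTH_LIMIT = 12 * 1024 * 1024  # query length limit in bytes, per above

def needs_staging(df) -> bool:
    # Approximate the serialized size of the dataframe; anything over the
    # query length limit must be staged in a temporary table (or GCS)
    # rather than inlined into a single query.
    approx_bytes = len(df.to_csv(index=False).encode("utf-8"))
    return approx_bytes > QUERY_LENGTH_LIMIT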
I'm not yet familiar with the pandas-gbq codebase, so it will take me a while. In my projects, I split data into batches and merge them into the main table using temporary tables. Can we just do it like this? Here's the pseudocode:
# we want to merge `data` into `table_id` using `condition`;
# split_data_into_batches and _id_generator are placeholders
from datetime import datetime, timedelta, timezone
from google.cloud import bigquery

client = bigquery.Client()
target = client.get_table(table_id)  # reuse the main table's schema

data_batches = split_data_into_batches(data)
for data_batch in data_batches:
    tmp_id = _id_generator()
    table_tmp_id = f"{table_id}_tmp_{tmp_id}"
    # stage the batch in a temporary table that expires after 1 hour
    table_tmp = bigquery.Table(table_tmp_id, schema=target.schema)
    table_tmp.expires = datetime.now(timezone.utc) + timedelta(hours=1)
    client.create_table(table_tmp)
    client.load_table_from_dataframe(data_batch, table_tmp_id).result()
    # merge the staged batch into the main table
    query = f"""MERGE `{table_id}` T
    USING `{table_tmp_id}` S
    ON {condition}
    WHEN MATCHED THEN
      UPDATE SET ...
    WHEN NOT MATCHED THEN
      INSERT ROW"""
    client.query(query).result()
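The UPDATE SET ... part is left open above; a sketch of one way to generate it (build_set_clause and key_columns are hypothetical, assuming every non-key column gets overwritten):

def build_set_clause(columns, key_columns):
    # overwrite every column except the ones used in the ON condition
    return ", ".join(f"{c} = S.{c}" for c in columns if c not in key_columns)

# e.g. build_set_clause(["id", "city", "value"], {"id"})
# -> "city = S.city, value = S.value"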
Is this being watched? I think an upsert option would be a good improvement; for now I need to delete data before a to_gbq in append mode.
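For reference, a minimal sketch of that delete-then-append workaround (the table name, file_id column, and helper function are made up for illustration):

from google.cloud import bigquery
import pandas_gbq

def delete_then_append(df, file_id: str) -> None:
    # remove rows from an earlier load of the same file...
    client = bigquery.Client()
    client.query(
        "DELETE FROM `my_dataset.my_table` WHERE file_id = @file_id",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[
                bigquery.ScalarQueryParameter("file_id", "STRING", file_id)
            ]
        ),
    ).result()
    # ...then append the fresh rows with pandas-gbq
    pandas_gbq.to_gbq(df, "my_dataset.my_table", if_exists="append")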
@dsantiago I built a lib that does this and has a couple of other handy functionalities: https://pypi.org/project/pygbq/ The code quality could be better, but it does the job.
Good job, I will be sure to check it out!
Any further work on this option? A BigQuery upsert operation is definitely possible with the MERGE syntax. Will this get implemented anytime soon?
@violetbrina , PRs are welcome!
A merge option would be awesome.
I'd love to have this option, too. We have a process to consolidate specific lines of data from different files we receive daily, but everything gets duplicated if we receive the same file twice for any reason. Our only solution is to remove everything in our BQ table that's related to that file and re-process it.
Upvoting for the MERGE option.