
Add MERGE functionality

Open ZiggerZZ opened this issue 4 years ago • 13 comments

In BigQuery it is possible to merge two tables:

MERGE DATASET.table T
USING (SELECT * FROM DATASET.another_table) S
ON T.id = S.id
WHEN MATCHED THEN
	UPDATE SET value = S.value
WHEN NOT MATCHED THEN
	INSERT ROW

Is it possible to add such functionality to the DataFrame.to_gbq function? I.e. something like df.to_gbq(if_exists='merge', on_a='id', on_b='id'), or even provide the possibility to write a whole ON clause: df.to_gbq(if_exists='merge', on='A.city = B.city AND A.date = B.date').
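
For concreteness, here is a rough sketch of what such a helper could look like on top of the google-cloud-bigquery client while the option doesn't exist in pandas-gbq. The merge_gbq name, its parameters, and the staging-table naming are all made up for illustration:

from google.cloud import bigquery

def merge_gbq(df, destination_table, on, project_id=None):
    """Hypothetical helper: stage `df` in a table, then MERGE it into
    `destination_table` using the raw ON clause in `on`, e.g. "T.id = S.id"."""
    client = bigquery.Client(project=project_id)
    staging_table = f"{destination_table}_staging"
    # Overwrite the staging table with the new rows; the load job
    # creates the table if it does not exist yet.
    client.load_table_from_dataframe(
        df,
        staging_table,
        job_config=bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE"),
    ).result()
    # Update every dataframe column on a match; insert the row otherwise.
    set_clause = ", ".join(f"{col} = S.{col}" for col in df.columns)
    query = f"""MERGE `{destination_table}` T
    USING `{staging_table}` S
    ON {on}
    WHEN MATCHED THEN
        UPDATE SET {set_clause}
    WHEN NOT MATCHED THEN
        INSERT ROW"""
    client.query(query).result()

# Usage mirroring the proposed API:
# merge_gbq(df, "DATASET.table", on="T.id = S.id")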

ZiggerZZ avatar Aug 12 '20 08:08 ZiggerZZ

To clarify, do you mean for this to be a sort of "upsert", which matches by primary key and replaces existing values?

tswast avatar Nov 06 '20 15:11 tswast

Yes, or, more generally, by any condition, e.g. A.city = B.city AND A.date = B.date. And insert if there is no match on the primary key / condition.

ZiggerZZ avatar Nov 06 '20 17:11 ZiggerZZ

@tswast any news on this? Do you want me to have a look at this and open a PR?

ZiggerZZ avatar Dec 13 '20 18:12 ZiggerZZ

I don't have time to work on it. I think it'd be best to nail down the design before opening a PR.

Besides if_exists='merge', we need to account for large dataframes. Anything larger than the BigQuery query length limit (currently 12 MB) will need to be either staged in a temporary table (or even GCS, using the external table feature) or split into multiple queries.

As the row size limit is currently 100 MB, which is larger than the query length limit, staging the data is required for the most general solution.
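
In other words, an implementation would need to branch on the rendered query size. A minimal sketch of that check, assuming the 12 MB figure above (the constant and function name are illustrative, not pandas-gbq API):

# If the MERGE inlines its source rows as literals, the query text itself
# must stay under the query length limit; otherwise the rows have to be
# staged in a table first.
MAX_QUERY_BYTES = 12 * 2**20  # 12 MB, per the limit cited above

def needs_staging(rendered_query: str) -> bool:
    return len(rendered_query.encode("utf-8")) > MAX_QUERY_BYTES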

tswast avatar Dec 14 '20 15:12 tswast

I'm not yet familiar with the pandas-gbq codebase, so it will take me a while. In my projects, I split the data into batches and merge them into the main table using temporary tables. Could we do it like this? Here's the pseudocode:

# We want to merge `data` into `table_id` using `condition`.
import datetime
import uuid

from google.cloud import bigquery

client = bigquery.Client()
schema = client.get_table(table_id).schema  # reuse the target table's schema

data_batches = split_data_into_batches(data)
for data_batch in data_batches:
    # Stage the batch in a temporary table that expires on its own.
    table_tmp_id = f"{table_id}_tmp_{uuid.uuid4().hex}"
    table_tmp = bigquery.Table(table_tmp_id, schema=schema)
    table_tmp.expires = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=1)
    client.create_table(table_tmp)
    client.load_table_from_dataframe(data_batch, table_tmp_id).result()

    # Merge the staged batch into the main table.
    query = f"""MERGE `{table_id}` T
    USING `{table_tmp_id}` S
    ON {condition}
    WHEN MATCHED THEN
        UPDATE SET ...
    WHEN NOT MATCHED THEN
        INSERT ROW"""
    client.query(query).result()
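
A possible shape for the split_data_into_batches helper used above; the 100 MB default is arbitrary and everything here is a sketch rather than pandas-gbq code:

import numpy as np
import pandas as pd

def split_data_into_batches(data: pd.DataFrame, max_bytes: int = 100 * 2**20):
    # Estimate the in-memory size, then split row-wise into roughly
    # equal chunks that each stay under the byte budget.
    approx_size = int(data.memory_usage(deep=True).sum())
    n_batches = max(1, -(-approx_size // max_bytes))  # ceiling division
    yield from np.array_split(data, n_batches)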

ZiggerZZ avatar Dec 14 '20 23:12 ZiggerZZ

Is this being watched? I think an upsert option would be a good improvement; for now I need to delete data before a to_gbq in append mode.

dsantiago avatar Jun 22 '21 19:06 dsantiago

@dsantiago I built a lib that does this and has a couple of other handy functionalities - https://pypi.org/project/pygbq/ The code quality could be better, but it does the job.

ZiggerZZ avatar Jun 23 '21 08:06 ZiggerZZ

Good job, I will be sure to check it out!

dsantiago avatar Jun 23 '21 12:06 dsantiago

Any further work on this option? A BigQuery upsert operation is definitely possible with its MERGE syntax. Will this get implemented anytime soon?

violetbrina avatar Apr 07 '22 07:04 violetbrina

@violetbrina , PRs are welcome!

parthea avatar Apr 10 '22 00:04 parthea

A merge option would be awesome.

fanarat avatar Apr 30 '22 08:04 fanarat

I'd love to have this option, too. We have a process to consolidate specific lines of data from different files we receive daily, but everything gets duplicated if we receive the same file twice for any reason. Our only solution is to remove everything in our BQ table that's related to that file and re-process it.
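
For what it's worth, this is exactly the case MERGE would handle: keyed on something that identifies a line within a file, re-processing becomes idempotent. A sketch with made-up table and column names (file_name / line_id are assumptions about the schema):

from google.cloud import bigquery

client = bigquery.Client()
# Rows from the re-received file are assumed staged in DATASET.reprocessed_file.
# Matching rows are overwritten instead of appended, so loading the same
# file twice no longer duplicates data.
client.query("""
    MERGE DATASET.consolidated T
    USING DATASET.reprocessed_file S
    ON T.file_name = S.file_name AND T.line_id = S.line_id
    WHEN MATCHED THEN
        UPDATE SET value = S.value
    WHEN NOT MATCHED THEN
        INSERT ROW
""").result()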

atapia238 avatar Aug 31 '22 16:08 atapia238

upvoting for MERGE option

pedrohml avatar Sep 30 '23 23:09 pedrohml