
Support BigQuery

Open · yu-iskw opened this issue 3 years ago · 19 comments

The dbt package is awesome. I would like to do the same thing for BigQuery. However, I am not sure we can achieve it with run_query alone, because for BigQuery we would first have to upload the artifact files to GCS. As far as I know, there is no SQL statement type that can push a local file to GCS or BigQuery directly.

NOTE

Unfortunately, there is no DDL-style statement for loading data from a local file. It might be worth raising this with Google Cloud.

  • https://cloud.google.com/bigquery/docs/batch-loading-data#bq

yu-iskw · Mar 11 '21

I'm also curious about supporting other sources. Is copying the raw data the primary blocker? Or, said another way, is there any other platform-specific code that could not be written generically?

And for the copy operation, any chance we could trick/hijack the seed/snapshot capability to get a similar result? There must be a generic API layer for data ingestion in order to make seed capabilities work across platforms. (I know this is a long shot, and probably not a critical priority, but I'm curious to hear others' thoughts on this.) Thanks!

aaronsteers · Jun 10 '21

My initial thoughts are that we could extend the adapter for each warehouse to add an upload function which is accessible through a {% do adapter.upload_file(file_path, destination) %} or similar. That function can then just call any API methods needed to load the data into the warehouse.
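
To make that concrete, here's a rough sketch of how a package macro might call such a method (the `upload_file` interface is just the proposal above, not an existing adapter method, and the file path and table name are placeholders):

```sql
-- Hypothetical sketch: assumes the proposed adapter.upload_file(file_path, destination)
-- interface, which does not exist yet. Paths and table names are placeholders.
{% macro upload_dbt_artifacts() %}
    {% if execute %}
        {% do adapter.upload_file(
            'target/run_results.json',
            target.database ~ '.' ~ target.schema ~ '.dbt_run_results'
        ) %}
    {% endif %}
{% endmacro %}
```

Something like that could then be wired up as an on-run-end hook so the artifacts land in the warehouse after every invocation.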

From @jtcohen6:

dbt-bigquery connects using Google's Python clients, and I know those also support uploading JSON files. I tried doing this way back when, hacking into some methods intended for seeds, and got pretty close. I do think it requires a few (probably simple) changes to the adapter code.

Brooklyn Data can take this on in the next few weeks, or contributions are very welcome! See https://github.com/dbt-labs/dbt-bigquery.

NiallRees · Jan 11 '22

From @jtcohen6:

dbt-bigquery connects using Google's Python clients, and I know those also support uploading JSON files. I tried doing this way back when, hacking into some methods intended for seeds, and got pretty close. I do think it requires a few (probably simple) changes to the adapter code.

@NiallRees Do you have a link for this quote? I'm interested in seeing if there is more background to this statement and my googling skills have failed me when I tried to find this quote.

pgoslatara · Feb 06 '22

@NiallRees Do you have a link for this quote? I'm interested in seeing if there is more background to this statement and my googling skills have failed me when I tried to find this quote.

Hi, this was in a message so I don't have a link. You could definitely ask for some pointers over on the issue in dbt-bigquery though! https://github.com/dbt-labs/dbt-bigquery/issues/102

NiallRees · Feb 06 '22

It looks like the macro to upload files to BigQuery has been completed and will be included in dbt v1.1. Maybe we can start working on a port now?

tuftkyle · Mar 22 '22

@tuftkyle Thanks for the interest in this! I've been looking into this issue for a few weeks (I'm the author of the MR you reference). I think the next step is still within dbt-bigquery: the artifacts produced by dbt are JSON, whereas BigQuery supports NDJSON files for uploads (mentioned in this comment). I'm not sure how to convert JSON to NDJSON using Python. I'm also not sure how to handle periods (".") in the keys in manifest.json, as BigQuery does not support column names containing periods. Maybe we can replace periods with double underscores, or remove them? Opinions welcome as I don't know the best approach!

pgoslatara · Mar 23 '22

Hey @pgoslatara - the artifacts produced by dbt can already be considered newline delimited, since each file contains a single object, so we should be good to go there. As for the periods (".") in the keys, those are fine as long as we use the JSON subscript operator in BigQuery, e.g. manifest['key.with.periods'].
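
For example, assuming the raw manifest has been loaded into a single string column (the relation and column names below are just placeholders), something along these lines should work:

```sql
-- Illustrative only: relation and column names are placeholders.
-- Bracket notation stops the periods in node IDs being read as JSONPath separators.
select
    json_extract_scalar(manifest, "$.metadata.dbt_version") as dbt_version,
    json_extract_scalar(
        manifest,
        "$.nodes['model.my_project.my_model'].config.materialized"
    ) as my_model_materialization
from `my-project`.dbt_artifacts.manifest
```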

To go about this, I'd suggest first attempting to upload the artifacts into a BigQuery table (or a table for each artifact type if needed), and from there working out how to get them into the format required by https://github.com/brooklyn-data/dbt_artifacts/blob/main/models/staging/stg_dbt__artifacts.sql. After that, it should just be a case of making the models compatible with both BigQuery and Snowflake.

I'm happy to be as involved as required so please let me know if you'd like more help :)

NiallRees · Mar 23 '22

@NiallRees I've submitted MR153 for this, currently working through some suggested changes.

Can you elaborate on the "as long as we use the JSON subscript operator" element of your comment? I'm not sure I follow this fully. If an artifact is uploaded as a STRING, I understand that this can be converted to a JSON data type (although this is still in preview and requires enrollment to access). The blocker I see is that after this conversion there is still a period in the key name, and I'm not sure if BigQuery can handle that (as I don't have access to these features yet).

pgoslatara · Apr 12 '22

Hey again all - we've been busy reimplementing the package, opening it up to be compatible with more adapters without having to come up with a warehouse-specific way of uploading the artifacts - by avoiding the artifacts altogether. We are now uploading the graph and results context variables instead.

In the new world, adding BigQuery compatibility involves implementing a BigQuery version of each warehouse-dispatchable macro defined in https://github.com/brooklyn-data/dbt_artifacts/tree/main/macros. Let us know if you'd be interested in contributing those changes!
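
For anyone picking this up, the general shape of the dispatch pattern is roughly as follows (the macro name here is made up for illustration; the real ones live in the macros directory linked above):

```sql
-- Illustrative macro names only; see the package's macros directory for the real ones.
{% macro get_models_insert_sql(models) %}
    {{ return(adapter.dispatch('get_models_insert_sql', 'dbt_artifacts')(models)) }}
{% endmacro %}

{% macro default__get_models_insert_sql(models) %}
    {# generic / Snowflake-compatible SQL lives here #}
{% endmacro %}

{% macro bigquery__get_models_insert_sql(models) %}
    {# BigQuery-specific SQL goes here #}
{% endmacro %}
```

dbt then picks the `bigquery__` implementation automatically when the target adapter is BigQuery, falling back to `default__` otherwise.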

NiallRees · Jul 21 '22

Awesome, I will be working on making it compatible with BigQuery and will start on it tomorrow. If anyone is up for pair programming on this, please let me know.

charles-astrafy · Jul 22 '22

How are you getting on here @charles-astrafy? Shout if I can help at all!

NiallRees · Aug 01 '22

Sorry, I have been sidetracked a bit with other stuff. I have time to work on it tomorrow and will give an update by EOD.

charles-astrafy · Aug 01 '22

All good @charles-astrafy no pressure from over here!

NiallRees · Aug 01 '22

@NiallRees started to work on it as of today. Will keep you posted and will make time on a daily basis in the coming days. Might need your guidance/help if I encounter some blockers but all good at the moment.

charles-astrafy · Aug 02 '22

Awesome @charles-astrafy!

NiallRees · Aug 02 '22

@NiallRees Just a small update. I have been making good progress, but quite some refactoring is needed to make it work with BigQuery. For instance, the upload macros are currently generic, using "SELECT ... FROM VALUES (...), (...)". This syntax does not work in BigQuery, so I am adding a dispatch abstraction layer for those macros. I should have a pull request ready by Friday.
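
Roughly, the difference looks like this (columns and values below are just placeholders):

```sql
-- Placeholder columns and values, for illustration only.

-- Generic / Snowflake-style form used by the current upload macros:
select column1 as node_id, column2 as run_started_at
from values
    ('model.my_project.model_a', '2022-08-12'),
    ('model.my_project.model_b', '2022-08-12');

-- A BigQuery-friendly equivalent using an array of structs:
select node_id, run_started_at
from unnest([
    struct('model.my_project.model_a' as node_id, '2022-08-12' as run_started_at),
    struct('model.my_project.model_b' as node_id, '2022-08-12' as run_started_at)
]);
```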

charles-astrafy · Aug 10 '22

Sounds great @charles-astrafy, appreciate your efforts! Really looking forward to seeing how you get it working.

NiallRees · Aug 10 '22

clicked the wrong button 🙈

NiallRees · Aug 10 '22

@NiallRees --- Pull request done.

https://github.com/brooklyn-data/dbt_artifacts/pull/172

charles-astrafy · Aug 12 '22

Hi all! Sorry for jumping in! This could be very helpful as well for me :D

Is this only in need of review, or is it fully functional already, @charles-astrafy?

Thanks!

adrpino · Aug 24 '22