sqlalchemy-bigquery
Full-fledged SQLAlchemy support
To use software that relies on SQLAlchemy as a backend (e.g. https://github.com/airbnb/caravel), full-fledged SQLAlchemy support is needed.
To developers: Are there plans for further development of this repo?
I'm interested in helping out where I can, but I no longer use BQ in my day to day.
Do you know what else this dialect needs to get you where you want to be?
hey @Froskekongen, did you manage to use bq with caravel?
I think that having basic functionality to run queries is the obvious first step. I'm unfamiliar with how this is done in SQLAlchemy, though. Could anyone share pointers on where this functionality can be 'filled in'? That way I could help out by implementing some basic capability.
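From a quick look at the SQLAlchemy docs, the entry points seem to be roughly the following. This is only a sketch: the class and the bigquery_dbapi_shim module are hypothetical, while DefaultDialect and the registry are real SQLAlchemy API.

```python
# A minimal sketch of the hooks a BigQuery dialect would implement.
from sqlalchemy.engine import default
from sqlalchemy.dialects import registry


class BigQueryDialect(default.DefaultDialect):
    name = "bigquery"

    @classmethod
    def dbapi(cls):
        # SQLAlchemy expects a PEP 249 (DB-API 2.0) module here; for BigQuery
        # this would be a shim wrapping job submission and result polling.
        import bigquery_dbapi_shim  # hypothetical module
        return bigquery_dbapi_shim

    def create_connect_args(self, url):
        # Map bigquery://project/dataset onto the shim's connect() arguments.
        return [], {"project": url.host, "dataset": url.database}


# After registration, create_engine("bigquery://my-project/my_dataset")
# resolves to the class above.
registry.register("bigquery", "bigquery_dialect", "BigQueryDialect")
```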
Does anyone know of a good Python ODBC driver for BigQuery? We could probably look at incorporating one, and the dialect would then reflect the patterns in the official dialects.
Afaik there are two: Simba and CData. They don't seem to be open source.
Given that the synthesis of SQL seems to be working, is it not the case that the main functionality required is the ability to submit queries and return the results? In general this is not super difficult, so I think at least a subset of the functionality could be straightforward to implement.
Yeah, I haven't looked into it. Does BigQuery even support regular database connections? Every client I used had you submit a job and then poll for results afterwards.
Hey,
Not really - streaming queries are the closest, in that they are expected to be serviced when received (rather than batched). Even so, the concept of a connection would be much weaker, as it would amount to little more than an object that can later be used to authenticate with Google. Moreover, each query is essentially independent of all others; BigQuery does not support per-session state as provided by e.g. postgres.
WRT your question, that is correct - it is necessary to poll BigQuery for results, and if the result set is large, it becomes necessary to retrieve individual pages. It would of course be possible to abstract this away, only returning when the results are available; the problem is that this would be slow.
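To make the pattern concrete, this is roughly what submit-then-poll looks like through the present-day google-cloud-bigquery client, which hides the polling loop (project and table names are placeholders):

```python
# The submit-then-poll pattern, as wrapped by the google-cloud-bigquery client.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project
job = client.query("SELECT name, value FROM `my-project.my_dataset.my_table`")

# result() polls the job until it finishes, then returns an iterator that
# fetches result pages lazily as rows are consumed.
for row in job.result(page_size=10000):
    print(row["name"], row["value"])
```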
I think the match between BigQuery and a full-fledged database is definitely not perfect, and it may be that only a subset of functionality is implementable (e.g. using BigQuery through an ORM sounds exceptionally painful). Nevertheless, it seems a useful thing to have, and it may provide inspiration for other SQL-like systems that are nevertheless not database-like (e.g. connections per se are not supported; batched/polled behaviour needs to be abstracted, etc.).
Another issue that may be relevant: it would be a good idea to check whether the current SQL query rendering is compatible with the 'standard' BigQuery SQL dialect rather than the 'legacy' one - the distinction is clarified in the docs. This is particularly important for queries operating over wildcard tables.
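For example (table names made up), the same monthly rollup renders quite differently in the two dialects:

```python
# Legacy SQL: square-bracket names and TABLE_DATE_RANGE() for wildcard tables.
legacy_sql = """
SELECT event, COUNT(*) AS n
FROM TABLE_DATE_RANGE([my_dataset.events_],
                      TIMESTAMP('2016-01-01'), TIMESTAMP('2016-01-31'))
GROUP BY event
"""

# Standard SQL: a backtick-quoted wildcard table plus a _TABLE_SUFFIX filter;
# this is the form the compiler would need to emit.
standard_sql = """
SELECT event, COUNT(*) AS n
FROM `my_dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20160101' AND '20160131'
GROUP BY event
"""
```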
Something I'm particularly concerned about is how to deal with the JOIN EACH and GROUP EACH BY resource hints you apply in your query.
Maybe a ton of engine work could be done so that the metadata object holds not only information about your schemas but also the sizes a given join/group-by needs to operate over, so the hints could be added intelligently, but that's something that would absolutely have to happen if connections were implemented.
If this feature weren't present, users would run into errors on anything larger than trivial sizes, and the dialect's abstraction would lose all its value. I'd rather it just fit into the toolset as a means of generating SQL, so that other, more mature libraries handle the BQ client stuff.
Hey,
BigQuery gets normal analytic functions when using standard SQL. This means that 'EACH' is no longer needed; conventional 'OVER' and 'PARTITION BY' will suffice:
https://cloud.google.com/bigquery/sql-reference/query-syntax#analytic-functions
So this is not a problem anymore! And this of course means that we can re-use more of the previous parser constructs.
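Side by side, with hypothetical table names:

```python
# Legacy SQL needed EACH hints for large joins and aggregations:
legacy_sql = """
SELECT a.user_id, COUNT(*) AS n
FROM [my_dataset.clicks] a
JOIN EACH [my_dataset.users] b ON a.user_id = b.user_id
GROUP EACH BY a.user_id
"""

# Standard SQL: plain JOIN and GROUP BY, with the engine sizing the shuffle:
standard_sql = """
SELECT a.user_id, COUNT(*) AS n
FROM `my_dataset.clicks` a
JOIN `my_dataset.users` b ON a.user_id = b.user_id
GROUP BY a.user_id
"""

# And conventional analytic functions are available where needed:
analytic_sql = """
SELECT user_id,
       ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts) AS click_rank
FROM `my_dataset.clicks`
"""
```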
Thanks for your responses, Raul
@Froskekongen, @rlanda
I'm not able to work on this, as I no longer use BigQuery, nor do I have accounts for it. I feel like there may be tricky things that come up in supporting this, such as (and bear in mind I'm speaking from what I knew of BQ a year and a half ago) large result sets not being retrievable without first saving them to Cloud Storage and downloading the files from there.
As a user, if I were using something that promised to be a seamless SQL database connector for a thing that holds data, and it just wouldn't return data and instead told me where to download it myself, I would be pretty pissed.
If a SQLAlchemy engine for BigQuery is something people want, these caveats either have to be built into it, or it needs to not pretend that queries can be used seamlessly with large result sets, and instead guide users with informative error messages.
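To sketch what an informative guard might look like (the byte threshold, the names, and the policy itself are assumptions for illustration, not anything the dialect does today):

```python
# Inspect the finished query's destination table and either stream it or fail
# with an actionable message, rather than dying opaquely on huge results.
from google.cloud import bigquery

MAX_INLINE_BYTES = 1 << 30  # 1 GiB; arbitrary illustrative cutoff


def fetch_or_explain(client, sql):
    job = client.query(sql)
    job.result()  # wait for the job; raises on query errors
    table = client.get_table(job.destination)
    if table.num_bytes > MAX_INLINE_BYTES:
        raise RuntimeError(
            "Result set is %d bytes; export it to Cloud Storage with an "
            "extract job instead of fetching it through this dialect."
            % table.num_bytes
        )
    return client.list_rows(table)
```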
I want to provide the support that I can, but this will be limited to helping out with offline testing and publishing merged work to PyPI. I cannot test SQA engine features, and even if I got an account temporarily, the quality of SQA engine support will suffer since I don't use BQ -- I won't be properly dog-fooding the lib.
Is anyone interested in working on this? Does anyone need this feature anymore?
Hey,
Whether you need to export to Cloud Storage or not is immaterial, as the library could abstract this away. It is not necessary, by the way: results can be retrieved page by page.
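A sketch of what that paging looks like with the google-cloud-bigquery client (names are placeholders):

```python
# Large result sets can be read page by page from the query's destination
# table (tabledata.list under the hood), with no Cloud Storage round-trip.
from google.cloud import bigquery

client = bigquery.Client()
job = client.query("SELECT * FROM `my_dataset.big_table`")
job.result()  # wait for completion; rows land in an anonymous table

destination = client.get_table(job.destination)
for page in client.list_rows(destination, page_size=50000).pages:
    for row in page:
        handle(row)  # hypothetical per-row consumer
```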
WRT your involvement, I understand. I am working on this on my own. If and when I have something to release, maybe you could help, as you suggested, with testing and publishing to PyPI. I think the amount of code that could be reused from the present library is negligible.
Thanks, Raul