Consider using google-cloud-bigquery library instead of google-api-services-bigquery

Open clairemcginty opened this issue 7 years ago • 7 comments

Google documentation recommends using the client library google-cloud-bigquery rather than the API library google-api-services-bigquery.

Pros

  • google-cloud-bigquery uses typed protobuf API request/response params rather than plain strings/ints, implicitly handles transport layer configurations, and potentially improves performance by making direct RPC calls rather than JSON over HTTP. We've seen intermittent failures with the scio-bigquery IT suite due to network timeouts, which might be solved by migrating.
  • The google-api-services-bigquery library is in maintenance mode and, aside from critical bug fixes, won't receive any new features.

Cons

  • Unfortunately, the data models are quite different, and the three classes from the API library that we publicly expose in Scio - TableSchema, TableReference, and TableRow - map to Schema, TableId, and FieldValueList in the client library. So, if we end up migrating, we'd have to decide whether to change the externally facing Scio API or handle those conversions ourselves in private methods. I have a WIP branch for this migration I'll link to as soon as it's cleaned up.
  • While I was developing that branch I found an issue with the client library that breaks cross-project extraction jobs: https://github.com/googleapis/google-cloud-java/issues/3924, so in its current state the client library is not fully usable in Scio.
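To make the "handle those conversions ourselves in private methods" option concrete, here is a minimal sketch of what such a boundary converter could look like for the simplest of the three pairs, TableReference and TableId. The two nested classes are hypothetical stand-ins so the example runs without either Google library on the classpath; they only mirror the three identifying fields (the real classes expose them as getProjectId/getDatasetId/getTableId and getProject/getDataset/getTable respectively). `TableRefConversions`, `toClient`, and `toApi` are illustrative names, not existing Scio code.

```java
import java.util.Objects;

final class TableRefConversions {
    private TableRefConversions() {}

    /** Hypothetical stand-in for the API library's TableReference. */
    static final class ApiTableReference {
        final String projectId;
        final String datasetId;
        final String tableId;

        ApiTableReference(String projectId, String datasetId, String tableId) {
            this.projectId = projectId;
            this.datasetId = datasetId;
            this.tableId = tableId;
        }
    }

    /** Hypothetical stand-in for the client library's TableId. */
    static final class ClientTableId {
        final String project;
        final String dataset;
        final String table;

        ClientTableId(String project, String dataset, String table) {
            this.project = project;
            this.dataset = dataset;
            this.table = table;
        }
    }

    /** Converts the publicly exposed type to the client-library type at the boundary. */
    static ClientTableId toClient(ApiTableReference ref) {
        Objects.requireNonNull(ref, "ref");
        // The project is passed through unchanged, even if it is null.
        return new ClientTableId(ref.projectId, ref.datasetId, ref.tableId);
    }

    /** Converts back so callers keep seeing the type Scio already exposes. */
    static ApiTableReference toApi(ClientTableId id) {
        Objects.requireNonNull(id, "id");
        return new ApiTableReference(id.project, id.dataset, id.table);
    }
}
```

With converters like these kept private, the externally facing Scio API would stay on the API-library types while the internals call the client library. The schema and row pairs would need more involved, recursive versions of the same idea.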

clairemcginty avatar Nov 26 '18 20:11 clairemcginty

Update: the client library bug affecting extract jobs has been fixed! https://github.com/googleapis/google-cloud-java/issues/3924

clairemcginty avatar Dec 06 '18 04:12 clairemcginty

@ClaireMcGinty is this still worth looking into?

nevillelyh avatar Jan 15 '20 20:01 nevillelyh

Talked IRL, closing.

nevillelyh avatar Jan 15 '20 21:01 nevillelyh

I would like us to reconsider and re-open this. I think there are still some subtle bugs in our current internal BigQuery client. Some of these bugs are related to not falling back to environment settings/properties.
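As a rough illustration of the kind of fallback meant here, the sketch below resolves a project ID from an explicit JVM system property first and falls back to an environment variable when the property is missing or blank. The names `bigquery.project` and `GOOGLE_CLOUD_PROJECT` are assumptions chosen for the example, and `ProjectResolver` is a hypothetical helper, not the current Scio client code.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

final class ProjectResolver {
    private ProjectResolver() {}

    // Assumed names, used for illustration only.
    static final String SYS_PROP = "bigquery.project";
    static final String ENV_VAR = "GOOGLE_CLOUD_PROJECT";

    /** Resolves the project from the given properties, then from the given environment. */
    static Optional<String> resolve(Properties props, Map<String, String> env) {
        Optional<String> fromProps = nonBlank(props.getProperty(SYS_PROP));
        if (fromProps.isPresent()) {
            return fromProps;
        }
        // Fall back to the environment when the property is unset or blank.
        return nonBlank(env.get(ENV_VAR));
    }

    /** Convenience overload that reads the real JVM properties and process environment. */
    static Optional<String> resolve() {
        return resolve(System.getProperties(), System.getenv());
    }

    private static Optional<String> nonBlank(String value) {
        return Optional.ofNullable(value)
                .map(String::trim)
                .filter(v -> !v.isEmpty());
    }
}
```

Taking the properties and environment as parameters keeps the lookup order testable without mutating global state.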

regadas avatar May 20 '20 13:05 regadas

@nevillelyh @ClaireMcGinty what was the reason to not go forward with this?

regadas avatar May 20 '20 13:05 regadas

@regadas If I remember right, it was due to the complexity of integrating with Beam's BigQuery sources/sinks -- Beam returned types from google-api-services-bigquery and a lot of the Google library functions that could convert those to google-cloud-bigquery types were private.

This was a while ago though, so maybe it's worth a second look?

clairemcginty avatar May 20 '20 13:05 clairemcginty

@ClaireMcGinty interesting! I think it's worth looking into again, since we are already using the storage impl to actually retrieve data.

Let's see if the other types are good to go as well. I'll book some time to look into this.

Thanks

regadas avatar May 20 '20 14:05 regadas