google-cloud
Added support for Snappy compression in Avro records for BigQuery Pushdown jobs
Note that when a table is uploaded using the JSON format (which happens when the schema contains Datetime fields), no compression will be used.
Most of the logic implemented in these classes is a straight copy from the GCP Hadoop connector, with some logic removed/added as needed.
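For context, here is a minimal sketch of what Snappy-compressed Avro staging looks like with the plain Avro file API; the schema and file name are made up for illustration and are not taken from the plugin code:

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

public class SnappyAvroExample {
  public static void main(String[] args) throws IOException {
    // Hypothetical schema standing in for a pushdown staging table.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Row\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"name\",\"type\":\"string\"}]}");

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
    // Enable Snappy block compression; must be set before create().
    writer.setCodec(CodecFactory.snappyCodec());
    writer.create(schema, new File("staging.avro"));

    GenericRecord record = new GenericRecordBuilder(schema)
        .set("id", 1L)
        .set("name", "example")
        .build();
    writer.append(record);
    writer.close();
  }
}
```

Snappy generally produces somewhat larger files than deflate but compresses much faster, which is why it is a common choice for transient staging data.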
Is Snappy packaged as a dependency in the plugin jar, or does it assume it's available in the cluster environment?
It may be good to have a way to shut off compression just in case, perhaps by reading some runtime argument. That way, if somebody happens to have problems with it, they can set a system preference and disable it for all their pipelines.
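A rough sketch of that kind of kill switch, assuming a JVM system property is used as the toggle; the flag name here is hypothetical and only illustrates the suggestion:

```java
import org.apache.avro.file.CodecFactory;

public final class CompressionToggle {
  // Hypothetical flag name, not part of the actual plugin.
  private static final String DISABLE_FLAG = "gcp.bq.pushdown.disable.compression";

  private CompressionToggle() { }

  /** Falls back to no compression when the flag is set, otherwise uses Snappy. */
  public static CodecFactory chooseCodec() {
    if (Boolean.getBoolean(DISABLE_FLAG)) {
      return CodecFactory.nullCodec();
    }
    return CodecFactory.snappyCodec();
  }
}
```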
@albertshau Snappy was integrated into Hadoop Common in 2011: https://code.google.com/archive/p/hadoop-snappy/
We can add a config setting for the BQ Pushdown plugin that will allow users to enable/disable compression (enabled by default).
Since BQ Pushdown can be enabled using the Pipeline execution UI, this will work effectively as a runtime argument.
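For illustration, a sketch of what such a config property could look like on a CDAP PluginConfig; the property name `useCompression` and its wiring are assumptions, not the actual plugin code:

```java
import io.cdap.cdap.api.annotation.Description;
import io.cdap.cdap.api.annotation.Macro;
import io.cdap.cdap.api.annotation.Name;
import io.cdap.cdap.api.plugin.PluginConfig;

import javax.annotation.Nullable;

/** Illustrative config fragment for the compression toggle. */
public class PushdownCompressionConfig extends PluginConfig {

  @Name("useCompression")  // hypothetical property name
  @Description("Whether to Snappy-compress Avro records staged for BigQuery. Defaults to true.")
  @Macro
  @Nullable
  private Boolean useCompression;

  public boolean shouldCompress() {
    // Enabled by default when the property is not set.
    return useCompression == null || useCompression;
  }
}
```

Marking the field with @Macro is what allows the value to be supplied at runtime, which is how it can behave like a runtime argument from the pipeline execution UI.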