
Added support for Snappy compression in Avro records for BigQuery Pushdown jobs

Open fernst opened this issue 3 years ago • 2 comments

Note that when a table is uploaded using the JSON format (which happens when the schema contains Datetime fields), no compression will be used.

Most of the logic implemented in these classes is a straight copy from the GCP Hadoop connector, with some logic removed or added as needed.
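The selection logic described above can be sketched roughly as follows. This is a minimal illustration, not the connector's actual API; the function name and field-type strings are hypothetical.

```python
# Hypothetical sketch of the format/codec selection described in this PR:
# tables with DATETIME fields fall back to uncompressed JSON uploads,
# everything else uses Avro records with Snappy compression.
def choose_load_format(field_types):
    """Return (format, codec) for a BigQuery load job."""
    if "DATETIME" in field_types:
        return ("JSON", None)       # JSON path: no compression applied
    return ("AVRO", "snappy")       # Avro path: Snappy-compressed records

print(choose_load_format(["STRING", "DATETIME"]))  # ('JSON', None)
print(choose_load_format(["STRING", "INT64"]))     # ('AVRO', 'snappy')
```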

fernst avatar Jul 13 '21 03:07 fernst

Is snappy packaged as a dependency in the plugin jar or does it assume it's available in the cluster environment?

It may be good to have a way to shut off compression just in case, perhaps by reading some runtime argument. That way, if somebody happens to have problems with it, they can set a system preference and disable it for all their pipelines.
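A kill switch like that could look something like the sketch below. The argument key is made up for illustration; the actual setting name would be whatever the plugin defines.

```python
# Sketch of the suggested runtime-argument kill switch. The key
# "gcp.bq.avro.compression.enabled" is hypothetical; compression
# defaults to on and only an explicit "false" disables it.
def compression_enabled(runtime_args):
    value = runtime_args.get("gcp.bq.avro.compression.enabled", "true")
    return value.strip().lower() != "false"

print(compression_enabled({}))                                            # True
print(compression_enabled({"gcp.bq.avro.compression.enabled": "false"}))  # False
```

Defaulting to enabled keeps the new behavior for everyone, while giving anyone who hits a Snappy problem a one-line way to opt out across all their pipelines.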

albertshau avatar Jul 13 '21 15:07 albertshau

@albertshau Snappy was integrated into Hadoop Common in 2011 https://code.google.com/archive/p/hadoop-snappy/

We can add a config setting for the BQ Pushdown plugin that will allow users to enable/disable compression (enabled by default).

Since BQ Pushdown can be enabled using the Pipeline execution UI, this will work effectively as a runtime argument.

fernst avatar Jul 13 '21 17:07 fernst