kafka-connect-hdfs

Set GZIP compression for Parquet FIle

Open zizake opened this issue 5 years ago • 2 comments

Hello,

I have the following configuration for my sink connector. Is there any way to set a custom compression codec for the Parquet files? The default is Snappy; I would like to change it to GZIP for its better compression ratio.

In Hive, the equivalent command would be: SET parquet.compression=GZIP;

connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
hadoop.conf.dir=/etc/hadoop/conf
flush.size=10000
schema.compatibility=BACKWARD
tasks.max=1
topics=kafka_playground
timezone=UTC
hdfs.url=hdfs://XXXXXXXXXXXXXx:8020
hive.metastore.uris=thrift://XXXXXXXXXXX:9083
locale=en-us
key.converter.schemas.enable=false
value.converter.schema.registry.url=http://XXXXXXXXXXXXXX:8081
hive.integration=true
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
partitioner.class=io.confluent.connect.hdfs.partitioner.HourlyPartitioner
value.converter=io.confluent.connect.avro.AvroConverter

Thanks!

zizake avatar Feb 24 '20 12:02 zizake

@zizake unfortunately it looks like we don't support changing the compression yet, but that could be a good contribution if you are interested in opening a PR.
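For reference, a minimal sketch of the underlying change, using the standard parquet-avro writer API. This is illustrative only: the class name, schema, and output path below are made up, and the connector's real writer plumbing will differ.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// Hypothetical, standalone example: write one Avro record to a
// GZIP-compressed Parquet file. A PR to the connector would need to
// wire the codec choice into its ParquetFormat writer in a similar way.
public class GzipParquetSketch {
    public static void main(String[] args) throws Exception {
        // Toy schema, for illustration only.
        Schema schema = SchemaBuilder.record("Event").fields()
                .requiredString("id")
                .endRecord();

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("/tmp/example.parquet"))
                .withSchema(schema)
                // The relevant knob: GZIP instead of the Snappy default.
                .withCompressionCodec(CompressionCodecName.GZIP)
                .build()) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", "42");
            writer.write(record);
        }
    }
}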

levzem avatar Apr 17 '20 03:04 levzem

@levzem I see the following claim in the documentation. If compression really isn't configurable, does that mean the documentation is inaccurate?

parquet.codec
  The Parquet compression codec to be used for output files.
  Type: string
  Default: snappy
  Valid Values: [none, snappy, gzip, brotli, lz4, lzo, zstd]
  Importance: low
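If that documented property does apply to this connector version, switching to GZIP should just be a matter of adding one line to the sink configuration above (untested here, and this thread suggests it may not be honored by older releases):

parquet.codec=gzip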

moeinxyz avatar Jul 12 '23 13:07 moeinxyz