Add support for parquet logical types to parquet_encode processor
In some cases, users will need to specify the logical type in the schema field. Details here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

For example, when using `type: BYTE_ARRAY` to encode a string value, they might want to set the logical type to `STRING` so that decoders can interpret it correctly. Given this config:
```yaml
input:
  generate:
    mapping: root.test = "deadbeef"
    count: 1
    interval: 0s

pipeline:
  processors:
    - parquet_encode:
        schema:
          - name: test
            type: BYTE_ARRAY

output:
  file:
    path: output.parquet
    codec: all-bytes
```
the pipeline will produce a Parquet binary that, when decoded with parquet-tools, contains a base64-encoded value:
```
> docker run --rm -v$(pwd):/tmp/parquet nathanhowell/parquet-tools cat /tmp/parquet/output.parquet
test = ZGVhZGJlZWY=
```
However, if we change this line of code to `n = parquet.String()`, then parquet-tools will output `test = deadbeef`.
This issue seems to also cause Pandas to fail with "OSError: Not yet implemented: DecodeArrow for DeltaLengthByteArrayDecoder.": https://github.com/segmentio/parquet-go/issues/325
I've added a UTF8 option for column values: https://github.com/benthosdev/benthos/commit/07ed81b150778a362e25e52428c59a05ca21369b as a quick workaround. Technically, I think we ought to expose logical types via a separate field, but we can cross that bridge later.
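If we do go the separate-field route, the schema config might look something like this (the `logical_type` field name here is purely hypothetical, not part of the current config spec):

```yaml
pipeline:
  processors:
    - parquet_encode:
        schema:
          - name: test
            type: BYTE_ARRAY
            logical_type: STRING  # hypothetical field, not yet implemented
```

That would keep the physical and logical types orthogonal, matching how the Parquet spec itself separates them.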