parquet-java icon indicating copy to clipboard operation
parquet-java copied to clipboard

PARQUET-1020: Support for Protobuf DynamicMessages by allowing user to set descriptor

Open a-buck opened this issue 8 years ago • 5 comments

Hello,

This is an initial implementation for this feature. I haven't added tests yet and perhaps it could be tidied up a bit but wanted to make sure that I am on the right track before spending too much time on it. https://issues.apache.org/jira/browse/PARQUET-1020

Please let me know what you think.

Many thanks,

Alex

a-buck avatar Jun 06 '17 17:06 a-buck

CCing folks who showed interest in parquet-proto: @qinghui-xu @matt-martin @lukasnalezenec @costimuraru @kgalieva: comments?

julienledem avatar Jun 09 '17 18:06 julienledem

I've made a few changes on top of @a-buck changes so it doesn't depend on static state but uses the hadoop config to pass the Protobuf descriptors (as base64 strings).

The code is here: https://github.com/andredasilvapinto/parquet-mr/commit/9669b634bd001858ff4261609c07156cfa4f35c1

I can add a few tests if you find this approach valuable. We are using it in production to generate Parquet from Protobuf for multiple different data sets (with multiple PB formats) without having to keep compiling and including new versions of the compiled ProtoBuf schemas (we just download the latest FileDescriptorSet from S3 on every run) and then:

ProtoParquetOutputFormat.setProtobufSchema(job, schemaBase64, protoFile, messageType);

protoFile being the name of the original .proto file inside the FileDescriptorSet and messageType the name of the ProtoBuf message.

andredasilvapinto avatar Sep 05 '17 10:09 andredasilvapinto

Hi, good idea. We should not use static state and i would prefer to extract this code to some strategy (design pattern) class. Everybody could inject own strategy implementation using hadoop configuration.

lukasnalezenec avatar Mar 07 '18 13:03 lukasnalezenec

This functionality is exactly what we need. How do we move this PR forward? I would be willing to make additional required code changes. Is the change requested by @lukasnalezenec whats stopping this from being accepted?

stejskal avatar Sep 04 '18 18:09 stejskal

This PR seems to have stagnated but this is exactly what we're looking for, if I fork and fix those conflicts can we reignite the discussion?

alexcardell avatar Jun 12 '20 14:06 alexcardell