[HUDI-4904] add support for unraveling proto schemas in ProtoClassBasedSchemaProvider

Open the-other-tim-brown opened this issue 3 years ago • 1 comments

Change Logs

If a user provides a recursive proto schema, the write to Parquet fails. We need to allow the user to specify how many levels of recursion to keep before the remaining data is truncated.

Main changes to existing code:

  • ProtoClassBasedSchemaProvider tracks the number of times a message descriptor is seen within a branch of the schema traversal
  • once the number of times that descriptor is seen exceeds the user-provided limit, the field is set to a preset record that contains two fields: 1) the remaining data serialized as a proto byte array, 2) the descriptor's full name, for context about what is in that byte array
  • Converting from proto to Avro now accounts for this truncation of the input
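
The traversal described in the list above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `Msg` is a hypothetical stand-in for a proto message descriptor, the overflow field names (`descriptor_full_name`, `proto_bytes`) are made up, and the exact off-by-one convention for the limit is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RecursionLimitSketch {
  /** Hypothetical stand-in for a proto message descriptor: a full name plus message-typed fields. */
  static final class Msg {
    final String fullName;
    final Map<String, Msg> fields = new LinkedHashMap<>();

    Msg(String fullName) {
      this.fullName = fullName;
    }
  }

  /** Returns one line per visited field, truncating a branch once a descriptor is seen more than limit times. */
  static List<String> outline(Msg root, int limit) {
    List<String> out = new ArrayList<>();
    walk(root, "root", new HashMap<>(), limit, out);
    return out;
  }

  static void walk(Msg msg, String path, Map<String, Integer> seen, int limit, List<String> out) {
    // Count how often this descriptor has been seen on the current branch.
    int count = seen.merge(msg.fullName, 1, Integer::sum);
    if (count > limit) {
      // Over the limit: stop expanding and emit the preset overflow record instead
      // (field names here are hypothetical).
      out.add(path + " -> overflow {descriptor_full_name=" + msg.fullName + ", proto_bytes}");
    } else {
      out.add(path + " -> record " + msg.fullName);
      for (Map.Entry<String, Msg> field : msg.fields.entrySet()) {
        walk(field.getValue(), path + "." + field.getKey(), seen, limit, out);
      }
    }
    // Undo the count on the way out so sibling branches are not affected.
    seen.merge(msg.fullName, -1, Integer::sum);
  }

  public static void main(String[] args) {
    Msg node = new Msg("test.Node");
    node.fields.put("child", node); // self-reference makes the schema recursive
    outline(node, 2).forEach(System.out::println);
  }
}
```

With a limit of 2, the self-referencing `test.Node` is expanded twice and its third occurrence on the branch becomes the overflow record. The count is decremented when leaving a message so that it reflects a single branch rather than the whole schema.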

Impact

As part of this change, I needed to change how the namespace was set for the Records within the Avro schema. Since we cannot repeat the exact same namespace + name, I made the namespace the path within the schema being traversed so each instance of a recursive message class will have a unique full name.
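
To illustrate why a path-based namespace gives each occurrence a unique full name, here is a small sketch. It assumes the namespace is the dot-joined list of field names visited so far; the helper `fullName` and the field name `child` are hypothetical, and the exact namespace format used in the PR may differ.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NamespacePathSketch {
  /** Builds an Avro-style full name (namespace + "." + name) from the traversal path. */
  static String fullName(List<String> path, String recordName) {
    String namespace = String.join(".", path);
    return namespace.isEmpty() ? recordName : namespace + "." + recordName;
  }

  public static void main(String[] args) {
    List<String> path = new ArrayList<>();
    Set<String> names = new HashSet<>();
    // Visit the same record name at three nesting levels of a recursive message.
    for (int depth = 0; depth < 3; depth++) {
      String name = fullName(path, "Node");
      if (!names.add(name)) {
        throw new IllegalStateException("duplicate full name: " + name);
      }
      System.out.println(name);
      path.add("child"); // descend one level through the hypothetical "child" field
    }
  }
}
```

Because the path grows by one element per level, the three occurrences of `Node` get three distinct full names, so the schema never repeats the same namespace + name pair.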

Marking this as low risk since the protobuf support is scheduled for the 0.13.0 release.

**Risk level: low**

Contributor's checklist

  • [x] Read through contributor's guide
  • [x] Change Logs and Impact were stated clearly
  • [x] Adequate tests were added if applicable
  • [ ] CI passed

the-other-tim-brown avatar Sep 23 '22 05:09 the-other-tim-brown

CI report:

  • a922a5beca9991c7b57e9640033075d5e289e5e2 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

  • @hudi-bot run azure: re-run the last Azure build

hudi-bot avatar Sep 26 '22 21:09 hudi-bot

CI is flaky due to an unrelated issue. Going ahead with merging. https://github.com/apache/hudi/pull/6801

nsivabalan avatar Sep 27 '22 04:09 nsivabalan