Feature/221 serialize enums as flattened structs
This MR isn't complete, still need to add the correct functionality to array builder / sequence builder. This is just as an example of the direction I was going, was hoping for your thoughts?
With this change, from_samples outputs Fields with the desired flattened structure.
Hi @raj-nimble
Thanks. I like the idea a lot!
ATM, I don't really have the time for serde arrow. I will have a proper look next week!
First feedback: it's great, that you target the 0.12 release. I would like to limit the current main to bug fixes for the 0.11 release.
Re builders we can also figure it out together, once I'm back at my laptop. I imagine the deserialization logic will be a bit tricky. Maybe the impl of untagged enums in serde could be an inspiration (I like use the rust playground to expand the generated deserialize impl).
Hi @chmp that's great, I'm happy you like it. I will continue to iterate slowly on my own this week and I look forward to discussing it more with you next week.
Hi @chmp . Thanks so much for the comments! I haven't made any progress since last week as I was working on something else, but with your comments now I can hopefully circle back to this starting next week and I'll update you as soon as I have made some progress. Thanks!
Probably, I would implement this as serialization only for now.
Agreed, it's best to implement the feature as serialization only for now. Can add a separate issue to track adding deserialization.
Hi @chmp ,
I'm struggling to find an elegant way to do the serialization.
I started out attempting to modify StructBuilder to handle it but the issue becomes that the value being serialized is the actual Enum and it was difficult to attempt to detect that in ArrayBuilder, and deconstruct it into it's fields to call serialize on that.
If not detected, and we do something like trying to return the field ArrayBuilder, due to the serialization it ends up calling serialize_struct_field on the various primitive builders and it didn't seem like the best idea to go through and implement that serialization method on all the primitive builders (although maybe that really is the best way?).
I considered redoing the schema creation to continue to use UnionBuilder and then add logic there to serialize it as a struct instead but I think that still ran into problems with arrow writer and seemed more convoluted.
I then considered creating an entirely new builder, something like an EnumStructBuilder. I think that would move the complexity into OuterSequenceBuilder::new, into the build_builder and build_struct inner methods but I'm not quite sure yet. There was some small complexity in how dispatch! would choose to use this new builder which made it less elegant.
Do you have any time to take a closer look and give some guidance on exactly how this might best be done? My 2 current options seem to be to make a completely new builder, or use a small modification on StructBuilder which then adds struct field handling to all the primitive builders.
FYI, I rebased onto the latest from develop-0.12.
Thanks, Raj
@raj-nimble good questions, what's the best option.
One option would probably be to copy the code of UnionBuilder into a separate file and adapt it:
- build a union builder first, then modify it. This should fail if any variant is not serialized as a struct (as otherwise it does not make sense to merge it). I would include the logic merging the field names here.
- merge the different struct builders in
into_array(). Note that this would require to callserialize_noneon the unused builders
In pseudo code:
// probably you need to do something different here, as you plan to use `Struct` as the type with a custom strategy
pub fn from_union_builder(builder: UnionBuilder) -> Result<FlattenedUnionBuilder> {
let builders = Vec::new();
for (child_builder, child_meta) in builder.fields {
// ensure the child builders are nullable here (otherwise `serialize_none` will fail)
let ArrayBuilder::Struct(child_builder) = child_builder else { fail!("Children must be structs") };
// modify the fields to include the variant prefix, the variant name will be in child_meta.name
}
// construct the builder
Ok(FlattenedUnionBuilder { ... })
}
pub fn into_array(self) -> Result<Array> {
let mut fields = Vec::new();
// note: the meta data for the variant is most likely not used, I would simply drop it
for (_, builder) in self.fields.into_iter().enumerate() {
let ArrayBuilder::Struct(builder) = builder else { fail!("..."); };
// concatenate the fields of the different variants
for (field, meta) in builder.fields {
fields.push((idx.try_into()?, builder.into_array()?, meta));
}
}
Ok(Array::Struct(StructArray {
// note: you will most likely need to keep track of the number of elements being added, simply add self.len += 1 or similar in the different `serialize_*` methods
len: self.len,
validity: self.seq.validity,
fields,
}))
}
pub fn serialize_variant(&mut self, variant_index: u32) -> Result<&mut ArrayBuilder> {
self.len += 1;
let variant_index = variant_index as usize;
// call push_none for any variant that was not selected
for (idx, builder) in self.fields.iter_mut().enumerate() {
if idx != variant_idx {
builder.serialize_none()?;
}
}
let Some(variant_builder) = self.fields.get_mut(variant_index) else {
fail!("Could not find variant {variant_index} in Union");
};
Ok(variant_builder)
}
Hi @chmp , thanks so much for your suggestions / guidance, it was immensely helpful. I have pushed up the current draft of work. I haven't written any unit tests or done any cleanup up yet but this version is working with the new builder in the most naive sense, and in addition to working in a small test app, I also tested it within our core application and it works there as well for structs that do not contain UUIDs. I plan to spend the next day or two cleaning it up and writing tests.
At some point, the UUID support could potentially become the next big hurdle. Our application uses them quite extensively, and as part of using the crate in our app, I switched to using from_type instead of from_samples since the latter was causing an Error in serde_arrow with a "root type nullable" issue but maybe that's a bug in my current implementation, not sure yet. Unfortunately, the UUID workaround we arrived at in https://github.com/chmp/serde_arrow/issues/203 requires using from_samples.
I commented here on my current workaround for UUIDs when using from_type
https://github.com/chmp/serde_arrow/issues/203#issuecomment-2362294959
@raj-nimble regarding tests: I will add a method to easily construct the new internal array objects. this way you should be able to easily get the underlying data out and compare whether everything worked.
@raj-nimble I started to work o a proper release for the 0.12 series. Hence the change in base branch.
Hi @chmp sorry for the delay here. I got pulled away from this for a bit but I'm really hoping to finish up here this week or early next week. Did you get a chance yet to add support to simplify getting the data of the arrays so I can move the tests?
Great to hear!
Re.: getting the data out of the arrays: you can use build_arrays of the ArrayBuilder and then match on the variants of the Array enum. Something along the lines of
// see also: https://github.com/chmp/serde_arrow/blob/main/serde_arrow/src/test_with_arrow/impls/chrono.rs#L659
let mut builder = ArrayBuilder::new(schema).unwrap();
builder.extend(&items).unwrap();
let arrays = builder.build_arrays().unwrap();
let [array] = arrays.try_into().unwrap();
let Array::Struct(array) = array else {
panic!("Expected struct array, got: {array:?}");
};
let (first_field, meta) = &array.fields[0];
assert_eq!(meta.name, "...");
assert_eq!(first_field, Array::Int32(PrimtiveArray { validity: None, values: vec![1, 2, 3, 4] }));
Hi @chmp I have an issue and I am not yet certain how to fix it. Maybe you can provide some insight?
I have a test case pushed up that reproduces the issue at serde_arrow/src/test_with_arrow/impls/flattened_union.rs.
The error I am getting is:
thread 'test_with_arrow::impls::flattened_union::test_flattened_union_with_nested_enum' panicked at serde_arrow/src/test_with_arrow/impls/flattened_union.rs:230:10:
failed to serialize: Error: Cannot push null for non-nullable array (data_type: "UInt32", field: "$.data.Two.opts.loc.key")
To repro/demonstrate, lets say we have this nested struct->enum->struct->enum structure
#[derive(Serialize, Deserialize)]
struct ComplexMessage {
data: MsgData,
}
#[derive(Serialize, Deserialize)]
enum MsgData {
One { data: usize },
Two { opts: MsgOptions },
}
#[derive(Serialize, Deserialize)]
struct MsgOptions {
loc: Location,
}
#[derive(Serialize, Deserialize)]
enum Location {
Left,
Right,
}
and we have a schema and data as follows
fn nested_enum_schema() -> SerdeArrowSchema {
let options = TracingOptions::default()
.allow_null_fields(true)
.enums_without_data_as_strings(true)
.enums_with_named_fields_as_structs(true);
SerdeArrowSchema::from_type::<ComplexMessage>(options).unwrap()
}
fn nested_enum_data() -> Vec<ComplexMessage> {
vec![
ComplexMessage {
data: MsgData::One { data: 3 },
},
ComplexMessage {
data: MsgData::Two {
opts: MsgOptions {
loc: Location::Right,
},
},
},
]
}
the error above shows up when we serialize
let mut builder = ArrayBuilder::new(nested_enum_schema()).unwrap();
let serializer = Serializer::new(builder);
nested_enum_data()
.serialize(serializer)
.expect("failed to serialize")
.into_inner()
.to_arrow()
.expect("failed to serialize to arrow");
The fields look like this
[
Struct(
StructArray {
len: 0,
validity: None,
fields: [
(
UInt64(
PrimitiveArray {
validity: Some(
[],
),
values: [],
},
),
FieldMeta {
name: "One::data",
nullable: true,
metadata: {},
},
),
(
Struct(
StructArray {
len: 0,
validity: Some(
[],
),
fields: [
(
Dictionary(
DictionaryArray {
indices: UInt32(
PrimitiveArray {
validity: None,
values: [],
},
),
values: LargeUtf8(
BytesArray {
validity: None,
offsets: [
0,
],
data: [],
},
),
},
),
FieldMeta {
name: "loc",
nullable: false,
metadata: {},
},
),
],
},
),
FieldMeta {
name: "Two::opts",
nullable: true,
metadata: {},
},
),
],
},
),
]
As I understand it, what is happening is:
- The first variant in the first data structure is serialized, and 2nd variant builder is skipped, so we call
StructBuilder::serialize_none() StructBuilder::serialize_none()callsDictionaryUTF8Builder::serialize_default()DictionaryUTF8Builder::serialize_default()callsIntBuilder::serialize_none()IntBuilder::serialize_none()callsself.array.push_scalar_none()which callsset_validity()with no buffer and value false which fails
So while I understand how we arrive at that section of the code, I don't really understand the underlying behavior of the Dict builder or Int Builder on why this is so. I believe the Dict Builder is a result of the TracingOptions::enums_without_data_as_strings field which we do want but I'm not sure how to approach allowing this variant to be none.
Do you have any thoughts on what I might need to do here?
@raj-nimble Oh that's a tricky. The reason is the following:
- A null struct value needs to have value for its fields, that why
serialize_nonecallsserialize_defaultfor each of its fields - The call to
serialize_defaultin the dictionary array serializes a null value, as there is no reasonable default. The default for strings, an empty string, is not guaranteed to exist. The index0is not guaranteed to map to a valid string, if there are no values for this dictionary (probably a corner case, but nonethless)
One fix would be to force nested dictionary fields to be nullable with enums_with_named_fields_as_structs, regardless of the data. Something like:
fn fix_dictionaries(field: &mut Field) {
if matches!(field.data_type, DataType::Dictionary(_, _, _)) {
field.nullable = true;
} else if let DataType::Struct(children) = &mut field.data_type {
for child in children {
fix_dictionaries(child);
}
}
}
The recursion on struct is necessary, as nested structs that contain dictionaries will trigger the same error.
Just a heads up: unions in Arrow are never nullable. For nested enums, that cannot be mapped to dictionaries, you will run into a similar issue. Unfortunately, I have no idea what would be a solution here. I would suggest, to simply catch this case and raise an error in schema tracing.
Maybe it would also be worthwhile to add checks for both issues, non-nullable dictionaries in structs and enums in structs, to the serializer. The schema can also be constructed manually and would result in hard to track down errors.
Hi @chmp , sorry for the recent radio silence and lack of progress. I got pulled into more urgent tasks but this has definitely not fallen off my radar. I do plan to return to this hopefully in the next month.
For a summary of where I left off:
- I think most of it is good to go. The naive happy-path behavior is all working and some extended testing in our main application was all working as expected. There are some limitations but they were all expected and able to be handled.
- I think there is probably some gaps in unit test coverage still.
- I had found one specific bug that I was investigating at the time I dropped off. I have a unit test that reproduces the issue but I had not started investigating it yet.
- My plan when I can come back to this is to fix that bug, confirm with our application, add any other possibly missing unit tests, then open this up for official review.
Sorry for the delay with this.
@raj-nimble Thanks for the update. And no worries. I myself am quite busy with other things and can't really devote much time to serde_arrow. Just give me a ping when I should support :)