capnproto-rust
Null values for non active union members
I am converting a series of capnp messages into a columnar format (arrow specifically). One of the challenges with unions is non-active union fields. I recursively create a vector of dynamic value readers for each field and then convert that into arrays of arrow memory. When the field is a member of a union and active this works fine. When the field is not active then this creates fake data instead of null.
For example the schema:
struct TestUnion {
  union {
    foo @0 :UInt16;
    bar @1 :UInt32;
  }
}
With the data [{"foo": 1}, {"bar": 1}], this generates the output [{"foo": 1, "bar": 0}, {"foo": 0, "bar": 1}]. What I would expect is [{"foo": 1, "bar": null}, {"foo": null, "bar": 1}]. I have tried creating dynamic_value::Reader::Void when the field is non-active, but this is a challenge with nested struct and list types.
For structs I have tried creating a new empty dynamic_struct::StructReader using the private layout:
match capnp_field.get_type().which() {
    introspect::TypeVariant::Struct(st) => {
        dynamic_value::Reader::Struct(dynamic_struct::Reader::new(
            layout::StructReader::new_default(),
            schema::StructSchema::new(st),
        ))
    }
}
This still leads to primitive ints with a value of 0.
Is it possible to create readers with null values?
Would it make sense for non-active union fields to have null values (I assume the expectation is that users check has to find active values and ignore non-active values)?
What are you using to convert your dynamic values into JSON?
The stringify.rs logic is an example of how to iterate through the fields of a dynamic struct while accounting for union fields: https://github.com/capnproto/capnproto-rust/blob/eaad5e57451ea272d9a0fc0f1bb39c5d63f5c1b0/capnp/src/stringify.rs#L105-L166
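To make the idea concrete for the row-wise to columnar case, here is a minimal sketch using only the Rust standard library. It does not use the capnp or arrow crates: the TestUnion enum, the Columns struct and union_to_columns are hypothetical stand-ins for already-decoded messages and nullable columns. The point is only that the active member decides which column gets a value and which gets a null.

```rust
// Hypothetical stand-in for one decoded TestUnion message; not a capnp type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TestUnion {
    Foo(u16),
    Bar(u32),
}

// One nullable column per union member (None plays the role of an arrow null).
#[derive(Debug, Default, PartialEq)]
struct Columns {
    foo: Vec<Option<u16>>,
    bar: Vec<Option<u32>>,
}

// Convert row-wise messages into columns, one slot per row in every column.
fn union_to_columns(rows: &[TestUnion]) -> Columns {
    let mut cols = Columns::default();
    for row in rows {
        // Only the active member contributes a value; the other gets a null.
        match *row {
            TestUnion::Foo(v) => {
                cols.foo.push(Some(v));
                cols.bar.push(None);
            }
            TestUnion::Bar(v) => {
                cols.foo.push(None);
                cols.bar.push(Some(v));
            }
        }
    }
    cols
}
```

Calling union_to_columns(&[TestUnion::Foo(1), TestUnion::Bar(1)]) fills foo with a value followed by a null and bar with a null followed by a value, which is the shape asked for above. With dynamic readers, the same decision would presumably be made by checking which union member is active before reading a field.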
The JSON was just a visual example. I am actually converting into arrow arrays and then Polars series for a Polars dataframe.
I will dig into the stringify to see if that has the logic I am missing. My problem may be different because I am going from row-wise into columnar.
My real inputs are binary files with an unknown number of messages. I create the arrow schema with all of the same fields as the capnp schema, then iterate through the fields in the schema to create a vector of capnp readers. The capnp readers are then converted into arrow arrays.
The main problem with nested types is that I need to represent a struct and all the types within it even if it is not active in the union.
struct OuterStruct {
  struct InnerStruct {
    textField @0 :Text;
  }
  union {
    intField @0 :UInt16;
    structField @1 :InnerStruct;
  }
}
If we have three messages (I would actually convert to binary before running them in tests):
{"structField": {"textField": "first"}}
{"intField": 2}
{"structField": {"textField": "third"}}
I need to create three arrow arrays: a UInt16Array for intField, a Utf8Array for textField, and a StructArray for structField. This gives a dataframe that looks basically like this JSON:
{
  "intField": [null, 2, null],
  "structField": [{"textField": "first"}, {"textField": null}, {"textField": "third"}]
}
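The target layout above can be sketched with plain Rust vectors, assuming the messages have already been decoded. OuterStruct, InnerStruct, OuterColumns and outer_to_columns are hypothetical names for illustration only; they are neither capnp readers nor arrow arrays. The sketch only shows that the child column of structField keeps one slot per row and receives a null whenever structField is not the active member.

```rust
// Hypothetical stand-ins for decoded OuterStruct messages; not capnp types.
#[derive(Debug, Clone, PartialEq)]
struct InnerStruct {
    text_field: String,
}

#[derive(Debug, Clone, PartialEq)]
enum OuterStruct {
    IntField(u16),
    StructField(InnerStruct),
}

#[derive(Debug, Default, PartialEq)]
struct OuterColumns {
    // Column for intField.
    int_field: Vec<Option<u16>>,
    // Child column of structField: one slot per row, even when
    // structField is not the active union member.
    text_field: Vec<Option<String>>,
}

fn outer_to_columns(rows: &[OuterStruct]) -> OuterColumns {
    let mut cols = OuterColumns::default();
    for row in rows {
        match row {
            OuterStruct::IntField(v) => {
                cols.int_field.push(Some(*v));
                // structField is inactive, so its child gets a null.
                cols.text_field.push(None);
            }
            OuterStruct::StructField(inner) => {
                cols.int_field.push(None);
                cols.text_field.push(Some(inner.text_field.clone()));
            }
        }
    }
    cols
}
```

For the three messages above, int_field ends up as null, 2, null and text_field as "first", null, "third". Because the nulls are produced while walking the rows, no empty struct reader has to be fabricated for the inactive case.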
To help make these arrays my plan is to make the following capnp readers in this pseudocode (values of primitives in comments):
use capnp::dynamic_value::Reader;
let int_field = vec![Reader::Void, Reader::UInt16, Reader::Void]; // null, 2, null
let struct_field = vec![Reader::Struct(Reader::Text), Reader::Struct(Reader::Void), Reader::Struct(Reader::Text)]; // "first", null, "third"
The challenge is getting a struct with a null textField. Making a struct reader with all the primitive types replaced by Void readers is the main challenge I don't know how to solve. The entire reason I am working with Void readers at all is that my recursive traversal of the schema gives the following output:
use capnp::dynamic_value::Reader;
let int_field = vec![Reader::UInt16, Reader::UInt16, Reader::UInt16]; // 0, 2, 0
let struct_field = vec![Reader::Struct(Reader::Text), Reader::Struct(Reader::Text), Reader::Struct(Reader::Text)]; // "first", "", "third"
Another option would be to have the primitive readers that are non-active fields yield null values. This is the line in my code that extracts the primitive values. Note that the code I am testing on unions has not been pushed.
Does this help explain the problem?
Have you tried making the first member of your union a dummy unset @0 :Void?
See https://capnproto.org/language.html#unions
By default, when a struct is initialized, the lowest-numbered field in the union is “set”. If you do not want any field set by default, simply declare a field called “unset” and make it the lowest-numbered field.
Said differently, in capnproto unions are not messages: the union is not a pointer that can be left null. The union members are inline at that place in the message, and leaving that as all-zeroes just means @0 with zero values for all fields. See "Wait, why aren’t unions first-class types?" in https://capnproto.org/language.html#unions
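A rough way to picture this, as a sketch only: model the union slot as a discriminant plus inline data and decode an all-zero slot twice, once for a union whose lowest-numbered member is foo and once for a union that declares unset first. RawUnion and both decode functions are hypothetical and deliberately simplified; they do not reproduce capnp's actual wire layout or API.

```rust
// Hypothetical model of a union slot: a discriminant plus inline data.
// Illustrates the idea only; this is not capnp's wire layout.
#[derive(Debug, Default, Clone, Copy)]
struct RawUnion {
    discriminant: u16,
    data: u32,
}

// Union declared as { foo, bar }.
#[derive(Debug, PartialEq)]
enum WithoutUnset {
    Foo(u16),
    Bar(u32),
}

// Union declared as { unset, foo, bar }.
#[derive(Debug, PartialEq)]
enum WithUnset {
    Unset,
    Foo(u16),
    Bar(u32),
}

fn decode_without_unset(raw: RawUnion) -> Option<WithoutUnset> {
    match raw.discriminant {
        // Discriminant 0 is foo, so an all-zero slot reads as foo = 0.
        0 => Some(WithoutUnset::Foo(raw.data as u16)),
        1 => Some(WithoutUnset::Bar(raw.data)),
        _ => None, // unknown member
    }
}

fn decode_with_unset(raw: RawUnion) -> Option<WithUnset> {
    match raw.discriminant {
        // Discriminant 0 is the Void member, so an all-zero slot reads as unset.
        0 => Some(WithUnset::Unset),
        1 => Some(WithUnset::Foo(raw.data as u16)),
        2 => Some(WithUnset::Bar(raw.data)),
        _ => None, // unknown member
    }
}
```

Decoding RawUnion::default() with the first function yields Foo(0), which is indistinguishable from an explicitly written zero, while the second yields Unset, which a converter can map to null in every member column.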