capnproto-rust

Null values for non active union members

Open quartox opened this issue 1 year ago • 3 comments

I am converting a series of capnp messages into a columnar format (Arrow specifically). One of the challenges with unions is handling non-active union fields. I recursively create a vector of dynamic value readers for each field and then convert those into Arrow arrays. When the field is a member of a union and is active, this works fine. When the field is not active, this produces fake data instead of null.

For example the schema:

struct TestUnion {
  union {
    foo @0 :UInt16;
    bar @1 :UInt32;
  }
}

With data: [{"foo": 1}, {"bar": 1}] generates the output: [{"foo": 1, "bar": 0}, {"foo": 0, "bar": 1}]. What I would expect is [{"foo": 1, "bar": null}, {"foo": null, "bar": 1}]. I have tried creating dynamic_value::Reader::Void when the field is non-active, but this is challenging with nested struct and list types.
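A minimal sketch of the null behavior I expect, written with plain Rust types rather than the capnp API. The `TestUnion` enum and `union_to_columns` function are hypothetical stand-ins for the schema above, not anything provided by capnproto-rust. Only the active member contributes a value; every other column receives `None`.

```rust
/// Hypothetical stand-in for the `TestUnion` schema: exactly one member is active.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TestUnion {
    Foo(u16),
    Bar(u32),
}

/// Converts rows into one column per union member.
/// The inactive member of each row becomes `None` rather than a zero default.
fn union_to_columns(rows: &[TestUnion]) -> (Vec<Option<u16>>, Vec<Option<u32>>) {
    let mut foo = Vec::with_capacity(rows.len());
    let mut bar = Vec::with_capacity(rows.len());
    for row in rows {
        match *row {
            TestUnion::Foo(v) => {
                foo.push(Some(v));
                bar.push(None);
            }
            TestUnion::Bar(v) => {
                foo.push(None);
                bar.push(Some(v));
            }
        }
    }
    (foo, bar)
}
```

Both columns stay the same length as the input, which is what a columnar format needs.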

For structs I have tried creating a new empty dynamic_struct::StructReader using the private layout:

match capnp_field.get_type().which() {
    introspect::TypeVariant::Struct(st) => {
        dynamic_value::Reader::Struct(dynamic_struct::Reader::new(
            layout::StructReader::new_default(),
            schema::StructSchema::new(st),
        ))
    }
}

This still leads to primitive ints with 0 value.

Is it possible to create readers with null values? Would it make sense for non-active union fields to have null values? (I assume the expectation is that users call has to find active values and ignore non-active ones.)

quartox avatar Dec 09 '23 14:12 quartox

What are you using to convert your dynamic values into JSON?

The stringify.rs logic is an example of how to iterate through the fields of a dynamic struct while accounting for union fields: https://github.com/capnproto/capnproto-rust/blob/eaad5e57451ea272d9a0fc0f1bb39c5d63f5c1b0/capnp/src/stringify.rs#L105-L166

dwrensha avatar Dec 09 '23 15:12 dwrensha

The json was just a visual example. I am actually converting into arrow arrays and then Polars series for a Polars dataframe.

I will dig into the stringify to see if that has the logic I am missing. My problem may be different because I am going from row-wise into columnar.

My real inputs are binary files with an unknown number of messages. I create the Arrow schema with all of the same fields as the capnp schema, then iterate through the fields in the schema to create a vector of capnp readers. The capnp readers are then converted into Arrow arrays.

The main problem with nested types is that I need to represent a struct and all the types within it even if it is not active in the union.

struct OuterStruct {
  struct InnerStruct {
    textField @0 :Text;
  }
  union {
    intField @0 :UInt16;
    structField @1 :InnerStruct;
  }
}

If we have three messages (I would actually convert to binary before running them in tests):

{"structField": {"textField": "first"}}
{"intField": 2}
{"structField": {"textField": "third"}}

I need to create three arrow arrays: a UInt16Array for intField, a Utf8Array for textField, and a StructArray for structField. This gives a dataframe that looks basically like this json:

{
"intField": [null, 2, null],
"structField": [{"textField": "first"}, {"textField": null}, {"textField": "third"}] 
}
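A rough sketch of that target layout in plain Rust, without the capnp or Arrow crates. `OuterUnion`, `OuterColumns`, and `outer_to_columns` are hypothetical names for illustration only. The struct column is represented by its child column, so a row where structField is inactive still occupies a slot in textField, filled with `None`.

```rust
/// Hypothetical stand-in for the union inside `OuterStruct`.
#[derive(Debug, Clone, PartialEq)]
enum OuterUnion {
    IntField(u16),
    StructField { text_field: String },
}

/// One column per leaf field; `text_field` is the child column of `structField`.
#[derive(Debug, Default, PartialEq)]
struct OuterColumns {
    int_field: Vec<Option<u16>>,
    text_field: Vec<Option<String>>,
}

/// Row-wise to columnar: every row pushes exactly one entry into every column,
/// so all columns keep the same length as the input.
fn outer_to_columns(rows: &[OuterUnion]) -> OuterColumns {
    let mut cols = OuterColumns::default();
    for row in rows {
        match row {
            OuterUnion::IntField(v) => {
                cols.int_field.push(Some(*v));
                // structField is inactive, so its child is null for this row.
                cols.text_field.push(None);
            }
            OuterUnion::StructField { text_field } => {
                cols.int_field.push(None);
                cols.text_field.push(Some(text_field.clone()));
            }
        }
    }
    cols
}
```

The null is decided at the union level and pushed down to every leaf of the inactive member, rather than being read from the inactive reader itself.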

To help make these arrays, my plan is to make the following capnp readers, shown in this pseudocode (values of primitives in comments):

use capnp::dynamic_value::Reader;
let int_field = vec![Reader::Void, Reader::UInt16, Reader::Void]; // null, 2, null
let struct_field = vec![Reader::Struct(Reader::Text), Reader::Struct(Reader::Void), Reader::Struct(Reader::Text)]; // "first", null, "third"

The challenge is getting a struct with a null textField. Making a struct reader with all the primitive types replaced by Void readers is the main problem I don't know how to solve. The entire reason I am working with Void readers at all is that my recursive traversal of the schema gives the following output:

use capnp::dynamic_value::Reader;
let int_field = vec![Reader::UInt16, Reader::UInt16, Reader::UInt16]; // 0, 2, 0
let struct_field = vec![Reader::Struct(Reader::Text), Reader::Struct(Reader::Text), Reader::Struct(Reader::Text)]; // "first", "", "third"

Another option would be to have the primitive readers that are non-active fields yield null values. This is the line in my code that extracts the primitive values. Note that the code I am testing on unions has not been pushed.

Does this help explain the problem?

quartox avatar Dec 10 '23 00:12 quartox

Have you tried making the first member of your union a dummy unset @0 :Void?

See https://capnproto.org/language.html#unions

By default, when a struct is initialized, the lowest-numbered field in the union is “set”. If you do not want any field set by default, simply declare a field called “unset” and make it the lowest-numbered field.

Said differently, in capnproto unions are not messages, and a union is not a pointer that can be left null. The union members are stored inline at that place in the message, and leaving that space as all-zeroes just means @0 is set, with zero values for all fields. See "Wait, why aren’t unions first-class types?" in https://capnproto.org/language.html#unions
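To illustrate, here is a simplified model in plain Rust. It is not the real capnp wire layout, and `RawTestUnion` and `Which` are hypothetical names. It only shows why a zeroed struct looks like "foo is set to 0" rather than "nothing is set".

```rust
/// Simplified model (not the real capnp layout) of a struct whose union
/// lives inline: a discriminant plus storage shared by all members.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct RawTestUnion {
    discriminant: u16, // 0 selects the lowest-numbered member
    data: u32,         // shared storage for `foo` and `bar`
}

/// The active member, as decided by the discriminant.
#[derive(Debug, PartialEq)]
enum Which {
    Foo(u16),
    Bar(u32),
}

impl RawTestUnion {
    /// Returns the active member, or `None` for a discriminant this model does not know.
    fn which(&self) -> Option<Which> {
        match self.discriminant {
            0 => Some(Which::Foo(self.data as u16)),
            1 => Some(Which::Bar(self.data)),
            _ => None,
        }
    }
}
```

In this model `RawTestUnion::default().which()` reports `Foo(0)`, so there is no state that means "no member set" unless the schema reserves one, which is what the dummy unset @0 :Void member does.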

tv42 avatar Feb 02 '24 22:02 tv42