karmem

Support for Unions

Open delaneyj opened this issue 2 years ago • 4 comments

I use unions heavily in my FlatBuffers code for event-type switching, but they appear to be missing here.

delaneyj (Aug 01 '22 14:08)

That is something I'm looking to implement. Currently, the main branch already generates a PacketIdentifier, which is deterministic and based on the name of the struct (that will be improved soon). That is the first step toward introducing Union, but it is already enough for simple use cases.
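A minimal sketch of how an identifier alone can cover the simple cases, assuming a 64-bit identifier written in front of each payload. The constants `packetIDA` and `packetIDB` and the helpers `frame` and `dispatch` are hypothetical and are not karmem's generated API; real generated values will differ.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Hypothetical identifiers standing in for generated ones.
const (
	packetIDA uint64 = 0x1001
	packetIDB uint64 = 0x1002
)

// frame prepends an 8-byte little-endian identifier to a payload.
func frame(id uint64, payload []byte) []byte {
	out := make([]byte, 8, 8+len(payload))
	binary.LittleEndian.PutUint64(out, id)
	return append(out, payload...)
}

// dispatch reads the identifier and reports which type the payload holds.
func dispatch(msg []byte) (kind string, payload []byte, err error) {
	if len(msg) < 8 {
		return "", nil, errors.New("message too short")
	}
	payload = msg[8:]
	switch binary.LittleEndian.Uint64(msg) {
	case packetIDA:
		return "A", payload, nil
	case packetIDB:
		return "B", payload, nil
	default:
		return "", nil, errors.New("unknown packet identifier")
	}
}

func main() {
	kind, payload, err := dispatch(frame(packetIDB, []byte{1, 2, 3}))
	fmt.Println(kind, len(payload), err)
}
```

This only works when the whole message is one of the candidate types; it does not help when a union is a field nested inside another struct.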

The main issue with Union is space re-use: since a single field can hold multiple types, switching types invalidates any pre-allocated arrays/structs living there, which causes allocations and puts more pressure on the GC. I'm not sure how to mitigate that without making the API too complex.

inkeliz (Aug 07 '22 12:08)

I'm reviewing this issue again, but I can't find a balance between performance and usability. Either the design has better performance (and fewer allocations), or it is easy to use (and then generates a lot of garbage).

Consider the following Go code:

type MyUnion interface{}

type MyStruct struct {
	Field MyUnion
}

The empty interface acts like a tag interface, so other struct types can implement it and be a valid Field. In languages like C# and Swift, the equivalent is an empty interface/protocol. In languages like C, the equivalent is void * + uint32_t, with the integer used to check the type.

The issue: every time you decode, you might need to allocate, because you can't re-use the existing MyUnion. So, if A and B are both valid MyUnion values and you encode A, then B, then A, you need to allocate in all three cases.
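A minimal sketch of that A, B, A sequence. The member types `A` and `B` and the `decodeInto` and `countAllocations` helpers are hypothetical; they only illustrate that the stored value can be re-used when consecutive values have the same type, and has to be replaced otherwise.

```go
package main

import "fmt"

// MyUnion is the empty tag interface from the text.
type MyUnion interface{}

// A and B are hypothetical union members.
type A struct{ X uint32 }
type B struct{ Y uint64 }

type MyStruct struct {
	Field MyUnion
}

// decodeInto is a hypothetical decoder. It re-uses the stored value only
// when it already has the incoming type; otherwise it creates a new one.
func decodeInto(s *MyStruct, kind byte, v uint64) (allocated bool) {
	switch kind {
	case 'A':
		a, ok := s.Field.(*A)
		if !ok {
			a, allocated = new(A), true
			s.Field = a
		}
		a.X = uint32(v)
	case 'B':
		b, ok := s.Field.(*B)
		if !ok {
			b, allocated = new(B), true
			s.Field = b
		}
		b.Y = v
	}
	return allocated
}

// countAllocations decodes the given kinds in order and counts new values.
func countAllocations(kinds string) int {
	var s MyStruct
	n := 0
	for i := 0; i < len(kinds); i++ {
		if decodeInto(&s, kinds[i], uint64(i)) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countAllocations("ABA")) // 3: every step replaces the stored value
	fmt.Println(countAllocations("AAA")) // 1: only the first step needs a new value
}
```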

However, if we change this design to:

type MyStruct struct {
	FieldType uint32
	FieldAsA  *A
	FieldAsB  *B
}

Given the same sequence (A, then B, then A), the allocation for the last A can be avoided, because FieldAsA will already be populated and can be re-used.

However, that is terrible for usability, and it may leak data if used improperly or if the wrong FieldType is set.
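A minimal sketch of both effects, assuming hypothetical member types `A` and `B`, hypothetical tag constants, and hypothetical `SetA`/`SetB` helpers. The third call re-uses the existing `FieldAsA`, but the stale `FieldAsB` value is still reachable, which is the leak risk.

```go
package main

import "fmt"

// A and B are hypothetical union members.
type A struct{ X uint32 }
type B struct{ Y uint64 }

// Hypothetical discriminator values.
const (
	typeNone uint32 = iota
	typeA
	typeB
)

type MyStruct struct {
	FieldType uint32
	FieldAsA  *A
	FieldAsB  *B
}

// SetA makes A the active member, re-using FieldAsA when it exists.
func (s *MyStruct) SetA(x uint32) (allocated bool) {
	if s.FieldAsA == nil {
		s.FieldAsA, allocated = new(A), true
	}
	s.FieldAsA.X = x
	s.FieldType = typeA
	return allocated
}

// SetB makes B the active member, re-using FieldAsB when it exists.
func (s *MyStruct) SetB(y uint64) (allocated bool) {
	if s.FieldAsB == nil {
		s.FieldAsB, allocated = new(B), true
	}
	s.FieldAsB.Y = y
	s.FieldType = typeB
	return allocated
}

func main() {
	var s MyStruct
	fmt.Println(s.SetA(1), s.SetB(2), s.SetA(3))    // true true false
	fmt.Println(s.FieldType == typeA, s.FieldAsB.Y) // true 2 (stale B is still there)
}
```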


Currently, we have one (bad) way of doing unions. One of my experiments is:

struct MyStruct table @packed(true) {
     FieldA [<1]A;
     FieldB [<1]B;
}

That will use 20 bytes, because each field is 8 bytes, plus 4 bytes of table header. It also maps to the following Go code:

type MyStruct struct {
	FieldA []A
	FieldB []B
}

Similar to the previous construction, that allows the allocated value to be re-used. It is also easier to understand, since almost all modern languages (Go, Odin) have ways to check capacity and length. If the capacity is 1 and the length is 0, we can increase the length to 1 and then re-use the allocated variable. But some languages (such as C# and Swift) make this awkward.

That construction is also great for SoA: if we have []MyUnion, the same construction could be used. Of course, the result will be unordered.
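The length/capacity trick can be sketched as follows. The member types `A` and `B` and the `SetA`/`SetB` helpers are hypothetical; the point is only that truncating a slice to length 0 keeps its backing array, so the next switch back to that member needs no new allocation.

```go
package main

import "fmt"

// A and B are hypothetical union members.
type A struct{ X uint32 }
type B struct{ Y uint64 }

type MyStruct struct {
	FieldA []A
	FieldB []B
}

// SetA makes FieldA the active member (length 1) and empties FieldB
// while keeping FieldB's capacity for later re-use.
func (s *MyStruct) SetA(x uint32) {
	s.FieldB = s.FieldB[:0]
	if cap(s.FieldA) < 1 {
		s.FieldA = make([]A, 1)
	} else {
		s.FieldA = s.FieldA[:1] // capacity is already there: just grow the length
	}
	s.FieldA[0] = A{X: x}
}

// SetB is the mirror image of SetA.
func (s *MyStruct) SetB(y uint64) {
	s.FieldA = s.FieldA[:0]
	if cap(s.FieldB) < 1 {
		s.FieldB = make([]B, 1)
	} else {
		s.FieldB = s.FieldB[:1]
	}
	s.FieldB[0] = B{Y: y}
}

func main() {
	var s MyStruct
	s.SetA(1)
	first := &s.FieldA[0]
	s.SetB(2)
	s.SetA(3)
	// The same backing element is used again for the second A.
	fmt.Println(first == &s.FieldA[0], len(s.FieldB), cap(s.FieldB)) // true 0 1
}
```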


Another alternative is to use sync.Pool, or even to recommend not using the Decode API at all and only using the Viewer, which doesn't have any performance issue. However, that doesn't fix the writer, which would require users to implement their own sync.Pool. Creating a RawWriter is also possible, but it doesn't fix the core issue of the Decode/Encode API.
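A minimal sketch of the sync.Pool route, assuming a hypothetical member type `A` and hypothetical `acquireA`/`releaseA` helpers. Note that sync.Pool gives no guarantee that a released value is actually handed back by the next Get, so this reduces allocations on average rather than eliminating them.

```go
package main

import (
	"fmt"
	"sync"
)

// A is a hypothetical union member.
type A struct{ X uint32 }

// poolA hands out *A values, creating new ones only when it is empty.
var poolA = sync.Pool{
	New: func() any { return new(A) },
}

// acquireA takes a value from the pool and fills it in.
func acquireA(x uint32) *A {
	a := poolA.Get().(*A)
	a.X = x
	return a
}

// releaseA clears the value before returning it, so old data cannot leak
// into the next user.
func releaseA(a *A) {
	*a = A{}
	poolA.Put(a)
}

func main() {
	a := acquireA(7)
	fmt.Println(a.X)
	releaseA(a)

	b := acquireA(9) // may or may not be the same pointer as a
	fmt.Println(b.X)
	releaseA(b)
}
```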

inkeliz (Aug 30 '22 18:08)

My thought was to have something using the union constraints from Go 1.18, like:

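type UnionAB interface {
	*A | *B
}

// A union constraint can only be used as a type-parameter constraint,
// so MyStruct takes the member type as a parameter.
type MyStruct[T UnionAB] struct {
	Foo T
}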

You'd still need the discriminator over the wire, but the interface in Go could stay clean.
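A minimal sketch of how the wire discriminator could be derived from such a constraint. The member types `A` and `B`, the tag values 1 and 2, and the `Discriminator` method are hypothetical. One limitation worth stating: the member type is fixed per instantiation at compile time, so a single `MyStruct[*A]` value cannot later hold a `*B`.

```go
package main

import "fmt"

// A and B are hypothetical union members.
type A struct{ X uint32 }
type B struct{ Y uint64 }

// UnionAB restricts a type parameter to *A or *B.
type UnionAB interface {
	*A | *B
}

// MyStruct is generic because a union constraint cannot be a field type.
type MyStruct[T UnionAB] struct {
	Foo T
}

// Discriminator returns a hypothetical wire tag for the held member type.
func (s MyStruct[T]) Discriminator() uint32 {
	switch any(s.Foo).(type) {
	case *A:
		return 1
	case *B:
		return 2
	}
	return 0
}

func main() {
	a := MyStruct[*A]{Foo: &A{X: 1}}
	b := MyStruct[*B]{Foo: &B{Y: 2}}
	fmt.Println(a.Discriminator(), b.Discriminator()) // 1 2
}
```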

delaneyj (Aug 31 '22 19:08)

Unions are critical for some use cases; when you need them, you can't get by without them. You end up adding them ad hoc in the application code on top of the spec.

I don't have any thoughts on the Go interface; I think the wire protocol should come first, and then the experience in any particular language can build on top of that. For the wire protocol, maybe we can borrow from FlatBuffers and C. We'd need at least the discriminant. It seems like the rest of the language has no focus on forward/backward compatibility, being intended for in-memory use cases, so perhaps a length field is not necessary. In memory, you could represent it just like a C union, i.e. a Union<A, B> takes sizeof(discriminant) + max(sizeof(A), sizeof(B)) bytes. It appears Go has some support for 'unsafe' or whatever tricks you would need to cast this buffer back to the correct type based on the discriminant.
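A minimal sketch of that layout in Go, assuming a 4-byte little-endian discriminant followed by a payload area sized for the largest member. The member types `A` and `B`, the tag values, and the `UnionAB` type are hypothetical, and the sketch uses encoding/binary rather than the unsafe package to keep the example portable; an unsafe cast would avoid the copy at the cost of alignment and layout assumptions.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// A and B are hypothetical union members (4 and 8 bytes of payload).
type A struct{ X uint32 }
type B struct{ Y uint64 }

const (
	tagA uint32 = 1
	tagB uint32 = 2

	tagSize     = 4
	payloadSize = 8 // max(size of A, size of B) = max(4, 8)
)

// UnionAB is the discriminant followed by the shared payload area.
type UnionAB [tagSize + payloadSize]byte

// reset writes the tag and zeroes the payload so no stale bytes remain.
func (u *UnionAB) reset(tag uint32) {
	binary.LittleEndian.PutUint32(u[:tagSize], tag)
	for i := tagSize; i < len(u); i++ {
		u[i] = 0
	}
}

func (u *UnionAB) SetA(a A) {
	u.reset(tagA)
	binary.LittleEndian.PutUint32(u[tagSize:], a.X)
}

func (u *UnionAB) SetB(b B) {
	u.reset(tagB)
	binary.LittleEndian.PutUint64(u[tagSize:], b.Y)
}

// Tag returns the current discriminant.
func (u *UnionAB) Tag() uint32 { return binary.LittleEndian.Uint32(u[:tagSize]) }

// AsA returns the payload as A only if the discriminant says it is an A.
func (u *UnionAB) AsA() (A, bool) {
	if u.Tag() != tagA {
		return A{}, false
	}
	return A{X: binary.LittleEndian.Uint32(u[tagSize:])}, true
}

// AsB returns the payload as B only if the discriminant says it is a B.
func (u *UnionAB) AsB() (B, bool) {
	if u.Tag() != tagB {
		return B{}, false
	}
	return B{Y: binary.LittleEndian.Uint64(u[tagSize:])}, true
}

func main() {
	var u UnionAB // fixed 12 bytes, re-used for every member
	u.SetA(A{X: 7})
	u.SetB(B{Y: 1 << 40})
	b, ok := u.AsB()
	fmt.Println(len(u), b.Y == 1<<40, ok)
}
```

Because the buffer has a fixed size, switching between A and B never allocates, which addresses the re-use concern raised earlier in the thread for fixed-size members; variable-length members would need a different scheme.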

haydenflinner (Apr 18 '23 16:04)