
What would a "FlatBuffers2" binary format look like?

Open aardappel opened this issue 4 years ago • 135 comments

FlatBuffers' binary format has been set in stone for 6.5 years now, because we value binary forwards/backwards compatibility highly, and because we have a large investment in 15 or so language implementations / parsers etc. that would not be easy to redo.

So "V2" in the sense of a new format that breaks backwards compatibility may never happen. But there is definitely a list of issues with the existing format that would be nice to address if a new format were ever to happen. I realized I never made such a list. It would be nice to at least fantasize about what such a format could look like :)

Please comment what you would like to see. Note that this list is purely for things that would change the binary encoding, or larger additions to the binary encoding. Anything that can be solved with code / new APIs outside of the binary format does not belong on this list.

  1. Remove all padding. Modern CPUs can access unaligned data at normal speed. This would shrink the format somewhat, and encourage other variable size things. If anyone ever needs padding to be compatible with a C struct, explicit padding can always be added, or it can be an opt-in feature.
  2. Make unions into a single field (and vectors of them into a single value). Also make the type part 16-bit while we're at it, so a union is always a 6-byte struct.
  3. Remove the 2nd field of the vtable. This field stores the table size, but it is never used in any implementations. This was intended for a streaming API that never happened.
  4. Allow different size vtable offsets. Currently they're always 16-bit, but for small tables 8-bit would be feasible. Since we code-gen vtable access, this would come at no cost. Use with care of course, because once you choose this smaller size you can't undo it when your table grows.
  5. Allow inline vtables when we determined they're unlikely to be shared. Saves an offset.
  6. Allow inline strings, vectors (and maybe scalars), meaning a vtable offset would refer directly to the string, rather than to the string offset. Saves the offset. Of course this puts more pressure on the vtable offset size, so use with care. Similarly, could even do inline scalars of all small scalar types. Of course this makes it more likely that vtables are unequal, so this is a tradeoff; it would work well with inline vtables.
  7. Remove 0-termination of strings. Only C/C++ care for this, and C++ has been moving toward string_view recently, and both have been using size_t arguments for a long time rather than relying on strlen. Other languages don't use it. For passing to super-old C APIs that expect 0-termination, either swap the terminating byte temporarily while passing that string, or copy.
  8. Allow 8 and 16 bit size fields on strings and vectors, currently they're always 32. Good for small strings. Combine all the string optimisations above together, and the string "a" goes from 12 bytes (2 vtable + 4 offset + 4 size + 1 string + 1 terminator) to 3 bytes (1 vtable + 1 size + 1 string). Of course this is very inflexible and special purpose, but it gives users more options for compact data storage. Again, like all format variation above, this comes at no runtime cost, just some codegen complexity.
  9. Construct the buffer forwards (rather than backwards like currently all implementations). This simplifies a lot of code and buffer management. Unsigned child offsets would now always point downwards in memory. Downside: must now detect table fields pointing to the table itself.
  10. Always have a file_identifier, and make it the first thing in the buffer? Always have a length field as well?
  11. Support 64-bit offsets from the start. They would be optional for vectors and certain other things, allowing buffers >2GB. See https://github.com/google/flatbuffers/projects/10#card-14545298
  12. For a buffer that has entirely un-shared vtables (see 5), it now becomes more feasible to allow in-place mutation of more complex things. This is definitely a complex/contentious feature, but I think if we ever re-booted the format this should be designed in from the start if possible.
  13. Deeply integrated FlexBuffers, basically allowing any field to cheaply be a FlexBuffers value such that it effectively becomes FlatBuffers' "dynamic type". Sharing of strings across such values rather than being an isolated nested buffer.
  14. Nested vectors. Not strictly a breaking change, but a new format would probably want to have them from the start.
  15. Built-in LEBs (variable sized integers) as an optional varint type for fields. They could be added to the existing format but make a lot more sense in a system with no alignment.

@rw @mikkelfj @vglavnyy @mzaks @mustiikhalil @dnfield @dbaileychess @lu-wang-g @stewartmiles @alexames @paulovap @AustinSchuh @maxburke @svenk177 @jean-airoldie @krojew @iceb0y @evanw @KageKirin @llchan @schoetbi @evolutional

aardappel avatar Apr 26 '20 19:04 aardappel

Some quick thoughts:

Ad 1. We would need to (potentially) deal with compiler/target platform problems when dealing with unaligned access. Is adding padding such a big problem at the moment?

Ad 4. This sounds like a compatibility problem while migrating schemas.

krojew avatar Apr 26 '20 19:04 krojew

@krojew

  1. is not a problem. In the most basic case you make every scalar load go through memcpy, which gets optimized by every compiler (tested with clang, gcc and vs) into a single memory load, but one that is guaranteed to work unaligned. Padding is not a huge problem, but not having it enables a lot of other features (see the rest of the list) which would normally be pointless since 32-bit alignment is so prevalent.

  2. Not sure what you mean. You would opt-in to 8-bit offsets. Once you put that in your schema, that table will always use 8-bit, in all languages, and never change.

aardappel avatar Apr 26 '20 19:04 aardappel

random thoughts:

Overall, I like many of the suggestions, but there are too many optional and variable parameters in the proposal which would make things slow. There is a lot more branching going on, and code complexity.

detractors to Wouter's original comment:

  • Ad. 4. variable size vtables make it slow.
  • Ad 5. inline vtables are not a good idea - they would always require a slow extra check, or at the very least it would have to be specific to the table type.
  • Ad 6. Optional offsets are slow. It would need to be a separate type. It's just too complex.
  • Ad 11. optional 64-bit offsets - again, the extra check costs speed, but I do think we need 64-bit offsets as a separate buffer type.
  • Ad 7. removal of 0 termination in strings: I'm not sure what that buys us. It saves a single byte if we no longer have padding requirements, that is all. I'm all in favor of string views when possible - I use something similar in C, but there is no standard, and it would force a memory copy in many C and C++ API calls, for example file names. I don't like that strings are different in FlatBuffers, but not having a 0 is worse.
  • Ad 12. mutations: these are inherently unsafe - verifiers would be expensive if they had to check that a mutation is safe - so I wouldn't design the format around that. I would make it simpler to copy a buffer or part of a buffer without knowing its type.
  • Ad 13. FlexBuffers - I'm not a big fan. I see it as a complication. I'd rather have strong JSON integration.
  • Ad 10. In the current FB format you cannot know if there is a file identifier, so in that sense it would be good to always require it, but I think it is not used very often in practice, and many different tables in the schema might be using the same file identifier. The type hash that FlatCC introduced to work around that was never broadly adopted. Human-specified identifiers are too easy to conflict in 4 bytes, so I think it would be better to remove it entirely. We should also think about always having a length or size field, but allow it to be 0 if unknown, e.g. while streaming. The size of that field is an open question: is it 32 bits, variable length, or what? If variable length, it cannot easily be updated after streaming.
  • Ad. 14 Nested vectors are good.

my inclusive comments:

  • Ad 1. I think we can remove padding - it causes a lot of complexity and it is not necessary on most platforms. In C, a flag could mark a platform as unaligned-unfriendly - there already is one - and then accessors would read differently - they already do in some cases.
  • need a 16-bit union type
  • ability to copy tables without knowing their type - requires some vtable annotation. Would allow some generated code to be library code, and allow some gateway processing without knowing the full schema.
  • drop nested tables - they are unreasonably complex to get right, although absence of padding would simplify this.
  • (Ad 8.) We could use a varint format for some fields. The QUIC protocol adopted an unsigned big-endian encoding where bits 6 and 7 of the first byte code a length of 1, 2, 4, or 8 bytes. That works very well for size fields. For flatbuffer offsets this could also work if the type was signed, but I am afraid it would slow things down significantly.
  • (Ad 9.) ability to stream buffers while being written - always use signed offsets, also better for some languages that dislike unsigned types, especially if 64 bit - see StreamBuffers: https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#streambuffers
  • Support for mixins beyond: https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#mixins - This would require support in the new format to avoid having subtypes always being remote tables instead of inline - it also requires some thinking - but last I looked at it, it seemed good.

I really would like to remove nested buffers - they have caused me so much headache and they haven't been properly implemented elsewhere. We could have a packed buffer format where multiple buffers can be embedded in the same file or memory block and referenced by some identifier without actually storing one buffer inside another. It could be an 8-byte random identifier. Buffers could also be given a random initial identifier in this format. Just a thought. This means how buffers are stacked or packed is not critical as long as they can be located.

mikkelfj avatar Apr 26 '20 19:04 mikkelfj

We also need a proper NULL type for representing database data.

mikkelfj avatar Apr 26 '20 19:04 mikkelfj

Somehow I had persuaded myself that 9) doesn't need a format change and that it could be done by just building from the other end of the buffer and allocating space for the full message and vtable when creating the object. It would require a pretty massive API change though. Maybe I'm just hopeful.

My understanding of the premise of flatbuffers (and Cap'n Proto, which we evaluated before picking flatbuffers) is that compression algorithms are wonderful, so don't burden the format with being clever and compact. Protobufs attempt that, and end up requiring a separate serialization/deserialization step. Sharing vtables goes against that.

Most of the rest of my feedback is at the API level. Happy to give it if there is interest.

AustinSchuh avatar Apr 26 '20 23:04 AustinSchuh

  • One of the selection factors that resulted in us picking flatbuffers over other formats like protobufs was that flatbuffers doesn't use varints. If they pop up in v2, perhaps it could be an opt-in? Or maybe an attribute applied to fields?

  • I agree with @AustinSchuh about compression; I think flatbuffers' niche is that by default they are very efficient at runtime, even if they trade off space for that efficiency. In our use we transport flatbuffers over the wire in http that's gzipped/deflated/brotli'd, and on disk we persist them squashed with zstandard, so encoding tricks I think wouldn't really buy us much, but would hurt in our application use.

  • Please-oh-please add size prefixing by default.

  • I would almost prefer an inversion of the current required/optional field status, having fields set to be required by default unless annotated to be optional.

maxburke avatar Apr 27 '20 02:04 maxburke

2. Not sure what you mean. You would opt-in to 8-bit offsets. Once you put that in your schema, that table will always use 8-bit, in all languages, and never change.

If that's an explicit opt-in, then it seems fine.

  • I would almost prefer an inversion of the current required/optional field status, having fields set to be required by default unless annotated to be optional.

I absolutely agree with this. I would extend this a bit further to add proper optional scalar fields. Adding bool flags for whether a scalar is present or not, as we need to right now if 0 is a legitimate value, is a quite frustrating workaround, given that non-scalars have this built-in.

krojew avatar Apr 27 '20 05:04 krojew

  • I would almost prefer an inversion of the current required/optional field status, having fields set to be required by default unless annotated to be optional.

I absolutely agree with this. I would extend this a bit further to add proper optional scalar fields. Adding bool flags for whether a scalar is present or not, as we need to right now if 0 is a legitimate value, is a quite frustrating workaround, given that non-scalars have this built-in.

We have optional scalar fields today. It just isn't plumbed up to be accessible by default to the user. The offset for the scalar fields in the vtable is 0 if they aren't populated. See CheckField<> and GetField<> (for how it handles defaults) for the gory implementation details. (I've got a patch which implements has_ accessors in C++ if there is upstream interest)

Protobuf started out with required being the default and concluded that it was a bad idea. https://github.com/protocolbuffers/protobuf/issues/2497 is a small part of the discussion.

AustinSchuh avatar Apr 27 '20 06:04 AustinSchuh

We have optional scalar fields today. It just isn't plumbed up to be accessible by default to the user.

If something isn't publicly available, it doesn't exist from the user perspective. We should expose such information.

Additionally, I have an impression we're reaching the classic dilemma of speed vs size. So the first question we need to answer is - which one do we favor? For me, FB was always about performance, so if we need to keep padding or sacrifice some internal improvements for the sake of being fast, I would personally stick with that. If some improvement doesn't impact performance, then let's consider it.

krojew avatar Apr 27 '20 06:04 krojew

  4. Allow different size vtable offsets. [...] you can't undo it when your table grows.

Maybe a varint?

rw avatar Apr 27 '20 07:04 rw

Would it be possible to have most (all?) of these features be specified with flags at the beginning of a payload? I know that that gets into "framing format" territory, but providing a set of feature flags at the beginning of a table (or at the beginning of an entire buffer) could allow a lot of flexibility at little cost.

It would probably just be an optional initial table at the beginning, containing metadata.

rw avatar Apr 27 '20 07:04 rw

My 5 cents.

  • removing padding is generally a good thing; I am a bit concerned about possible crashes though. In FlatBuffersSwift padding is optional and I had issues with A5 chips (iPad 2, for example) crashing on me. Could be that it can be mitigated with better library code though.
  • smaller vtable offsets: yes, this could also be a flag in the vtable size value. We don't need all 16 bits to represent the number of fields, so one bit can be spared to indicate whether the relative offsets are 1 or 2 bytes long
  • remove 0 termination in strings. Yes please :). I read the concerns from @mikkelfj, but honestly I think 0 termination is just wrong, specifically as the string is utf8 encoded and can have 0 values. So relying on 0 byte to be end of the string is dangerous anyways. As a compromise we could have a special cstring type. This is what languages like Swift and Rust have. The native string representation is utf8, but for fast interop with C, there is zero terminated cstring.
  • speaking of strings, I would suggest representing the length of the string in a varint format. Be it VLQ, FLIT, or something else. This would be a big win for short strings. FLIT is supposed to be faster than VLQ, but I am concerned that the linked C implementation ignores possible alignment issues.
  • 64-bit offsets, yes please, and could we please allow cycles now :). I am ok if it is under a configuration flag and users need to explicitly opt in, sign it in blood on the schema definition. But being able to represent full graphs and not just DAGs is big, as it opens up possibilities other formats cannot allow. Specifically for the object API: object graphs can have cycles, and with this we could encode them in FlatBuffers one to one. FlatBuffersSwift does it already and I am happy to help bring it to any other language. It would start with flatc identifying tables in schemas which have recursive (transitively recursive) definitions.
  • I have also ideas regarding possible breaking / non-breaking features for FlexBuffers, not sure if this is the right place to write them though.

mzaks avatar Apr 27 '20 07:04 mzaks

My personal view on the following:

  9. Construct the buffer forwards (rather than backwards like currently all implementations). This simplifies a lot of code and buffer management.

Really like it; at least for Swift, we would be able to handle things much more smoothly than the current implementation.

@maxburke

I would almost prefer an inversion of the current required/optional field status, having fields set to be required by default unless annotated to be optional.

@krojew

I absolutely agree with this. I would extend this a bit further to add proper optional scalar fields. Adding bool flags for whether a scalar is present or not, as we need to right now if 0 is a legitimate value, is a quite frustrating workaround, given that non-scalars have this built-in.

This is an amazing idea, actually, which made me wonder whether we could remove the vtables completely, since all the fields are going to be required (or, as mentioned above, optional fields can be represented with a bool). This would allow us to remove the vtable and use the generated table layout that's already predefined.

example: when trying to encode the monster object, the following sequence is written: 4, 0, 0, 0, 0, 8, 0, 12, 0, 16, 20, 0, 24, 0, 28, 32, 36, 40, 44, 48, 0, 0, 0, 52, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 56, 0, 60, 64, 68, 0, 0, 0, 72, 0, 76, 0, 0, 80 -> Monster.name, offset. When the exact same sequence would be written a second time for another Monster.name, it can instead be removed and replaced with a reference to the previously encoded table, e.g. 245, 245, 250, 0. Now all we have to do is relate it to the predefined table we have, where we can simply reference the number of the table instead.

mustiikhalil avatar Apr 27 '20 08:04 mustiikhalil

Regarding the optional / required table field status mentioned by @maxburke: I think changing the current behaviour is a bad idea with regard to backward and forward compatibility. Marking fields as non-optional is a "lie" we make for API convenience. There are no guarantees that a field is present, as we have no control over who created the buffer, given that your system is not a closed one. When we define a field as required, we commit to never being able to deprecate this field. We also can't make a new field required unless we can guarantee that no older clients, which don't know this field exists, will send buffers to newer clients which require this field to exist.

Speaking of forwards and backwards compatibility, enums are another blind spot. Specifically the way enums are converted to JSON. I think in order to guarantee proper backwards and forwards compatibility, enums need to always be represented as numbers, even though they look nicer as text in JSON. Text breaks with new cases and also if a case has to be renamed.

mzaks avatar Apr 27 '20 09:04 mzaks

I absolutely agree with this. I would extend this a bit further to add proper optional scalar fields. Adding bool flags for whether a scalar is present or not, as we need to right now if 0 is a legitimate value, is a quite frustrating workaround, given that non-scalars have this built-in.

There are two options how it can be solved in current FlatBuffers implementation.

  1. You disable default values and check if the value in vTable is 0 (I think most libraries have code for that)
  2. You introduce a struct which wraps the scalar value. As structs can't have default values and are zero cost if they have only one field, you get optional scalar value "for free".

I guess for the new version this kind of feature could be addressed in a more straightforward way, by being able to define a table scalar field as optional in the schema directly.
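Option 2 above (the single-field struct wrapper) can be sketched in schema form; the type and field names here are hypothetical:

```
// Hypothetical .fbs sketch: a single-field struct as an "optional" wrapper.
struct OptFloat { value: float; }

table Monster {
  mana: OptFloat;  // absent struct field = "no value"; 0.0 stays legitimate
}
```

This works because struct fields in tables have no defaults and can be absent, so presence is distinguishable from any stored value.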

mzaks avatar Apr 27 '20 09:04 mzaks

Switching to required by default does not break future compatibility any more than the current attribute does. On the other hand, we gain possible optimizations and more intuitive schemas. Today some fields are truly optional and require annotating with an attribute to make them required, while others cannot be annotated as such, and it's up to the users to guess whether they're optional or not.

krojew avatar Apr 27 '20 09:04 krojew

Switching to required by default does not break future compatibility any more than the current attribute.

There are two options how one can design default behaviour:

  1. Convenience first. Enabling power users to be as "lazy" as possible.
  2. Safety first. Enabling novices and people new to the topic, not to shoot themselves in the foot in the long run, just because they did not understand all the details.

Setting optional per default is going with the safety-first approach. You can add the required keyword to the schema any time if you decide that it is good for you. Switching from required to optional, however, is potentially dangerous in the long run.

On the other hand we gain possible optimizations and a more intuitive schemas.

Optimizations from a technical (performance) perspective? I don't see them, but I am open to suggestions.

Today some fields are truly optional and require annotating with an attribute to make required, while other cannot be annotated as such and it's up to the users to guess if they're optional or not.

When in doubt, always check for null. Specifically if you use FlatBuffers for communication. If I know your server expects a required field, I can send you buffers with null and crash your servers. This is a super easy DDoS attack angle.

mzaks avatar Apr 27 '20 09:04 mzaks

Setting optional per default is going with the safety-first approach. You can add the required keyword to the schema any time if you decide that it is good for you. Switching from required to optional, however, is potentially dangerous in the long run.

This will be a bit anecdotal, but every time I have been introducing FB to new people, optional by default was quite surprising to them. Also, experience shows people tend to create schemas corresponding to their data model, so if something is required now, it gets marked as required. I believe switching to required by default will be the intuitive way to go. We'll never have complete safety or be future proof unless everything is optional always, which is not the goal.

Optimizations from technical (performance) perspective? I don't see it, but please I am open for suggestions?

Required things can always be stored inline. This is quite good for performance.

When in doubt, always check for null. Specifically if you use FlatBuffers for communication. If I know your server expects a required field, I can send you buffers with null and crash your servers. This is a super easy DDoS attack angle.

Scalar types don't have a notion of null. That's another thing I would love to see - take a look at Option in Rust. Also, verifying a buffer is a separate subject, so let's not mix it in.

Side note, is anyone working on a verifier for Rust, like in cfb?

krojew avatar Apr 27 '20 10:04 krojew

This will be a bit anecdotal, but every time I have been introducing FB to new people, optional by default was quite surprising to them.

Ok, let's go with anecdotal 😀. In 2015 I worked on a city builder game where we stored user progress in FlatBuffers. The game was developed in Unity3D, but the town map itself was an isometric view, so a building position could be identified with a Position table which carries x and y as grid coordinates. In 2016 game designers introduced hills on the map, so a Position carrying just x and y was no longer sufficient. We had to introduce a z field. We introduced it and everything worked perfectly smoothly. Imagine we had introduced z as a required field. What would happen? Probably nothing in development, as in development you mostly start from a fresh state, or have an admin tool to populate the city automatically. But when we shipped the change, all the new versions of the game would crash on start. Why? Because the stored game state has a Position without the z field, and the z field is now required. This would be a small disaster, as it was a mobile game and deploying a fix on iOS can take days. So none of the existing players can play the game for days; you have a short-term impact of revenues going down and a possible long-term impact of people installing another game of the same genre and abandoning your game altogether.

What I am trying to visualise with this colourful example is that with required being the default, the evolution of a schema becomes a minefield which only people with experience will be able to avoid. To be honest with you, when we switched from Position(x, y) to Position(x, y, z) we would have stepped on that mine as well. It was our luck that we were protected by the sensible default behaviour of FlatBuffers. Because you are absolutely correct:

... people tend to create schemas corresponding to their data model, so if something is required now, it gets marked as required

Also regarding this:

We'll never have complete safety or be future proof unless everything is optional always, which is not the goal.

I am not sure why removing required altogether is not the goal? Protocol Buffers version 3 did it, and I personally would vote for doing it in FlatBuffers too. 😉

Anyway, I think I wrote enough. The decision lies with @aardappel.

mzaks avatar Apr 27 '20 13:04 mzaks

@mzaks I think there's a misunderstanding somewhere, because your example is quite wrong in this case. First of all, adding a new field to a table (required or not) will not break existing code (assuming the usual schema evolution guidelines). Second - there's domain data known to be required, and that gets marked as required, which is perfectly ok. Arguing that everything should be optional is over-defensive and will only lead to the frustration of not having the ability to express the data model properly in messages (with the additional burden of fact-checking every piece of data). Third - saying that someone did something is not a valid technical argument :)

That being said, my opinion on optional data for a future protocol, if it ever happens, is:

  • Make required the default.
  • Add proper optional support for scalar types.

krojew avatar Apr 27 '20 14:04 krojew

I agree strongly with @rw that using a format specifier as part of the header would definitely be a great way to evolve the format.

Lots of great comments on this thread, here's my thoughts to add to the mix:

  1. I really don't like the idea of removing all padding. One of the major benefits of FlatBuffers is the ability to read a buffer into memory and use it directly; removing padding takes this possibility away for many architectures. If this functionality is preserved with a flag for those odd CPUs (i.e. most) that require natural alignment to be efficient, then I guess that's ok.
  2. Moving unions into a single field sounds good. 6 bytes vs. 8 bytes... why though? If a developer is looking for a compact format - FlatBuffers definitely trades size for speed - then perhaps they could try something else?
  3. Removing the size field of vtable seems reasonable.
  4. Since this is code-gen'd sounds reasonable though more complicated when inspecting the format.
  5. Inline vtables sound good, assuming this can be handled at codegen time.
  6. Inline strings / vectors sound good, assuming this can be handled at codegen time.
  7. Removing 0-termination of strings seems like a recipe for things blowing up in special ways, especially since I would assume most C/C++ code still assumes zero-terminated strings. It again seems like a trade-off between size and speed; without a zero-terminated string, C/C++ code potentially has to copy, which seems bad. Perhaps for cases where the developer knows up front, they could use a flag to strip the terminator, and the C/C++ code would generate accessors that potentially copy if a caller needs a zero-terminated string (yuck).
  8. 8-bit and 16-bit size fields sound ok assuming the conditions are evaluated at code gen time. Again, feels like a size vs. speed trade-off. Perhaps consider making this sort of thing optional?
  9. Constructing the buffer forwards would certainly be more intuitive, though why did you construct backwards again? Per 26a30738a4fa4e92300821fd761764ec8df2dcf2 (wow, that was a long code review a long time ago): "The current implementation constructs these buffers backwards, since that significantly reduces the amount of bookkeeping and simplifies the construction API." Is that no longer the case?
  10. Definitely always include the file identifier, a couple of bytes always being present allows the format to evolve. I vaguely remember a discussion about this :)
  11. 64-bit offsets, starting with vectors, seem like a reasonable trade-off for flexibility.
  12. Anything that makes mutation easier would be great. Even in the case with shared v-tables - I know the implementation is more complex because you're basically doing copy-on-write - it would be great to more efficiently handle mutation and packing of a FlatBuffers so that use cases of other well used RPC formats can be replaced ;)
  13. FlexBuffers integration would be great to mix schema-less data into the API. It would be far better than implementing schema-less data in FlatBuffers, which ends up being a common pattern.
  14. Nested vectors could be neat, not critical though, it's a tiny bit a math to convert a flat to nested vector.
  15. Built-in variable sized integers again sound like a space vs. speed trade-off, if fields are explicitly marked as variable sized. Before implementing a feature like this it's probably worth figuring out what the size savings and time cost would be for functionality vs. laying other compression schemes on top of FlatBuffers.

stewartmiles avatar Apr 27 '20 16:04 stewartmiles

Support for indexed tables where a flatbuffer table is annotated with "key" and "value" annotations. This sounded like a niche use case - but I believe it's an important one and likely to be occupied by something else if not addressed.

adsharma avatar Apr 27 '20 16:04 adsharma

I think 0-terminated strings are not an issue. Removing C-style string getters would solve it.

krojew avatar Apr 27 '20 16:04 krojew

  1. Remove all padding.

It is possible if bit_cast<T> is used in C++ wherever scalars are accessed. That would put the C++ implementation on an equal footing with other languages that can't use reinterpret_cast. It would be better to assume that the alignment (padding) of a field may be a random number in the range [0..N] (not a power of 2). The internal implementation should not depend on a user-defined alignment that can be specified for every field in a schema.

  4. Allow different size vtable offsets.
  8. Allow 8 and 16 bit size fields on strings and vectors, currently they're always 32.

For vtable offsets and types, ASN.1 or variable-length encoding can be used. It may be a good idea to make the length of an fbs message aligned to 4 or 8, to pre-fetch data without extra checking. This is a simple 1:2, 1:4, or 1:8 decoder:

  auto x = load_uint32(bytes); // it can be uint16 or uint64
  // '&' binds tighter than '|'; parenthesized for clarity. LSB = 0 selects
  // the 1-byte form (value in bits 1..7), LSB = 1 the full-width form.
  auto offset = (((0u - (x & 1u)) & x) | (x & 0xFFu)) >> 1u; // [0x00-0x7F] or [0-0x7fffffff]
  bool is_1 = !(x & 1u); // true when the 1-byte encoding was used

FlexBuffer

Fast FlexBuffer with writing to pre-allocated memory (without any memory allocations) can be useful as a fast logging core.

vglavnyy avatar Apr 27 '20 19:04 vglavnyy

@mikkelfj not sure why you refer to some of these as "slow": like I said, these are all intended to be codegen-ed (or become a template/macro arg) so will all be maximally fast, certainly no slower than any existing feature. None of these feature are intended to have dynamic checks.

Not sure what you mean by strong JSON integration. JSON is text and not random access, so serves an entirely different use case than FlexBuffers. I am not sure what it would even mean to use JSON in this case, other than to store a string.

A new format could mean a new way to do file_identifier, I'd be open to that. Could be variable length.

"drop nested tables" .. what does that mean?

aardappel avatar Apr 27 '20 20:04 aardappel

@AustinSchuh forward building would be a big format change, since now children typically come before parents, and unsigned child offsets would be flipped to always point downwards in memory instead. Retaining the feature that these offsets always point one way and never can form a cycle would be good, I think.

A lot of people use FlatBuffers in cases where sparse / random access, or use with mmap are important, and those don't work with an external compression pass. I personally think there's a lot of value storing things more compactly in memory that is directly useable. Or at least, that is what FlatBuffers specializes in. None of these more compressed representations should be any slower (in fact, faster) than the current representation, they mostly just complicate codegen.

aardappel avatar Apr 27 '20 20:04 aardappel

@maxburke the varints would be very much optional, as they are definitely slower to read. They would be a type, so you'd explicitly write either int (as right now) or varint for a field.

See above about compression and efficiency.

Default required would be problematic for an evolving format. Protobuf came to the opposite conclusion after many years of experience, and FlatBuffers went along with that.

aardappel avatar Apr 27 '20 20:04 aardappel

@AustinSchuh

The offset for the scalar fields in the vtable is 0 if they aren't populated

Yes, and that means that the value is equal to the default, not that the value is not present. So can't be used to test for this purpose.

I agree that not being able to differentiate this has been something many users would have wanted. On the other hand being able to access scalars "blindly" without checking for presence, and the storage savings from defaults is not something I would want to miss.

aardappel avatar Apr 27 '20 21:04 aardappel

@rw

Maybe a varint?

That would make the distinction dynamic, and costly. Besides, we need to index into this table, which makes varints useless unless all the same size. The idea is that this is a static, codegen feature, associated with a particular type of table.

Would it be possible to have most (all?) of these features be specified with flags at the beginning of a payload?

No, for the same reason. It's extremely important FlatBuffers stays fast, so can't rely on dynamic checks for different encodings. Besides, the idea is to specify them per type, not per buffer.

aardappel avatar Apr 27 '20 21:04 aardappel

@mzaks Unaligned accesses will crash on ARM if the C/C++ (or Swift) code was generated assuming alignment, since unlike x86 they are different instructions. But it is pretty easy to portably generate unaligned instructions (from C/C++ at least).

I wouldn't want to add more branching to vtable access, so this can't be a dynamic flag.

I already suggested being able to specify the type of the size field of a string, so varint would be a great option there :) It could even be the default, given that 99% of strings are probably < 128 bytes, and reading a 1 byte LEB is very fast.
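A sketch of what such a length decode could look like (illustrative only, not existing FlatBuffers code; `read_leb_length` is a made-up name):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of a LEB128 length decoder: lengths < 128 cost one byte and one
// well-predicted branch; longer lengths fall through to the generic loop.
inline uint32_t read_leb_length(const uint8_t* p, size_t* consumed) {
  if (!(p[0] & 0x80)) {  // fast path: the vast majority of string lengths
    *consumed = 1;
    return p[0];
  }
  uint32_t result = 0;
  int shift = 0;
  size_t i = 0;
  uint8_t byte;
  do {
    byte = p[i++];
    result |= uint32_t(byte & 0x7F) << shift;  // 7 payload bits per byte
    shift += 7;
  } while (byte & 0x80);  // high bit set = more bytes follow
  *consumed = i;
  return result;
}
```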

We can allow cycles if someone invents a verifier that can deal with cycles and uses no memory :) I personally do not care for cycles but that is a discussion we can have if it ever comes to this.

I guess if the imaginary time comes when we get to make a V2 of FlatBuffers, also revving FlexBuffers at the same time would make a lot of sense. I haven't thought too much about what I would change there.

aardappel avatar Apr 27 '20 21:04 aardappel

@mustiikhalil if you wanted to remove vtables and make everything required, then your ideal format already exists: it's called Cap'n Proto! ;)

aardappel avatar Apr 27 '20 21:04 aardappel

I recommend we do not go too far into the "required vs optional" default discussion, because in the end this is an API/schema feature, not a binary format feature, assuming you want to keep vtables.

aardappel avatar Apr 27 '20 21:04 aardappel

@stewartmiles

  1. I think ARMv5 is the last mainstream architecture that couldn't read unaligned with a single instruction, or am I missing something? Given that this would be a forward looking format, I think either not supporting chips that old, or having it compile down to multiple instructions for compatibility, is acceptable.
  2. 6 bytes is simply enabled by the lack of padding, much like many features on this list. Saving 2 bytes is not the biggest deal, but it should not be any slower to access, so why not?
  3. The current format is already unparsable without a schema, and yes, this will make it slightly worse. Also reflection code will be slower and more complicated.
  4. Yes.
  5. Yes.
  6. I don't think there's a lot of C++ code still out there that would touch that byte, but you're right that if it does, it'd be hard to track. Worth considering.
  7. Yes. And it would be optional (default to the current 32-bit).
  8. Thanks for digging that up :) I think the bookkeeping was under the assumption that you want the end result to be parents in memory first, which would require fixups. But there really is no reason why the parents can't be later in memory.
  9. Yup, I think so! Tying it to unshared vtables would be a good simplification, I think. Fun bit of FlatBuffers history: my absolute first design for FlatBuffers had unshared & inline vtables of 8-bit offsets (with offsets scaled to the size of the thing pointed to, to in theory allow up to 2048 byte tables). But discussing this with you and others it became clear that 16-bit was a safer bet. But then the tables became bigger, so sharing was introduced, and with it, the offset indirection :)
  10. Yes, this would be an explicit type. Not a fan of compression on top of FlatBuffers as it loses so much of its unique properties that way.

Edit: added blank numbers, since github markdown renumbers these to be contiguous if not present??

aardappel avatar Apr 27 '20 21:04 aardappel

@adsharma We already have this? You can annotate a field as key (and the other fields are effectively the value) and then use it with binary search lookups if you place these in a vector.

aardappel avatar Apr 27 '20 21:04 aardappel

@vglavnyy std::bit_cast takes a const T &, which you cannot legally construct an unaligned version of, so doesn't seem that useful? To be fully correct we'd have to go via std::memcpy which all compilers luckily can optimize into a single memory load.
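A sketch of that memcpy idiom (illustrative; `read_unaligned` is a made-up name, but mainstream compilers do fold the memcpy into a single load):

```cpp
#include <cstdint>
#include <cstring>

// Strictly portable unaligned scalar read: std::memcpy carries no alignment
// assumption, and GCC/Clang/MSVC compile this down to a single load.
template <typename T>
inline T read_unaligned(const void* p) {
  T value;
  std::memcpy(&value, p, sizeof(T));  // no UB, no alignment requirement
  return value;
}
```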

aardappel avatar Apr 27 '20 21:04 aardappel

@rw

Would it be possible to have most (all?) of these features be specified with flags at the beginning of a payload?

No, for the same reason. It's extremely important FlatBuffers stays fast, so can't rely on dynamic checks for different encodings. Besides, the idea is to specify them per type, not per buffer.

I still think this is worth looking into more. The reason is that this could open up a FlatBuffers ecosystem, beyond just one new FlatBuffers version. Maybe there's a way to have a compile-time fast path for AoT feature flag specifications, alongside a slightly-dynamic mode of operation, too.

rw avatar Apr 27 '20 23:04 rw

@AustinSchuh

The offset for the scalar fields in the vtable is 0 if they aren't populated

Yes, and that means that the value is equal to the default, not that the value is not present. So can't be used to test for this purpose.

I agree that not being able to differentiate this has been something many users would have wanted. On the other hand being able to access scalars "blindly" without checking for presence, and the storage savings from defaults is not something I would want to miss.

I forgot to mention, we run with ForceDefaults(true) everywhere. The presence of that option, support for per-field metadata in the schema, and zero copy reads were the reasons we chose Flatbuffers. Once you do that, has_ works as expected. There is also nothing saying that has_ and returning the default when unset are exclusive. Flatbuffers today return the default if the field isn't populated.

AustinSchuh avatar Apr 28 '20 00:04 AustinSchuh

@aardappel

We can allow cycles if someone invents a verifier that can deal with cycles and uses no memory

First thing would be to enhance flatc to detect recursive table definitions. I should be able to do that; I've already done it for my code generators. The verification and table build API can stay as is if there are no recursions in the definition. If a cycle is detected, flatc can check if a special attribute (e.g. allow_cycles) is present in the schema. If it is not present, the generated code stays the same as if there were no cycles. Verification is still fast, and building a buffer with the standard API cannot produce a cyclic graph anyway. But if the user explicitly marked the schema with allow_cycles, then flatc produces an additional builder API for a second-pass building step. This implies that we can add a dummy table reference in the first pass and replace it with the actual table reference in the second pass.

Verification can be done in two steps as well. In the first pass we do the fast, established verification. However, if a reference points backwards and the user enabled allow_cycles, we keep the backwards references in memory and check in a second pass that they point to actual tables.

This would follow the premise, keep fast things fast and slow things possible, if necessary.

mzaks avatar Apr 28 '20 07:04 mzaks

@adsharma We already have this? You can annotate a field as key (and the other fields are effectively the value) and then use it with binary search lookups if you place these in a vector.

I don't think the current code is good enough to support a database with multiple key fields and being able to construct a valid flatbuffer with zero copy from a key+value read from a store.

Previous work from 2016:

https://drive.google.com/open?id=0BwPRLHxpLD-adVM2UVNFMVFCUm8 https://github.com/adsharma/flatbuffers/commits/byteorder2

adsharma avatar Apr 28 '20 22:04 adsharma

it would be awesome to have more scalar types available:

  • uint128 used eg in UUID
  • uint160 used eg in Ethereum for blockchain addresses
  • uint256 used eg in Ethereum for all numbers
  • decimal128 decimal floating point https://en.wikipedia.org/wiki/Decimal128_floating-point_format - used eg in accounting/databases

oberstet avatar Apr 29 '20 10:04 oberstet

it would be awesome to have more scalar types available:

This can be done with fixed length arrays in the current format. However, a typedef in the schema would be helpful.

mikkelfj avatar Apr 29 '20 10:04 mikkelfj

I think having a uuid as a built-in type would be useful. It's so popular, I think having it, even as a sugar for an array, will benefit users.

krojew avatar Apr 29 '20 10:04 krojew

That's what a typedef will do for you. Meanwhile it can be wrapped in a struct.

mikkelfj avatar Apr 29 '20 11:04 mikkelfj

This can be done with fixed length arrays in the current format.

can fixed length array types be used the same as built-in scalars (say uint64)? would be great, because if so, yes, a built-in typedef would fully solve this ..

another source of difference might be the binding code generators producing different bits for fixed size arrays vs scalars?

oberstet avatar Apr 30 '20 10:04 oberstet

can fixed length array types be used the same as built-in scalars (say uint64)? would be great, because if so, yes, a built-in typedef would fully solve this ..

No but they can be used for things like UUID's or crypto values. A typedef solution would be able to define uint64 to mean something else and preserve underlying int type. Regardless, this has nothing to do with FB2 - it can easily be added to current FB if so desired. You could also have a struct with just a uint64 member to simulate a typedef, but that isn't very convenient.

mikkelfj avatar Apr 30 '20 11:04 mikkelfj

@AustinSchuh but force defaults potentially wastes a lot of space, so I don't think it's a particularly good solution. In particular, calls like Create, which serialize all fields, assume default values are not serialized.

aardappel avatar Apr 30 '20 16:04 aardappel

@mzaks yes, as opt-in with some kind of allow_cycles it could work, if there's enough demand for such a thing. We'd need signed offsets in general then though, or generate code to interpret them as signed with allow_cycles on.. that will complicate the library code a lot.

aardappel avatar Apr 30 '20 16:04 aardappel

@adsharma multiple keys would also require multiple sort orders to be efficient, or indices separate from the vector containing the values.. the current solution is simple because it works almost transparently on existing vectors.

If you wanted multiple sort orders, you could do this today by storing multiple vectors. But yeah, we'd need some schema support to indicate which key is used with which vector.

aardappel avatar Apr 30 '20 16:04 aardappel

@oberstet @mikkelfj @krojew I'd say a struct makes for a pretty good typedef already: struct UUID { v:[uint:4]; }

We could consider native typedef, but I am not sure what it would add. The problem with typedef in C/C++ is that you typically don't want to allow arbitrary mixing of the new type with the type it's based on, which is exactly what struct already prevents.

Of course we have no way to typedef strings/vectors currently, so who knows, maybe that's useful.

aardappel avatar Apr 30 '20 16:04 aardappel

If I were to use flatbuffers instead of SQL DDL to represent a database, we would have one flatbuffer for the primary key + data and one flatbuffer for each index, where the key is a sequence of fields making up the secondary fields and the value is the primary key (typically an id).

This is a powerful use case for copy elimination on fast devices such as NVMe or even in-memory.

adsharma avatar Apr 30 '20 16:04 adsharma

@adsharma @aardappel I'm not reading all comments closely ATM, but regarding multiple keys: FlatCC (for C) already supports multiple keys. FlatCC does not automatically sort on a key while the buffer is constructed, for technical reasons (the buffer representation is not complete, and you might be streaming). Thus, sorting is done in the reader API: you cast a vector to mutable, then call the sort method. This allows sorting on one of several keys. The first key in a table is considered the primary key. A table has a sort method that will recursively sort all nested members by the default key. Because the first key is not always the one of interest, it is possible to use attributes to modify this. There is also a 'sorted' attribute, so vectors are not sorted in a recursive sort call unless marked as sorted. All of this is fully compatible with the current single-key behaviour in C++.

mikkelfj avatar Apr 30 '20 16:04 mikkelfj

If I were to use flatbuffers instead of SQL DDL to represent a database, we would have one flatbuffer for the primary key + data and one flatbuffer for each index, where the key is a sequence of fields making up the secondary fields and the value is the primary key (typically an id).

I am using such a setup for large distributed low latency write once read many solution. There are multiple records per buffer, but there are many more flatbuffers overall.

mikkelfj avatar Apr 30 '20 17:04 mikkelfj

Right - the cases I was optimizing for were non-streaming, typical database use cases, where the key is specially constructed in such a way that the tuple sort and a byte sort in the key value store have the same outcome. This method is used in pretty much all consumers of rocksdb, including MyRocks and Cassandra. I think the key difference is that the sort is done at write time in these cases by the key value store interpreting the flatbuffer as a byte array.

Good to hear that this is a use case of interest for you as well.

adsharma avatar Apr 30 '20 17:04 adsharma

Comments

Make unions into a single field (and vectors of them into a single value). Also make the type part 16-bit while we're at it, so a union is always a 6-byte struct.

Why not allow specifying the backing type of a Union? union MyUnion : uint16

Always have a file_identifier, and make it the first thing in the buffer? Always have a length field as well?

+1 to length field.

The offset for the scalar fields in the vtable is 0 if they aren't populated

Yes, and that means that the value is equal to the default, not that the value is not present. So can't be used to test for this purpose.

Could we use 0 as (use default) and 1 as (was set to the default) since an offset of 1 would never point to a valid location (tables always start with an soffset_t to their vtable, so it shouldn't be addressable if a table immediately follows a vtable).

Or, if 1 can point to a valid location, could we rearrange the vtable structure so the size is at the end, and a voffset_t value <= the offset of the vtable size field would have a special meaning.
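A sketch of how a generated accessor could distinguish the three states under this proposed encoding (hypothetical; neither the enum nor the sentinel value exists in FlatBuffers today):

```cpp
#include <cstdint>

// Proposed sentinel scheme:
//   voffset 0 = field absent (reader substitutes the default),
//   voffset 1 = field explicitly set to the default value,
//   anything else = a real offset into the table.
enum class Presence { Absent, SetToDefault, SetToValue };

inline Presence field_presence(uint16_t voffset) {
  if (voffset == 0) return Presence::Absent;
  if (voffset == 1) return Presence::SetToDefault;
  return Presence::SetToValue;
}
```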

dbaileychess avatar Apr 30 '20 18:04 dbaileychess

  1. For a buffer that has entirely un-shared vtables (see 5), it now becomes more feasible to allow in-place mutation of more complex things. This is definitely a complex/contentious feature, but I think if we ever re-booted the format this should be designed in from the start if possible.

For scalar fields that are 16-bits or smaller, they could be placed directly in the vtable itself which removes one indirection, and saves at least 16-bits / field.

dbaileychess avatar Apr 30 '20 20:04 dbaileychess

@aardappel on slow: I am referring to branching. Whenever the reader has to make a choice, things slow down. Currently FlatBuffers is half the speed of optimally reading a native struct if the buffer fits in L1 cache. This is pretty good considering there is a vtable. If there is an extra check, the speed can easily become 1/3 of native speed, or even worse. It can improve through branch prediction, or worsen due to load on the instruction cache. If there are even more choices the effect becomes worse. This is largely why Protocol Buffers is slower than FlatBuffers.

mikkelfj avatar Apr 30 '20 21:04 mikkelfj

@dbaileychess

  • Yup we can allow specifying the size of the union type, and maybe default it to 16 to protect those that are going to otherwise run into trouble later :)
  • Putting a value like 1 would have a slight cost, since even checking <= 1 as "offset is not valid" is typically slower than a test for 0.
  • Sticking values directly in the vtable would be awesome.. but only possible if the table was declared to be not shared. I guess this would be a further advantage of making it not shared, if you have many small values. At further complication to code generators, and making reflection yet harder. But I think it be worth it.

aardappel avatar May 01 '20 00:05 aardappel

@mikkelfj there would be no branching. The features I am proposing are all at codegen time, at runtime, the accessor code will only expect one kind of format, much like it does now.

aardappel avatar May 01 '20 00:05 aardappel

yes, as opt-in with some kind of allow_cycles it could work, if there's enough demand for such a thing. We'd need signed offsets in general then though, or generate code to interpret them as signed with allow_cycles on.. that will complicate the library code a lot.

Technically, current offsets are signed 32-bit ints. They are defined as unsigned, but nobody can use the 32nd bit anyway, as buffer size is capped at 2GB. Speaking of offsets and complicating things :). What about being able to define the type / size of the offset in the schema? The offsets are relative, and given a small payload, say for data communication, a u8 offset might be sufficient for some cases, reducing the payload size dramatically. It's also ironic that by reducing the offset type size we increase the possibility for a relative offset to be representable as a smaller type.

I agree that it will introduce "combinatoric" complexity in the library code, as we will have to provide methods to compute all possible offset types from a relative value. But this code should not be "algorithmically" complex. What I mean by that is: the code is not hard to write, there is just more of it. The generated code will pick the proper methods to read and build buffers based on the schema definition, so performance wise there should be no penalties.

We would need to have a default which is safe to use in 90% of cases (e.g. keeping the i32 as default)
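A sketch of how codegen could select the offset width per schema type with no runtime branching (illustrative names, not existing API; assumes little-endian storage, as FlatBuffers uses today):

```cpp
#include <cstdint>
#include <cstring>

// The offset width becomes a compile-time template parameter chosen by the
// schema compiler (uint8_t, uint16_t or uint32_t), so each generated
// accessor reads exactly one fixed-width little-endian offset: no checks.
template <typename OffsetT>
inline const uint8_t* follow_offset(const uint8_t* field) {
  OffsetT off;
  std::memcpy(&off, field, sizeof(off));  // unaligned-safe read
  return field + off;                     // offsets are relative in FB
}
```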

mzaks avatar May 01 '20 08:05 mzaks

@aardappel "drop nested tables" .. what does that mean?"

Support for storing a flatbuffer table inside another table in a uint8 array has taken at least 40% of the total development time of FlatCC, if not more. And it is my impression that where it has been easier elsewhere, it is because it was not done correctly. If padding is removed, some of this might go away. But I'd rather just not have nested tables. If you really need them you can always store them in a uint8 blob, but without library support.

mikkelfj avatar May 01 '20 09:05 mikkelfj

Regarding key lookups in FlatBuffers, and maps in FlexBuffers generally: a friend of mine pointed out that the "Eytzinger layout" might be more efficient than sorted keys + binary search, especially if the number of keys is large. I think it is something that might be considered/evaluated for the next version of FlatBuffers.
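A branch-light lower bound over an Eytzinger-ordered key vector might look like this (a sketch under the assumption that keys are laid out in BFS order with slot 0 unused; not existing FlatBuffers code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Search over keys stored in Eytzinger (BFS) order: the children of node i
// live at 2i and 2i+1, so the descent touches memory cache-friendlily and
// the comparison compiles to arithmetic rather than an unpredictable branch.
// Returns the index of the first key >= x, or 0 if all keys are smaller.
size_t eytzinger_lower_bound(const std::vector<int32_t>& keys, int32_t x) {
  size_t i = 1;
  while (i < keys.size()) {
    i = 2 * i + (keys[i] < x);  // go right when the current key is too small
  }
  // The answer is the last ancestor where we turned left: strip the trailing
  // 1-bits accumulated by right turns, then one more bit.
  while (i & 1) i >>= 1;
  return i >> 1;  // 0 means "not found"
}
```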

mzaks avatar May 03 '20 09:05 mzaks

Thanks for the proposal, Wouter. We found items 10 and 11 very helpful in our use cases. It will be nice to support them in the new binary format.

lu-wang-g avatar May 04 '20 18:05 lu-wang-g

@mzaks interchanging signed and unsigned is not quite as easy in C++ as you make it sound.. you can't just make it a typedef and hope other code keeps working. Working with these types is tricky to get right, as C++ is a minefield of weird conversions and undefined behavior.

I'd be totally in favor of supporting or even defaulting to an "Eytzinger layout".

aardappel avatar May 04 '20 19:05 aardappel

@mikkelfj I don't understand how it could take 40% of development time, it certainly was a <1% feature in C++ and most other languages. It has been rather useful for a lot of users.

aardappel avatar May 04 '20 19:05 aardappel

It's mostly because correct padding introduces an endless number of special cases to ensure correct alignment within the parent container and also because vtables must be contained within the local region of the buffer. From what I have seen elsewhere alignment wasn't done at all, so the minimal 4 byte alignment was used, and later 8 bytes because that is mostly good enough - but that doesn't make it correct. Every time a new feature is added, there is a lot of extra complexity in handling the nested buffer case.

mikkelfj avatar May 04 '20 19:05 mikkelfj

Additionally, there is a lot of extra interface code, like constructors for adding a table as a nested buffer, or functions to add fields to a table that is the root of a nested buffer. It's duplication of a large volume of functions. Of course, if you create the buffer separately and just add it as a blob with a given alignment, it becomes a lot easier - and that is essentially what I'm suggesting - no inline nested buffer support.

mikkelfj avatar May 04 '20 19:05 mikkelfj

@mzaks on Eytzinger layout: I'd like to keep things simpler, although it is a good suggestion. There are endless ways you can optimize a buffer index, not least separating the keys into a vector separate from the payload. A basic sort is something that everyone understands, and which is also useful on its own for non-search reasons.

For fast indexing I have been using hash table indices stored in buffers separate from the indexed content, using the simplest possible efficient hash function (fnv1a) in order to make it easy for other developers to generate the buffers.
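For reference, a 32-bit FNV-1a sketch (this is the standard published algorithm, which is what makes it easy to reproduce in any language that needs to generate compatible index buffers):

```cpp
#include <cstddef>
#include <cstdint>

// FNV-1a, 32-bit: xor each byte into the state, then multiply by the
// FNV prime. Tiny, fast, and trivially portable.
inline uint32_t fnv1a_32(const void* data, size_t len) {
  const uint8_t* p = static_cast<const uint8_t*>(data);
  uint32_t h = 2166136261u;  // FNV offset basis
  for (size_t i = 0; i < len; ++i) {
    h ^= p[i];
    h *= 16777619u;          // FNV prime
  }
  return h;
}
```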

FlatBuffers is very suited to playing such index games, but I think the core functionality should be simple and efficient, not aggressively optimized. It would raise the barrier for efficient portable implementations and potentially add code bloat. The current flatcc search function is essentially a standard library binsearch call, although I could easily have added a more optimized search.

I could imagine FlatBuffers having the option to have one vector index another vector, either using a sorted key or a hashed key based on linear open addressing and fnv1a hashing, but it adds complexity, and it is not difficult to custom build such logic for specific use cases.

FlatCC also adds linear scan search on non-indexed fields, because some users wanted to be able to search arbitrary elements in a collection more than they wanted speed. This was not as easy to add as it sounds, due to the lack of templates or (portable) generic functions in C, so there was quite some work on macros.

There are other aspects that would be nice such as indexing by two keys at the same time in case of conflicting primary keys, but again, it starts getting complex and FlatBuffers shouldn't be a full blown database.

mikkelfj avatar May 05 '20 07:05 mikkelfj

On sorting, I'll add that FlatCC does not sort while the buffer is being created, because that doesn't work with the streaming pluggable buffer generation used by FlatCC. Instead the reader has logic to sort before searching, if needed. I already mentioned this, but the point here is that the sort is done inline using a very fast heap sort implementation that requires no additional memory - but for the same reason the sort is not stable. An Eytzinger layout would resemble a heap, so perhaps there is a fast algorithm, but most likely it would require more memory and complexity to handle, and the reader code should be light.

mikkelfj avatar May 05 '20 07:05 mikkelfj

@mikkelfj I agree with you regarding complexity, but as someone mentioned here already, there should be a focus. If the primary focus of FlatBuffers is speed, then it makes sense to evaluate a technique which supports it.

It is a big step; this is why I think there should be some research done first. When we have the data, it will be easier to decide if the gain in performance justifies the complexity in code.

mzaks avatar May 05 '20 07:05 mzaks

One more idea, which could be introduced in a new version of FlatBuffers.

Trusted Buffer

Motivation:

Introduce the possibility to create a hash, based on the stored data and a private key, before finishing the buffer. The hash value / signature is stored in the buffer between the first offset and the actual data, and can be validated before accessing the values of the buffer. As a result the recipient of the buffer can validate that the buffer can be trusted and was not tampered with by a 3rd party, or by network issues. Therefore Trusted Buffers do not have to be verified and can be used for similar purposes as JWT (in case people are not familiar with JWT, I found this video provides a nice explanation https://www.youtube.com/watch?v=soGRyl9ztjI).

Needed changes:

I think the binary format does not have to change. The signature can be placed after the "headers" - the pointer to the first vtable, file identifier chars, etc. - and before the actual data. Given the first vtable offset, I assume it should be possible to identify the start of the actual data, and therefore of the hashed region. In the schema there should be a way to define that the buffer should be signed, and which hashing algorithm should be used. The builder needs to provide a method to get the buffer (input for the hash function) and to store the signature. The "reader" needs to provide a method where users can get the signature and the signed buffer. IMHO the hashing itself should be the responsibility of the user.
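A sketch of how a reader might check such a signed layout, assuming the layout proposed above (header, then signature, then hashed payload). The function, the layout parameters, and the hash callback are all hypothetical; the hash itself is user-supplied, as suggested:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// The hash is pluggable: the format would only fix where the signature
// lives, not which algorithm produced it.
using HashFn = std::vector<uint8_t> (*)(const uint8_t* data, size_t len);

// Hypothetical buffer layout: [ header | signature | payload ].
// Recomputes the hash over the payload and compares it to the stored
// signature. Returns false on truncated buffers or mismatch.
bool verify_trusted_buffer(const uint8_t* buf, size_t size,
                           size_t header_size, size_t sig_size, HashFn hash) {
  if (size < header_size + sig_size) return false;
  const uint8_t* sig = buf + header_size;
  const uint8_t* payload = sig + sig_size;
  std::vector<uint8_t> expected = hash(payload, size - header_size - sig_size);
  return expected.size() == sig_size &&
         std::memcmp(expected.data(), sig, sig_size) == 0;
}
```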

mzaks avatar May 08 '20 08:05 mzaks

HMAC support doesn't need to be incorporated into the format. Even if it were, the buffer still could not be trusted - the inner data might still be invalid for whatever reason (e.g. version mismatches), or the key might have been compromised, and using the buffer would result in a crash.

krojew avatar May 08 '20 08:05 krojew

I agree with @krojew - this is feature creep, and choosing a proper HMAC is a rat's nest. If you transfer a flatbuffer over an HTTPS or a QUIC connection you already have authentication, although you may have to do more to verify the source. I agree that there can be cases where you want a signature, such as using FlatBuffers for package managers, but there a domain-specific standard would apply, and signatures are typically stored in a separate file.

mikkelfj avatar May 08 '20 08:05 mikkelfj

Let's have a look at JWT (the internals of JWT are explained in this video if you are not familiar with it: https://www.youtube.com/watch?v=_XbXkVdoG_0). As mentioned at the end of the video, a key can be compromised in JWT as well. The solution is to blacklist and change the key.

@krojew I agree that there is no 100% certainty, specifically if you are talking to an untrusted party. However, if you are talking to a trusted party, or an untrusted party is forwarding a message created by a trusted party (see JWT), a signature is "good enough", just to make sure that the message was not tampered with. A trusted entity can be trusted to have the correct version, so version mismatch should not be a concern. An untrusted party which forwards a trusted message could try to crack the key, but that would be a brute force attack. And this is something that JWT is vulnerable to as well. FlatBuffers is actually much more secure than JWT, because JWT contains pure self-described JSON plus the information about the hash function, so an untrusted forwarding party can read the data and make assumptions. In the case of FlatBuffers, the untrusted forwarding party does not know the schema, and therefore would need a considerable amount of time and effort to even read the data / reverse engineer the schema before it could try to crack the key, without knowing which hash function was used.

@mikkelfj I understand the concern about feature creep, but as I discussed in the "Needed changes" section, the actual changes should be minimal and do not have to be implemented in all languages at once. If I wanted to replace JWT with a Trusted FlatBuffer, only my servers, which are probably written in one language, would need to be able to read and verify the token (the Trusted FlatBuffer). The clients handle the token as is: they receive the token from the server and include it in every HTTP request header, without the need to verify or read it.

mzaks avatar May 08 '20 10:05 mzaks

You can't compare FB to JWT, since JWT will not crash your process when it contains invalid data. A signature can only verify that you're talking to someone presumably trusted, but even in this case the data might still be invalid, which means you always have to verify external messages. Hence there is no notion of a trusted buffer in such a case, only a (probably) trusted sender. I've been using HMAC-SHA with FB for a long time now, and message validation is always required nevertheless. It's just too risky with languages without memory safety, like C++.

krojew avatar May 08 '20 10:05 krojew

@krojew

You can't compare FB to JWT, since JWT will not crash your process when it contains invalid data.

It all depends on how your code accesses the data JWT encapsulates. If your code assumes that some properties in the payload part of the JWT are required πŸ˜‰, your code will crash (throw an exception) if the payload does not contain those properties. So no real difference here.

When we read data from the FB, most readers are generated with a focus on speed and perform no extra validation steps, so validating an untrusted buffer upfront is an important option. You could (and I did, in FlatBuffersSwift) generate a reader which does not crash on a badly structured buffer, but this reader will be slower because of the checks it has to perform on every value extraction.

Now in case of JWT, the payload is produced by me and read by future me (because I am stateless). If I can't trust myself and have to validate my own messages, there is something really wrong with me πŸ˜€. The only thing that I need to validate is that nobody changed the message which I wrote.

mzaks avatar May 08 '20 11:05 mzaks

It all depends on how your code accesses the data JWT encapsulates. If your code assumes that some properties in the payload part of the JWT are required πŸ˜‰, your code will crash (throw an exception) if the payload does not contain those properties. So no real difference here.

There's a fundamental difference between parsing JSON and assuming a property exists, and directly accessing unverified memory. You have the ability to react to non-existent properties. You don't have any ability to react to invalid memory access.

Now in case of JWT, the payload is produced by me and read by future me (because I am stateless). If I can't trust myself and have to validate my own messages, there is something really wrong with me πŸ˜€. The only thing that I need to validate is that nobody changed the message which I wrote.

Assuming you are absolutely sure you're the sender (which you can't be), you can still have bugs. This, combined with the nature of unchecked memory access, can lead to nasty crashes. Bottom line - don't trust yourself.

To add more counterpoints for HMAC:

  1. Implementing it in a generic way is quite complicated.
  2. Implementing in a non-generic way is shooting oneself in the foot.
  3. It's a pessimistic approach when not needed, e.g. in secure closed networks. You would need to disable "trusted buffers" in such case precisely because you trust your buffers.

Any kind of signing should be a part of a larger protocol, not a serialization format.

krojew avatar May 08 '20 11:05 krojew

@krojew I guess we just need to agree to disagree 😁. We both stated our arguments. The decision is not in our hands anyway, and I for one am quite happy it is not πŸ˜€.

mzaks avatar May 08 '20 11:05 mzaks

Sorry, one thing I would like to add though. Regarding this:

Implementing it in a generic way is quite complicated.

I don't see it as complicated at all. Let me explain:

Before you finish the builder, the builder's internal buffer contains all the tables and other values. The builder also knows the exact size of this buffer, so it can give this buffer (or just a memory pointer + size) to the user for them to create a signature. I don't see anything complicated here. The user then finishes the builder and provides the signature. In the schema we define the hash algorithm; we don't care about the actual algorithm applied, but we do care about the hash length. We can check that the byte length of the provided signature is as expected and prepend the signature to the buffer before finishing it, meaning that the resulting buffer looks something like this:

[First vTable Offset] [other meta data] [Signature] [Data]

This buffer can be read even by the current FB reader, as the [Signature] block would be jumped over due to the value of [First vTable Offset].

A reader which chooses to verify the buffer should be able to ask for the [Signature] block and the [Data] block. Getting those is quite easy, because the sizes of the [First vTable Offset], [other meta data] and [Signature] blocks are described in the schema. So given [Signature], [Data] and the key, the user can verify whether the data was tampered with.
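A minimal Python sketch of this layout (purely illustrative: it assumes a 4-byte root offset, omits the [other meta data] block, and picks HMAC-SHA256 with a 32-byte digest; `finish_with_signature` and `verify` are hypothetical names, not FlatBuffers API):

```python
import hashlib
import hmac
import struct

SIG_SIZE = 32  # assuming HMAC-SHA256; the schema would declare this size

def finish_with_signature(data: bytes, key: bytes) -> bytes:
    # Layout: [First vTable Offset] [Signature] [Data].  The root offset
    # points past the signature block, so a reader that does not care
    # about signatures jumps straight to [Data].
    sig = hmac.new(key, data, hashlib.sha256).digest()
    root_offset = 4 + SIG_SIZE  # 4-byte offset field + signature block
    return struct.pack("<I", root_offset) + sig + data

def verify(buffer: bytes, key: bytes) -> bool:
    # Split the buffer back into [Signature] and [Data] and check the MAC.
    sig = buffer[4:4 + SIG_SIZE]
    data = buffer[4 + SIG_SIZE:]
    return hmac.compare_digest(sig, hmac.new(key, data, hashlib.sha256).digest())
```

A reader that ignores signatures just follows the root offset; a reader that cares calls `verify` first.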

Buffer verification is an optional step which only interested parties need to perform. So I think point 3:

You would need to disable "trusted buffers" in such case precisely because you trust your buffers.

is not applicable, as you don't have to add the signature and you also don't have to verify the signature. It is all pretty much optional, no need for disabling. You just don't use it.

mzaks avatar May 08 '20 12:05 mzaks

I don't see it as complicated at all...

You forget that FB would need to support any current and future algorithms out of the box to preserve broad compatibility and security, if it were built in. You describe a case where it's not built in, and this is the exact case I talked about - HMAC should be a part of a user protocol, not FB. I use HMAC with FB today using exactly this approach - by adding protocol layers. We don't need FB to support any signature at all for this to work, since we already have it working now.
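Such a protocol layer might look like this (a hedged Python sketch; the `wrap`/`unwrap` names and the 32-byte HMAC-SHA256 envelope are my illustrative choices, not part of any FB API - the point is that the FlatBuffer itself is never touched):

```python
import hashlib
import hmac

SIG_SIZE = 32  # HMAC-SHA256 digest length

def wrap(fb_bytes: bytes, key: bytes) -> bytes:
    # Envelope: [32-byte HMAC] [untouched FlatBuffer].  FB stays exactly
    # as the builder produced it; the envelope lives in the transport layer.
    return hmac.new(key, fb_bytes, hashlib.sha256).digest() + fb_bytes

def unwrap(envelope: bytes, key: bytes) -> bytes:
    # Verify before handing the bytes to any FlatBuffers reader.
    sig, fb_bytes = envelope[:SIG_SIZE], envelope[SIG_SIZE:]
    expected = hmac.new(key, fb_bytes, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("signature mismatch; refusing to parse buffer")
    return fb_bytes
```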

Is not applicable as you don't have to add the signature and you also don't have to verify the signature. It is all pretty much optional, no need for disabling. You just don't use it.

Exactly my point - if you trust your buffers, you need to disable the "trusted buffers" feature. That does raise a red flag at the whole design.

krojew avatar May 08 '20 12:05 krojew

You don't want to pull in a host of crypto libraries and JSON parsers to deal with FlatBuffers in a conformant manner. If you have a use case for JWT, by all means add the libraries, add a signature after the end of the buffer, and add a size field to the FB header so you can find it, but there is no reason to make it integral to the format. I could almost agree to a Blake2B signature, but then you have to consider symmetric vs asymmetric secrets, and there it breaks down. You are better off authenticating your comm channel, such as adding authentication to MQTT or whatever transport.

mikkelfj avatar May 08 '20 12:05 mikkelfj

Another crypto idea I have actually had for a while is a signature of the schema in a Merkle-tree style. Each release of the schema is hashed using a Blake2B hash after some whitespace normalization. The hash is concatenated with the previous version hash (or null), and this becomes the final version hash. Assuming the release history is known and not too frequent, a buffer can use this hash in its header to identify the type of content it is supposed to hold. The fully qualified name of the root table could also be included in the hash. But even this gets quite complex.
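A rough Python sketch of the chaining, assuming a 16-byte Blake2B digest and a deliberately naive whitespace normalization (all names here are illustrative, not a proposed API):

```python
import hashlib

def normalize(schema_text: str) -> bytes:
    # Crude whitespace normalization; a real scheme would be stricter.
    return " ".join(schema_text.split()).encode("utf-8")

def version_hash(schema_text: str, previous: bytes = b"") -> bytes:
    # Chain each release hash to the previous one, Merkle style.
    h = hashlib.blake2b(digest_size=16)
    h.update(previous or b"\x00" * 16)  # prior release hash, or null
    h.update(normalize(schema_text))
    return h.digest()

v1 = version_hash("table Monster { hp: int; }")
v2 = version_hash("table Monster { hp: int; mana: int; }", previous=v1)
```

A buffer tagged with `v2` implicitly commits to the whole release history, since `v2` depends on `v1`.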

mikkelfj avatar May 08 '20 12:05 mikkelfj

Ok now I think I understand your main concern. I totally agree that FlatBuffers should be completely separated from the hashing function / algorithm.

The only thing that FlatBuffers could provide is the ability to embed a signature if one chooses to do so, removing the complexity of having a custom envelope where the buffer and the signature are stored. We don't even have to couple the signature to the hash function in the schema. We can have a declaration like:

signature_size: 256;

So any byte array of 256 bits can be stored before the [Data] block. Users can get the [Data] block on build to create a signature, and get the [Data] and [Signature] blocks on read.

This is it. No external dependencies inside FlatBuffers. All the complexity of creating the signature, storing keys, and verifying that the signature is correct lies with the user, outside of FlatBuffers. We just take away the hassle of storing the signature in the payload.

So it is like JWT, but the Header part is defined in FlatBuffers schema and the signature part is "hidden" inside of the buffer.

mzaks avatar May 08 '20 12:05 mzaks

You don't need to add empty space as a part of a buffer. Just add it in your custom protocol, like we can today. It's not a burden and we save ourselves the need to support some arbitrary empty chunk of space in the format.

krojew avatar May 08 '20 12:05 krojew

You don't need to add empty space as a part of a buffer. Just add it in your custom protocol, like we can today. It's not a burden and we save ourselves the need to support some arbitrary empty chunk of space in the format.

The space does not have to be empty. If you finish a builder without providing the signature, your buffer will look the same as it looks now:

[First vTable Offset] [other meta data] [Data]

So there is no waste of space. You probably want to ask: what happens if I try to verify the signature of an unsigned buffer?

Not a problem: you would then be verifying, say, the first 256 bits of [Data] against the rest of [Data], and it will not match. To be sure that we don't crash, we need to check that the size of the whole buffer, minus the sizes of [First vTable Offset] and [other meta data], is bigger than the size of the signature.

No custom protocol, no custom code, no arbitrary empty chunk of space in the format.

mzaks avatar May 08 '20 12:05 mzaks

No custom protocol, no custom code, no arbitrary empty chunk of space in the format.

Just a bunch of logic which should be in the protocol.

krojew avatar May 08 '20 12:05 krojew

Just a bunch of logic which should be in the protocol.

Logic which you can rely on, as it is battle tested and used by many users. A custom protocol is ... custom. Don't get me wrong, I love custom; in some cases I also write custom binary formats. But there is room for holistic and generic solutions; otherwise we would not be "here" πŸ˜‰.

mzaks avatar May 08 '20 13:05 mzaks

Your proposal doesn't solve any problem which is not already trivially solvable today, yet adds a burden of maintaining it in a generic way. Adding some empty space to a buffer is not a "battle tested" protocol. Appending/prepending a buffer with a signature is not rocket science - no need to wait for it to prove itself. In essence, this idea can be summarized as "let's allow adding arbitrary space to a buffer", which is no different than allocating a bigger chunk of memory by the application and adding custom data. I don't see any benefit at all of doing that. Serialization library (which FB is) should have a single responsibility - (de)serialize data. Any additional features are up to the users.

I see no further point discussing this particular topic. Opinions have been expressed.

krojew avatar May 08 '20 14:05 krojew

  1. Remove 0-termination of strings. Only C/C++ care for this, and C++ has been moving toward string_view recently, and both have been using size_t arguments for a long time rather than relying on strlen. Other languages don't use it. For passing to super-old C APIs that expect 0-termination, either swap the terminating byte temporarily while passing that string, or copy.

To interact with other libraries, a zero terminator is often needed and beneficial (strings can be used in place). To me this change makes sense only if the new format supports both types of strings, i.e. the zero terminator can be included in the length explicitly.

  1. Support 64-bit offsets from the start. They would be optional for vectors and certain other things, allowing buffers >2GB. See https://github.com/google/flatbuffers/projects/10#card-14545298

Yes please! This means lifting all 32-bit limitations, correct?

ivannp avatar Jun 17 '20 16:06 ivannp

@ivannp I made the exact same objections to removing zero termination earlier, but I am softening up to the idea. I already have extensive libraries written that all have an _n variant that takes a length argument, notably because text parsing often results in unterminated string segments.

If you store the length of the string together with the string offset, which I do in other high-performance contexts, then you can actually compress a lot of strings by referencing existing strings. Also, the length can be determined without a pointer indirection. Overall this should result in size and speed improvements.
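A toy Python sketch of that idea (not FlatBuffers code; `StringPool` and its layout are made up for illustration): references carry an (offset, length) pair, so identical strings collapse into one stored copy and the length is available without dereferencing the offset.

```python
class StringPool:
    # Sketch: references are fat (offset, length) pairs; the pool stores
    # each distinct string payload exactly once.
    def __init__(self):
        self.data = bytearray()
        self.seen = {}  # payload bytes -> offset of the single stored copy

    def add(self, s: str):
        payload = s.encode("utf-8")
        if payload not in self.seen:
            self.seen[payload] = len(self.data)
            self.data += payload
        # Length travels with the reference, no pointer indirection needed.
        return (self.seen[payload], len(payload))

pool = StringPool()
ref_a = pool.add("hello")
ref_b = pool.add("hello")  # deduplicated: identical (offset, length) pair
```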

For the few cases where you absolutely need a zero-terminated string, these are typically slow APIs like fopen, and here I have given up on static buffer handling of paths, since malloc has become fast and path operations are a true pain if you cannot return dynamically updated paths. And on Windows you have a whole mess with translating from UTF-8 to UTF-16, which also requires allocation. Even though the latest Windows Server release does have UTF-8 as a native codepage, it will be many years before you can rely on that.

So, if all vectors store offset and size together, and unions store offset and type together, and we drop alignment requirements, and we store buffers front to back, I think we are on to something. Strings would be vectors proper.

I'd still like an analysis of how to deal with SQL NULL data consistently. I'd also like to be able to clone a table recursively without knowing its type, by annotating vtables. This annotation could indicate whether it is a table, vector, struct or union, and perhaps also a hint on deduplication of subgraphs. All in a small bit vector.

mikkelfj avatar Jun 18 '20 15:06 mikkelfj

Additionally: vtable offsets to strings, unions, etc. should be a single entry, not two, as is currently the case for unions. This just complicates matters.

There is also an issue with JSON encoding of unions, where you cannot be sure to see the type before the data because they are separate fields. Even if the type is written first, it is not guaranteed to still be the case after the data has been processed, notably because the _type suffix sorts later.

mikkelfj avatar Jun 19 '20 09:06 mikkelfj

@mikkelfj yup, that's #2 in my original post.

aardappel avatar Jun 22 '20 17:06 aardappel

Unaligned access might cause problems with modern compiler SIMD optimizations. It would technically require a memcpy of all data, even on little-endian platforms, to make access safe.

mikkelfj avatar Jun 24 '20 13:06 mikkelfj

  1. Support 64-bit offsets from the start. They would be optional for vectors and certain other things, allowing buffers >2GB. See https://github.com/google/flatbuffers/projects/10#card-14545298

Yes, that would be great! We are considering FlatBuffers as a base for our future work, and this did come up as a limitation, as we definitely have serialized game assets larger than 2 GB.

I do like the idea of 64-bit offsets being something that one would opt into for these specific large use cases, while otherwise continuing to use the more compact 32-bit offsets.

repi avatar Jun 24 '20 14:06 repi

@repi this is also still being considered for the existing FlatBuffers, but it is going to be a significant effort supporting it in all languages.

I am guessing you'd be most interested in the Rust implementation? @rw

I'd be curious to hear what the structure of a typical >2GB game asset would be like, to see if it matches the idea in the link that you'd just need a single vector of 64-bit offset to sub-assets that all fit in 2GB each. I presume the reason for wanting to have it all in a single FlatBuffer would be use with mmap or similar?

aardappel avatar Jun 25 '20 21:06 aardappel

One more thing regarding the offsets. Currently the offsets are relative. Having a relative offset has its benefits, but I think those benefits are not "used". The offsets are always 32 bits long (not speaking about vtable offsets here), and AFAIK there are no mechanisms for extracting subgraphs in FlatBuffers. The pitfall of relative offsets is higher entropy; IMHO this is one of the reasons why FlatBuffers compresses worse than some other formats.

mzaks avatar Jun 26 '20 07:06 mzaks

I use offsets all the time in C. A single pointer into the buffer is sufficient to find the rest. Otherwise you need to keep a base pointer around. The verifier does this because it has to, but other code does not.

mikkelfj avatar Jun 26 '20 10:06 mikkelfj

@mikkelfj was your message in regards to my message or something else :)

mzaks avatar Jun 26 '20 14:06 mzaks

@mzaks you, since offsets were mentioned.

mikkelfj avatar Jun 26 '20 15:06 mikkelfj

@mzaks absolute offsets would not compress any better (probably worse, since they'd be bigger numbers on average than relative offsets).

To get absolute offsets relative to the start of the buffer, we'd also need to build the buffer forwards, as suggested in 9).

And absolute offsets are worse in many cases, because you need to know the start of the buffer to be able to convert them to a pointer, whereas relative pointers can be converted to pointer just from their current location. The C++ API would need "fat pointers" instead of the naked pointers it uses now to be able to work with absolute offsets.

aardappel avatar Jun 26 '20 16:06 aardappel

And absolute offsets are worse in many cases, because you need to know the start of the buffer to be able to convert them to a pointer, whereas relative pointers can be converted to pointer just from their current location. The C++ API would need "fat pointers" instead of the naked pointers it uses now to be able to work with absolute offsets.

Yeah, that makes sense. However, the user should keep the pointer to the start of the buffer for memory management reasons anyway. The only problem is that it would have to be "exposed" through all the reading sites.

Now I also understand @mikkelfj better:

Otherwise you need to keep a base pointer around

Regarding:

absolute offsets would not compress any better (probably worse, since they'd be bigger numbers on average than relative offsets)

For a single value that would be true, but a repeating value lowers the entropy. Say we have a shared vtable and 50 table instances which reference it. With relative offsets we have 50 different values, but with absolute offsets we have only one value, used in 50 places. A frequent value will be mapped to a smaller representation during compression (see Huffman coding, for example).

I developed my own serialisation format for value sequences, which I used to encode CSV files. The result is smaller than CSV, and when I compress it with zip it is still smaller than the compressed CSV (tested on different data sets). IMHO one of the reasons why it compresses so well is that I use absolute offsets.

But as always, it is not black and white. As you mentioned, translating to absolute offsets has its pitfalls. My claim of better compression needs to be researched for FlatBuffers first.
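A toy experiment to sanity-check the entropy argument (this is not FlatBuffers' real layout, just 50 synthetic offsets to a shared vtable, compressed with zlib):

```python
import struct
import zlib

# 50 tables at different positions, all referencing one vtable at position 8.
table_positions = [16 + 24 * i for i in range(50)]
VTABLE_POS = 8

# Relative encoding: each table stores (table position - vtable position),
# so all 50 stored values are different.
relative = b"".join(struct.pack("<i", p - VTABLE_POS) for p in table_positions)

# Absolute encoding: every table stores the same value, 8.
absolute = struct.pack("<I", VTABLE_POS) * 50

assert len(relative) == len(absolute) == 200  # identical uncompressed size
print(len(zlib.compress(relative)), len(zlib.compress(absolute)))
```

On this synthetic data the absolute stream compresses much better, since it is a single repeated 4-byte value; real FlatBuffers data would of course need measuring.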

mzaks avatar Jun 27 '20 08:06 mzaks