flatcc
flatcc copied to clipboard
Add support for external references / mutable interface
It's very convenient to store data in flatbuffers format, linearly load it into memory and just use generated accessors to traverse it. However, there are some cases when this loaded data needs to reference some other data, which can be stored elsewhere. While it's certainly possible to use strings or hashes as these references and manually dereference them when needed, this could be suboptimal when accessing one field many times, and error prone, since accessors must be written manually for each such field. It would be much better if there was a way to traverse such buffer and store pointers to external data, so that they could be used directly. Flatbuffers C++ API does have support for external references in unpack functionality, but for me it seems a bit excessive for the task. Since flatcc project already have some extensions to flatbuffers format I'll try to make a draft proposal for another one :)
- in fb schema add support for field attribute (ref:typename), which will generate hidden field (just like union)
- when generating binary flatbuffers write number of references needed by that buffer just after offset to the root table
- use hidden ref fields to store offset into reference array
- add function to reader interface to allocate array of pointers, traverse flatbuffers data and fill that array using callback in form of void * get_reference( const char * typename, const char * name )
- generate functions to access these reference fields as pointers using table and reference array pointers
Another variation of this is to save enough space for pointers instead of offsets to reference array, this way API would be even more simple (no need for additional allocation, no need for providing additional pointer to reference array to accessor), but the drawback is that we need to make some assumption about size needed to store pointer (will 64 bit be enough for all cases?), and binaries will contain some junk data.
Any thoughts?
P.S. I'd very like to have this feature for my own pet project, so I'm quite willing to implement this in some form, as long as it have a chance to be accepted upstream :)
Any hidden field schema extension is unlikely to be implmented - for the sake of portablility, because unions already are ugly as they are, and because it would probably also need support in JSON which makes it at lot of work, and fragile.
Custom attributes could be supported at some point. The current compiler does not carry attributes through in a clean way to the code generator which makes it difficult. And I would still hesitate for the above reasons.
The risk is that flatcc becomes some sort of custom database on its own which would then be bent to a specific use case - I'd rather keep it as a support layer, transport layer, and abstraction layer for databases.
That said, there is a need for a layer that handles mutations, and there is a very practical need for the current reader api to return a pointer to an element, and not just its value. The pointer would support custom hacks needed for mutations.
I imagine mutations done by a modified reader implementation (might not need a new code generation, except perhaps a facility for holding a context pointer). One way to implement this is to look up all references in a hash table, or to tag modified references and then do a hash table lookup, or some other lookup such as using a different base for the reference. When modifying an element, it would be copied and possibly tagged.
Storing 64-bit pointers in-place in hidden fields is not the way to go - hash lookups or segmented lookups are fairly fast and if it really is needed, flatbuffers can be compiled with 64-bit references (but untested so far).
Allocation in the reader interface is also out, since the entire concept is direct raw access with no management of anything. A mutable layer on top, would be different, and it would likely be a separate library.
The biggest challenge in the above is how to handle a context pointer without changing the entire reader interface and without using global pointers. It is needed for things like identifying the hash table in which to perform a lookup. Instead of returning pointers to elements, it could be returning a struct holding context and pointer/reference.
This is all actually very complex to do in a clean way that is future proof and not something I am prepared to do on a whim, although I have been given this quite some thought from time to time.
There is also a need to consider interop with the possible future schemaless flatbuffer concept.
On context objects, I think there would be certain pluggable accessor methods that accept a context pointer. These accessors can the be activated via a define.
Thanks for detailed answer, it gave me some different new ideas.
-
When going inplace pointers route there is really no need in additional hidden field. Since tables already have vtables containing field offsets, it's possible to leave space between field data. So, builder can leave this space for fields tagged as (ref) in schema, normal reader interface won't need any change, and yet it will be quite easy to generate efficient reader/writer functions for those inplace pointers, as well as recursive "reference resolving pointer storing" function.
-
Add mutable reader interface supporting context objects. Since they won't be part of standard reader I think it's possible to allow them to allocate memory. This will make it possible for following features:
- mutate fields which are not stored in original flatbuffer object
- mutate nonscalar fields which currently not possible even with original C++ implementation
- implement pointer caching for "reference" fields
Regarding adding secret fields, this is probably a bad idea since it either locks in the design or risk getting broken by future changes. However, the emitter can be customized and it can get out of band type information and allocate extra data. Out of band happens by the user code calling the builder api first setting a type field in the custom emitter object. The emitter then knows to handle the next emitted object specially.
As to more general mutation allocation, this need to be carefully designed to make it work on a variety of platforms like the builder has a pluggable low-foot print pluggable allocator.
Regarding in-place pointer tagging, it is preferable to do this in a way that does not make the interface vulnerable to malicous field overlaps. This is best done by not modifying the original buffer at all, but references cannot overlap other references due alignment requirements checked by verify so modifying a reference can at most affect structs, string content, and scalars which is structurally safe. Therefore a reference can be tagged in the lower two bits, or it can be placed outside the valid buffer, e.g. by setting the most significant bit. Either way, this can be checked in a single comparison.
It may still be preferable to not touch the original buffer at all. For example when memory mapping large files (possible containing multiple buffers) you do not want to write back data, nor do you want to disturb the main memory cache more than necessary. This does, however require a hash table check of the reference, or a similar check (e.g. a 4 byte to one bit bitmap).
Regarding scalar updates etc. these should never be update in-place because it is unsafe, and not needed with a mutable interface.
After some more thinking I came up with one more idea how to approach buffer mutation. Given that original buffer should be immutable, and the fact that flatbuffers format allows to leave place for some additional data, why not implement something like buffer inheritance? When we need mutation functionality we just allocate dynamically growing buffer and store in some out of band, but well defined area pointer back to original data. Now accessing data becomes something like:
- check for data in mutable buffer using usual access methods
- if data not found go through pointer back to original data and check there
Mutating fields becomes search-modify/add to dynamic buffer. Security concerns are not an issue here since we fully control our new growing flatbuffer. Also this idea could be extended to implement versioning (or just avoid reallocations) through flatbuffers chaining. Of course, there are lots of details that need to be worked out.
I would not call that buffer inheritance - there are good reasons for flatbuffers to not support inheritance since it locks in the schema making change difficult.
But runtime inheritance, if we can call it that, is exactly what I had in mind. It's just that there is a need to decide where to start looking, and this is where hash tables and pointer tagging comes in. You could also cache the exact item location in a limited cache if it is worthwhile, and if you can manage dirty caches efficiently. But these are implementation details. The real challenge is how to design the API in a clean way, and all the extra work with rebuilding/compacting clean buffers from mutations, deciding when compaction should be triggered automatically, avoiding excessive extra code complexity etc.
I'm keeping this open as a basis for future discussions regarding a mutable interface. The title as a bit misleading though, given how the discussion unfolded.
@skhoroshavin On the discussion of inheritance, there is now a solid proposal for doing so via mixin's: https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#mixins
It would issue an error if there are conflicting names, and it wouldn't directly solve the custom attribute issue, but it might be a starting point. That said, there are currently no plans to implement this, but it is considered to be a solid possible future design.
@mikkelfj thanks, I've read a proposal and it looks good for inheritance, but I fail to see how does it help to address mutability. Any hints?
Not really, but I have this idea of using a piece chain data structure to overlay a static buffer, so you can sort of edit it out of place. It would require a modified reader.
https://github.com/martanne/vis/wiki/Text-management-using-a-piece-chain
When you build a new buffer, the partial objects are available as a negative offset to buffer end. Ignoring that the memory is not contigous (can be fixed), you can convert this offset to a readable object. With a piece chain you can store which buffer and which offset the reference belongs to.
So you can have an original buffer and one or more new buffers. When done, you can compile this into a single consistent buffer. This needs a clone method for tables which is not currently available, but it is something that is needed. Note that clone is tricky because a table is really a DAG, but so is a piece chain.
One could also add deduplication at the same time.
Thanks for pointing at piece chain article, was definitely worth reading. Now back to mutable interface - I've got an impression that when implementing it using piece chains mixins or any sort of inheritance is not needed, did I get it correctly?
Yes, I think they are largely unrelated. It may also be that a hash table is better than a piece chain, but the overall principle remains.
References:
@skhoroshavin Cloning for tables, vectors and unions is now implemented (also pick to clone one field from table to table of same type) is now implemented and scalar fields have a _get_ptr. This might be useful for copy on write hack. Perhaps there should also be a _get_ptr for table etc. pointing into where the field reference is stored.
@skhoroshavin you are probably not working with flatcc anymore, but an update on this: Meanwhile I have been having the same idea after long forgetting about this topic.
The use case is having flatbuffers implement an AST for user implemented parser. In this case you typically want to link symbols to additional data. The alternative is to use either hash table or a linear table the same size as the original buffer to store symbol references, or to manually add reference fields in the schema.
In the end I think it would be too clunky to generate fields, and in terms of implementation, the union type which is a similar concept (two fields for one) already adds much code complexity and bug potential. And finally it would require casting non-mutable fields to mutable, or support mutable fields.
In the end the better approach is simply a hash table. And if for performance reasons that is not good enough, simply modify the schema either manually, or with a script using specific triggers like a custom attribute to insert the fields.
Hence I'm closing this as a won't fix.