ipld performance and a suggestion to improve it
When serialising/deserialising blocks we currently go through an intermediate representation, the ipld data model. This has the advantage that generic ipld tools can understand and work with this representation (eg selectors/graphsync). The disadvantage is that it is inefficient. I propose designing a new codec (I'll be working on this) that, similarly to the parity scale codec, exploits knowledge of the structures for fast serialisation/deserialisation but, like cbor, also contains metadata that allows generic tools to parse it into an ipld representation.
As an example: The following struct could have two potential ipld representations:
```rust
struct A {
    field1: bool,
    field2: u32,
}
```
{ "field1": true, "field2": 1}
{ "field2": 1, "field1": true}
To be able to parse both of these representations into the struct, we need an intermediate representation: the data is first parsed into some kind of hash map and then each field is extracted from that map.
This new codec would enforce the first representation for fast serde. And it's probably just cbor with alphabetically sorted keys...
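To make the difference concrete, here is a minimal sketch of the "straight to and from cbor" path, using the serde and serde_cbor crates purely for illustration (the proposed codec would additionally need dag-cbor details such as CID handling). Because the fields of `A` are declared in alphabetical order, the derived serialiser emits the keys in exactly that order, and nothing ever goes through an intermediate map:

```rust
// Sketch only: assumes the `serde` (with derive) and `serde_cbor` crates.
use serde::{Deserialize, Serialize};

// Fields are declared in alphabetical order, so the derived serialiser
// always produces the first ("field1" before "field2") representation.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct A {
    field1: bool,
    field2: u32,
}

fn main() -> Result<(), serde_cbor::Error> {
    let a = A { field1: true, field2: 1 };
    // Encode straight to CBOR bytes, no intermediate ipld map.
    let bytes = serde_cbor::to_vec(&a)?;
    // Decode straight back into the struct.
    let decoded: A = serde_cbor::from_slice(&bytes)?;
    assert_eq!(a, decoded);
    Ok(())
}
```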
So my benchmark of adding 1000 elements to an ipld vec only improves by 10%. This is mainly because it is not a recursive data structure. For the hamt I'm sure it would make a much larger difference, but I haven't implemented it yet. The basic idea is that the ipld data model is a read-only view for protocols built on ipld, but most applications want to go straight to and from cbor.
Can we extend the dag-cbor spec to require keys to be sorted alphabetically in a map?
> Can we extend the dag-cbor spec to require keys to be sorted alphabetically in a map?
I don't think we need to introduce additional tags for dag-cbor. Would it be enough to just specify one specific order when data is encoded as dag-cbor?
Yes, that's what I meant. It will make implementations in languages that don't have a macro system slower, but in rust or scheme we can generate code and do the sorting at compile time.
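In a language without that macro support, the same guarantee has to be bought at runtime. A minimal sketch of that slower path, with a placeholder entry type and a plain byte-wise sort standing in for whatever ordering rule the spec ends up mandating:

```rust
/// Sort map entries before handing them to the encoder. `V` stands in for
/// whatever value type the codec uses; the byte-wise comparison below is
/// only a placeholder for the ordering rule the dag-cbor spec would specify.
fn sort_entries<V>(mut entries: Vec<(String, V)>) -> Vec<(String, V)> {
    entries.sort_by(|(a, _), (b, _)| a.as_bytes().cmp(b.as_bytes()));
    entries
}

fn main() {
    let entries = vec![("field2".to_string(), 1u32), ("field1".to_string(), 0)];
    // Regardless of insertion order, the encoder now always sees "field1" first.
    let sorted = sort_entries(entries);
    assert_eq!(sorted[0].0, "field1");
}
```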
It's only necessary on structs, where the keys are known in advance. Maybe it's the schema that should specify the ordering of the struct representation. From the spec it's not clear if key order matters or not. It reads like it doesn't.
I already have to force a consistent key ordering in dag-json in JavaScript. It sucks but it’s necessary.
After thinking some more about this I'm adding a couple of notes for future reference. Ideally a codec would have the following properties:
- binary
- maintains forwards/backwards compatibility like protobufs
- is self-describing for applications/protocols using the ipld data model
  - this path can be slower and possibly read-only
  - the metadata can possibly come after the data and can be ignored on the critical path
- is canonical
FWIW, a lot of what you describe seems to also fall within the mission of the new IPLD Schemas work, especially as we get to working on code generation driven by the schemas. :)
One of the big features we want to unlock is indeed generating code which uses native (for the language of choice) structures (especially where this means a compiler will generate assembler with fixed offsets, etc), and needs no maps as intermediates in deserialization; or, for even higher specialization, codec code which operates directly on the native structures with no intermediate abstractions.
The former of those two features should already bring speed gains; the latter requires an even larger volume of generated code (so it's not without tradeoffs!) and would be specific to a single codec (again, tradeoffs), but would tend to be about as fast as it can get.
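For illustration, here is a hand-written sketch of the kind of decode function such codegen might emit for the struct `A` from above. The `Decoder` trait is purely hypothetical, not any real library's API; the point is only that the generated code knows the field names and their order at compile time, so it reads values straight into the native struct and never builds an intermediate map:

```rust
// Hypothetical decoder interface, standing in for whatever the real codec exposes.
trait Decoder {
    type Error;
    fn read_map_len(&mut self) -> Result<usize, Self::Error>;
    fn read_key(&mut self, expected: &str) -> Result<(), Self::Error>;
    fn read_bool(&mut self) -> Result<bool, Self::Error>;
    fn read_u32(&mut self) -> Result<u32, Self::Error>;
}

struct A {
    field1: bool,
    field2: u32,
}

// What a strict-ordering decode for `A` could look like: keys are checked
// in the one canonical order and values land directly in the struct fields.
fn decode_a<D: Decoder>(d: &mut D) -> Result<A, D::Error> {
    let len = d.read_map_len()?;
    debug_assert_eq!(len, 2);
    d.read_key("field1")?;
    let field1 = d.read_bool()?;
    d.read_key("field2")?;
    let field2 = d.read_u32()?;
    Ok(A { field1, field2 })
}
```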
I'm working on this now for golang. Other languages will also need their own efforts :)
> Maybe it's the schema that should specify the ordering of the struct representation. From the spec it's not clear if key order matters or not. It reads like it doesn't.
Yeah, that's currently conspicuously absent from the specs, but agreed that it's worth thinking about.
There seem to be user stories both 'for' and 'against' this kind of strict ordering... so perhaps one way of cutting that Gordian knot will be simply to make codegen tools which generate code either with or without strictness. The less strict code will handle more inputs; the stricter one will be slightly faster and use somewhat less memory. Both might be useful.
(I tend to err on the side of thinking strict order is preferable for binary encodings; but then imagine if you used JSON input and it was order-strict; most people will be annoyed by this most of the time, I suspect.)
> maintains forwards/backwards compatibility like protobufs
I agree with this in general vision, but also want to shoot a wee hole in one detail. :) Backwards and forwards compatibility are good. I don't know if I think protobufs represent the pinnacle of development in that regard, though. The years-long debate on whether or not required fields are socially acceptable in protobufs was a wild ride; and I'm not at all sure the current everything-optional, no-missing-field-is-an-error thinking sounds very good to me either. Broadly, I'd say the experience of protobufs seems to be that while they can go backwards and forwards in compat... it only seems to work when versions evolve somewhat linearly; forking a protobuf protocol basically doesn't work in a community sense (it tends to have a "fork-only" UX), and this is a serious limitation. Backwards and forwards compat is good, but it's a very complex topic, and I wouldn't take all my cues directly from protobuf without a grain of salt.