Research binary data encoders for baseline messages
Overview
To improve messaging efficiency and enforce message schemas in baseline environments, we should research the use of binary data encoders. Advantages of serialized binary encodings over JSON:
- makes it easy to enforce message schemas, so that different systems can parse messages reliably
- binary formats are more compact than JSON because they carry less overhead (for example, no key names repeated in every message)
- broad language compatibility
We need to research the following options to determine which one best fits our needs and whether it is worth implementing in the baseline codebase:
- Protocol buffers (Google)
- Apache Thrift (originally Facebook)
- Simple Serialize aka SSZ (Ethereum)
Questions
- What is the timeline for adding binary serialization to the baseline packages?
Tasks
- [ ] Compare the binary serialization options and report findings
- [ ] Decide whether to use one of the binary encoding options or stick with JSON
- [ ] Decide what priority this work should have on the baseline roadmap
ASN.1 BER/PER is pretty standard.
for SSZ, my reference is https://github.com/ChainSafe/ssz
Messaging Protocols Comparison
The following contains research notes created by @Perseverance and @tkstanczak
Custom protocol
Documentation
No documentation - will have to provide it ourselves
Overview
Create our own custom messaging protocol by defining the exact layout of the parameters in a byte array.
Pros
- Very efficient - no protocol overhead beyond the fields we define ourselves
Cons
- Non-standard - we have to provide documentation and ask each implementer to write their own encoder and parser.
- Have to write and maintain documentation.
- Have to write your own logic for field size description/limitation
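To make the trade-off concrete, here is a minimal sketch of what a hand-defined byte layout could look like. The field names, sizes, and byte order are hypothetical and not taken from any baseline specification; the point is only that every size check and every offset has to be written and documented by us.

```python
import struct

# Hypothetical layout (an assumption for illustration), big-endian, no padding:
#   version  : 1 byte, unsigned
#   msg_type : 1 byte, unsigned
#   sender   : 20 bytes
#   length   : 2 bytes, unsigned, size of the payload that follows
#   payload  : `length` bytes
HEADER = struct.Struct(">BB20sH")


def encode_message(version: int, msg_type: int, sender: bytes, payload: bytes) -> bytes:
    """Pack the fields into the fixed header followed by the raw payload."""
    # Field size limits are our own responsibility in a custom protocol.
    if len(sender) != 20:
        raise ValueError("sender must be exactly 20 bytes")
    if len(payload) > 0xFFFF:
        raise ValueError("payload does not fit in a 2-byte length field")
    return HEADER.pack(version, msg_type, sender, len(payload)) + payload


def decode_message(data: bytes):
    """Unpack a message produced by encode_message."""
    if len(data) < HEADER.size:
        raise ValueError("message is shorter than the header")
    version, msg_type, sender, length = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("payload is truncated")
    return version, msg_type, sender, payload
```

The header costs 24 bytes regardless of content, and nothing in the bytes themselves describes the layout, which is why the documentation burden listed above falls entirely on us.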
JSON
Documentation
https://www.json.org/json-en.html
Overview
JSON encodes data as key-value pairs using the main JavaScript types. It is widely used because it is very easy to serialize and deserialize.
Pros
- Widely adopted - there is a parser/encoder in every remotely popular language
- Easy to read
Cons
- Large overhead, since key names are repeated in every message
- No native support for some types, e.g. binary data. Workarounds such as Base64 encoding are needed to carry them.
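The two cons above can be seen in a small sketch. The message shape (a `sender` string plus a 32-byte `proof`) is a made-up example, not a baseline message type.

```python
import base64
import json


def json_encode(sender: str, proof: bytes) -> bytes:
    """Encode a message as JSON; binary data has to be wrapped in Base64."""
    return json.dumps({
        "sender": sender,
        "proof": base64.b64encode(proof).decode("ascii"),
    }).encode("utf-8")


def json_decode(data: bytes):
    """Reverse json_encode, undoing the Base64 wrapping."""
    obj = json.loads(data)
    return obj["sender"], base64.b64decode(obj["proof"])


proof = bytes(range(32))            # stand-in for 32 bytes of binary data
encoded = json_encode("alice", proof)

# Size of the actual information carried: the sender text plus the raw bytes.
raw_size = len("alice".encode("utf-8")) + len(proof)
print(raw_size, len(encoded))
```

Base64 alone grows the 32 raw bytes to 44 characters, and the key names, quotes, and punctuation are added on top of that for every message sent.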
Protocol Buffers
Documentation
https://developers.google.com/protocol-buffers
https://developers.google.com/protocol-buffers/docs/proto3
Overview
A serialization mechanism developed by Google. A schema (message definition) is compiled into model classes in each target language. Definitions can be nested, and data is packed by default. The definition language resembles the type definitions found in technologies like GraphQL.
Pros
- Binary on the wire, but the generated code hides the encoding details
- Very efficient
- Support for many languages
- Support for field deprecation and protocol upgrades (by adding new fields)
Cons
- Schemas need to be compiled (an extra build step)
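As a rough illustration of why the format is compact and why adding fields does not break older readers, the sketch below hand-encodes two wire types from the Protocol Buffers encoding documentation (varint and length-delimited). It is not the protobuf library and not how we would use it in practice, where generated code does this work; the function names are made up for this example.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_varint_field(field_number: int, value: int) -> bytes:
    tag = (field_number << 3) | 0     # wire type 0 = varint
    return encode_varint(tag) + encode_varint(value)


def encode_bytes_field(field_number: int, data: bytes) -> bytes:
    tag = (field_number << 3) | 2     # wire type 2 = length-delimited
    return encode_varint(tag) + encode_varint(len(data)) + data


def decode_varint(data: bytes, pos: int):
    """Return (value, next_position)."""
    result, shift = 0, 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_fields(data: bytes) -> dict:
    """Map field number -> value for the two wire types handled here."""
    fields, pos = {}, 0
    while pos < len(data):
        tag, pos = decode_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:
            value, pos = decode_varint(data, pos)
        elif wire_type == 2:
            length, pos = decode_varint(data, pos)
            if pos + length > len(data):
                raise ValueError("truncated length-delimited field")
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        fields[field_number] = value
    return fields


message = encode_varint_field(1, 150) + encode_bytes_field(2, b"testing")
print(message.hex())   # 0896011207746573 74696e67 without the space
```

Fields are identified by number rather than by name, so a reader that only knows field 1 can call `decode_fields(message).get(1)` and simply ignore field 2. That is the mechanism behind upgrading a protocol by adding new fields.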
Thrift
Documentation
https://thrift.apache.org/
Overview
Apache Thrift is not just a messaging protocol but also a generator for client and server applications based on a schema. It allows defining both the message types and the business services that will be available.
Pros
- Reportedly very efficient and fast
Cons
- Hard language to develop in
- Not specific to messaging protocols, since it also covers service generation
SSZ
Documentation
https://github.com/ethereum/eth2.0-specs/blob/dev/ssz/simple-serialize.md
Overview
SSZ is a compact data encoding used in Ethereum 2. It defines a compact binary serialization for objects, as well as Merkleization mechanics for arbitrary object structures.
Pros
- Used in Ethereum, with implementations in Rust, Python, C#, Java, JavaScript.
- Compact
- Crypto friendly
- Allows skipping items and jumping to an exact position, so a reader can parse only the data it needs and discard everything else (faster parsing)
Cons
- Limited tooling as it is only used for Ethereum 2 at the moment.
- Not human readable (but core dev readable via hex ;))
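The "skip to the exact position" property can be sketched for one small container. The container itself (a `uint64` id, a variable-length byte list, and a `uint16` nonce) is a hypothetical example; the rules used are the ones from the SSZ spec linked above: unsigned integers are little-endian, and a variable-size field is represented in the fixed part by a 4-byte offset to where its data starts. Merkleization is not covered here.

```python
OFFSET_SIZE = 4  # BYTES_PER_LENGTH_OFFSET in the SSZ spec

# Hypothetical container, fields in declaration order:
#   msg_id  : uint64            (fixed size, 8 bytes)
#   payload : list of bytes     (variable size, stored as a 4-byte offset)
#   nonce   : uint16            (fixed size, 2 bytes)
FIXED_PART_SIZE = 8 + OFFSET_SIZE + 2


def serialize(msg_id: int, payload: bytes, nonce: int) -> bytes:
    """Fixed part first (with an offset standing in for payload), then the payload."""
    fixed = (
        msg_id.to_bytes(8, "little")
        + FIXED_PART_SIZE.to_bytes(OFFSET_SIZE, "little")  # where payload starts
        + nonce.to_bytes(2, "little")
    )
    return fixed + payload


def read_nonce(data: bytes) -> int:
    """Read only the nonce: its position is known without touching the payload."""
    return int.from_bytes(data[12:14], "little")


def read_payload(data: bytes) -> bytes:
    """Follow the offset; the last variable-size field runs to the end of the data."""
    offset = int.from_bytes(data[8:12], "little")
    return data[offset:]


encoded = serialize(1, b"hi", 7)
print(encoded.hex())
```

Because every fixed-size field sits at a known position and every variable-size field is reachable through its offset, a reader can pull out a single value without decoding the rest of the message.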
Suggestions
- I feel that Protocol Buffers are probably our best choice for the moment.
- I feel that SSZ is better suited to the crypto / Ethereum space, and it should be even faster than protobuf. But the tooling might be less friendly for enterprises.
Why not Avro? Kafka support is already integrated, correct?
(Sent by email in reply to Samuel Stokes's comment of Thu, Aug 13, 2020, 12:28 PM.)
@Kasshern @skosito @Ybittan @biscuitdey This discussion might be relevant for our work on the SRI. Keeping it open so that we can discuss.