Alternative interchange formats
The list I've been meaning to look through/support:

- plain old `json` (if peeps want it over `msgpack`)
- blosc
- apache arrow
- of course protocol buffers from google
- capnproto, thx to @salotz for pointing out this one - there's a well maintained python package, `pycapnp` (see the sketch just below)

Maybe more?
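For a quick feel of that last option, here's a minimal round-trip sketch with `pycapnp`. The schema, file name, and file ID are all made up for illustration; nothing here is tractor-specific:

```python
# Minimal pycapnp round-trip sketch. The schema below is a hypothetical
# example, written out to a file since capnp schemas are loaded from disk.
import pathlib
import capnp

pathlib.Path("point.capnp").write_text("""
@0xbf5147cbbecf40c1;  # arbitrary example file ID
struct Point {
  x @0 :Float64;
  y @1 :Float64;
}
""")

capnp.remove_import_hook()
point_capnp = capnp.load("point.capnp")

# Build and serialize a message for transport over a channel
msg = point_capnp.Point.new_message(x=1.0, y=2.0)
wire = msg.to_bytes()

# Deserialize on the receiving end
with point_capnp.Point.from_bytes(wire) as point:
    assert point.x == 1.0 and point.y == 2.0
```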
We'll need to abstract the channel API to take in different stream types. This work will require coordination with the alt transport work in #19.
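As a rough sketch of what that abstraction could look like (names here are purely illustrative, not tractor's actual API), the channel would be parametrized over a codec interface rather than hard-coding one format:

```python
# Hypothetical pluggable-codec interface a Channel could be built around.
from abc import ABC, abstractmethod
from typing import Any

import msgpack


class Codec(ABC):
    """A (de)serializer the channel can be parametrized with."""

    @abstractmethod
    def encode(self, obj: Any) -> bytes: ...

    @abstractmethod
    def decode(self, data: bytes) -> Any: ...


class MsgpackCodec(Codec):
    """The current default wire format, expressed against the interface."""

    def encode(self, obj: Any) -> bytes:
        return msgpack.packb(obj)

    def decode(self, data: bytes) -> Any:
        return msgpack.unpackb(data)


# A channel would then take a codec instead of assuming msgpack, e.g.:
# chan = Channel(stream, codec=MsgpackCodec())
```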
Here is an extremely good write-up on the shortcomings of pandas from the original author, with links to many other great resources.
Apache arrow seems very much to be a solution to many of pandas' prior memory-constraint and inter-process ailments with big data. I haven't dug too much into recent developments, but this article seems like a good entrypoint.
Anyone wanting to take a look at the ipc section in `pyarrow` might be able to get something cool going quickly!
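For anyone picking that up, here's a minimal sketch of the Arrow IPC stream format round-tripping a record batch, using only the public `pyarrow` API (the in-memory buffer stands in for whatever channel/socket the bytes would actually cross):

```python
# Round-trip a record batch through the Arrow IPC stream format.
import pyarrow as pa

# Build a record batch to send
batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])

# Serialize to the IPC stream format (these bytes would go over a channel)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
wire = sink.getvalue()

# Deserialize on the receiving side (zero-copy where possible)
reader = pa.ipc.open_stream(wire)
for received in reader:
    print(received.to_pydict())  # {'x': [1, 2, 3]}
```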
You might be interested in this as well: https://github.com/real-logic/aeron
and the binary encoding it uses: https://github.com/real-logic/simple-binary-encoding
It's designed for extremely low latency trading systems. There is a C++ implementation, though no python interface atm. Not sure exactly what sauce they're using that's better than, say, capnproto.
All of them are probably useful in different situations, which complicates things...
Blosc AFAIK is just a compression algorithm. Still useful, and it can be used transparently (would require intelligence about when data is moving over I/O), but perhaps it should be a user-level thing. My suspicion is that Arrow has compression specifically accounted for, although I don't know.
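To make the "transparent layer" point concrete, here's a minimal sketch wrapping blosc around an existing codec. It assumes the `python-blosc` and `msgpack` packages; `msgpack` is just a stand-in for whatever interchange format ends up underneath:

```python
# Transparent blosc compression around an arbitrary byte-oriented codec.
import blosc
import msgpack  # stand-in for the underlying interchange format

payload = msgpack.packb({"x": list(range(1000))})

# Compress before the bytes hit I/O...
wire = blosc.compress(payload, typesize=1)

# ...and decompress on the receiving side before decoding.
data = msgpack.unpackb(blosc.decompress(wire))
assert data == {"x": list(range(1000))}
```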
For the sake of interestingness, although it's likely of no use to us: https://kaitai.io/
Also #8 mentions `msgpack-numpy`. While not a new interchange format, it is a system worth comparing against when considering alternatives.
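For reference, this is the documented `msgpack-numpy` usage that any alternative would be benchmarked against:

```python
# Baseline: numpy arrays over msgpack via msgpack-numpy's encode/decode hooks.
import numpy as np
import msgpack
import msgpack_numpy as m

arr = np.arange(10, dtype=np.float64)
packed = msgpack.packb(arr, default=m.encode)
unpacked = msgpack.unpackb(packed, object_hook=m.decode)
assert np.array_equal(arr, unpacked)
```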
Interesting historical format: SBE (simple binary encoding), which is (was?) used in financial systems. From their docs:

> The end result of applying these design principles is a codec that has ~16-25 times greater throughput than Google Protocol Buffers (GPB) with very low and predictable latency. This has been observed in micro-benchmarks and real-world application use. A typical market data message can be encoded, or decoded, in ~25ns compared to ~1000ns for the same message with GPB on the same hardware. XML and FIX tag value messages are orders of magnitude slower again.
>
> The sweet spot for SBE is as a codec for structured data that is mostly fixed size fields which are numbers, bitsets, enums, and arrays. While it does work for strings and blobs, many may find some of the restrictions a usability issue. These users would be better off with another codec more suited to string encoding.
Sounds like it would need to be compared with capnproto - haven't dug into any libs yet.