
Alternative interchange formats

Open goodboy opened this issue 5 years ago • 5 comments

The list I've been meaning to look through/support:

Maybe more?

We'll need to abstract the channel API so it can accept different stream types. This work will require coordination with the alt transport work in #19.
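A rough sketch of what such an abstraction could look like, assuming the channel only needs an object with `encode`/`decode` methods. The names `Codec`, `JSONCodec`, `PickleCodec` and `Channel` here are hypothetical and not tractor's actual API; the "wire" is an in-memory queue standing in for a real transport.

```python
import json
import pickle
from collections import deque
from typing import Any, Protocol


class Codec(Protocol):
    """Hypothetical interface: turn objects into bytes and back."""

    def encode(self, obj: Any) -> bytes: ...

    def decode(self, data: bytes) -> Any: ...


class JSONCodec:
    def encode(self, obj: Any) -> bytes:
        return json.dumps(obj).encode("utf-8")

    def decode(self, data: bytes) -> Any:
        return json.loads(data.decode("utf-8"))


class PickleCodec:
    def encode(self, obj: Any) -> bytes:
        return pickle.dumps(obj)

    def decode(self, data: bytes) -> Any:
        return pickle.loads(data)


class Channel:
    """Toy channel: frames go into an in-memory queue, not a socket."""

    def __init__(self, codec: Codec) -> None:
        self._codec = codec
        self._wire = deque()  # stand-in for the real transport

    def send(self, obj: Any) -> None:
        self._wire.append(self._codec.encode(obj))

    def recv(self) -> Any:
        return self._codec.decode(self._wire.popleft())


# The same channel code works with either interchange format.
chan = Channel(JSONCodec())
chan.send({"cmd": "ping", "args": [1, 2]})
print(chan.recv())
```

Keeping the codec separate from the transport would let the two pieces of work (this issue and #19) vary independently, which is the only design point this sketch is meant to illustrate.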

goodboy, Feb 17 '19 15:02

Here is an extremely good write-up on the shortcomings of pandas from its original author, with links to many other great resources.

Apache Arrow seems to be very much a solution to many of the memory-constraint and inter-process ailments of big data with pandas. I haven't dug too much into recent developments but this article seems like a good entry point.

Anyone wanting to take a look at the IPC section in pyarrow might be able to get something cool going quickly!

goodboy, Mar 01 '20 19:03

You might be interested in this as well: https://github.com/real-logic/aeron

and the binary encoding it uses: https://github.com/real-logic/simple-binary-encoding

Designed for extremely low latency trading systems. There is a C++ implementation, though no Python interface atm. Not sure exactly what sauce they are using that is better than, say, Cap'n Proto.

All of them are probably useful in different situations, which complicates things.

Blosc AFAIK is just a compression algorithm. Still useful, and can be used transparently (would require intelligence about when data is moving over I/O), but perhaps should be a user level thing. My suspicion is that Arrow has compression specifically accounted for, although I don't know.
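To make the "transparent, but needs some intelligence" idea concrete, here is a minimal sketch that uses the standard library's `zlib` as a stand-in for Blosc. The one-byte tag, the `MIN_SIZE` threshold and the `wrap`/`unwrap` names are all assumptions made up for illustration.

```python
import zlib

RAW = b"\x00"   # hypothetical tag: payload sent as-is
ZLIB = b"\x01"  # hypothetical tag: payload is zlib-compressed
MIN_SIZE = 512  # assumed threshold below which compressing is not worth it


def wrap(payload: bytes) -> bytes:
    """Compress only when the payload is large and actually shrinks."""
    if len(payload) >= MIN_SIZE:
        packed = zlib.compress(payload)
        if len(packed) < len(payload):
            return ZLIB + packed
    return RAW + payload


def unwrap(frame: bytes) -> bytes:
    """Undo wrap() by inspecting the leading tag byte."""
    tag, body = frame[:1], frame[1:]
    return zlib.decompress(body) if tag == ZLIB else body


small = b"ping"
big = b"a" * 4096
assert unwrap(wrap(small)) == small
assert unwrap(wrap(big)) == big
print(len(wrap(small)), len(wrap(big)))
```

Whether this belongs in the library or in user code is exactly the open question above; the sketch only shows that the decision can be hidden behind a pair of functions.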

salotz, May 29 '20 20:05

For the sake of interest, although it's likely of no use to us: https://kaitai.io/

salotz, May 29 '20 21:05

Also #8 mentions msgpack-numpy.

While not a new interchange format, it is a system worth comparing against when considering alternatives.

goodboy, Jun 05 '20 17:06

An interesting historical format: SBE (Simple Binary Encoding), which is (was?) used in financial systems.

The end result of applying these design principles is a codec that has ~16-25 times greater throughput than Google Protocol Buffers (GPB) with very low and predictable latency. This has been observed in micro-benchmarks and real-world application use. A typical market data message can be encoded, or decoded, in ~25ns compared to ~1000ns for the same message with GPB on the same hardware. XML and FIX tag value messages are orders of magnitude slower again.

The sweet spot for SBE is as a codec for structured data that is mostly fixed size fields which are numbers, bitsets, enums, and arrays. While it does work for strings and blobs, many may find some of the restrictions a usability issue. These users would be better off with another codec more suited to string encoding.
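The "mostly fixed size fields" point can be illustrated with Python's standard `struct` module. This is not SBE and does not follow any SBE schema; the message layout and field names below are invented purely to show why a fixed layout is cheap to encode and decode.

```python
import struct

# Hypothetical quote message, little-endian with no padding:
#   u32 instrument id | i64 price in ticks | u32 quantity | u8 side
QUOTE = struct.Struct("<IqIB")
PRICE_OFFSET = 4  # price always starts right after the 4-byte instrument id


def encode_quote(instrument_id: int, price_ticks: int, quantity: int, side: int) -> bytes:
    """Pack the four fields into a fixed-size buffer."""
    return QUOTE.pack(instrument_id, price_ticks, quantity, side)


def decode_quote(buf: bytes) -> tuple:
    """Unpack the whole message in one call."""
    return QUOTE.unpack(buf)


def peek_price(buf: bytes) -> int:
    """Read a single field at its known offset without decoding the rest."""
    return struct.unpack_from("<q", buf, PRICE_OFFSET)[0]


msg = encode_quote(42, 1_234_500, 100, 1)
assert len(msg) == QUOTE.size
assert decode_quote(msg) == (42, 1_234_500, 100, 1)
assert peek_price(msg) == 1_234_500
```

Because every field sits at a known offset, there is no parsing step and no per-field length prefix. Variable-length strings and blobs break that property, which is consistent with the caveat in the quoted text.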

Sounds like it would need to be compared with capnproto - haven't dug into any libs yet.

goodboy, Apr 01 '21 16:04